Spectral analysis of matrix scaling and operator scaling

Tsz Chiu Kwok; Lap Chi Lau; Akshay Ramachandran

arXiv:1904.03213·cs.DS·April 9, 2019

TL;DR

This paper provides a spectral analysis of matrix and operator scaling, showing linear convergence of gradient flows under spectral gap conditions, with implications for various applications in mathematics and quantum information.

Contribution

It introduces a spectral gap condition that guarantees linear convergence of gradient methods for matrix and operator scaling, and derives bounds relevant for multiple applications.

Findings

1. Gradient flow converges linearly with spectral gap
2. Bounds on condition number and capacity derived
3. Applications include expander graphs and quantum information

Abstract

We present a spectral analysis for matrix scaling and operator scaling. We prove that if the input matrix or operator has a spectral gap, then a natural gradient flow has linear convergence. This implies that a simple gradient descent algorithm also has linear convergence under the same assumption. The spectral gap condition for operator scaling is closely related to the notion of quantum expander studied in quantum information theory. The spectral analysis also provides bounds on some important quantities of the scaling problems, such as the condition number of the scaling solution and the capacity of the matrix and operator. These bounds can be used in various applications of scaling problems, including matrix scaling on expander graphs, permanent lower bounds on random matrices, the Paulsen problem on random frames, and Brascamp-Lieb constants on random operators. In some…

Equations

$$\Phi_{{\cal A}}(X)=\sum_{i=1}^{k}A_{i}XA_{i}^{*},$$

$$(1-{\epsilon})\frac{s({\cal A})}{m}I_{m}\preceq\sum_{i=1}^{k}A_{i}A_{i}^{*}\preceq(1+{\epsilon})\frac{s({\cal A})}{m}I_{m}\quad\text{and}\quad(1-{\epsilon})\frac{s({\cal A})}{n}I_{n}\preceq\sum_{i=1}^{k}A_{i}^{*}A_{i}\preceq(1+{\epsilon})\frac{s({\cal A})}{n}I_{n},$$

$$\sum_{i=1}^{k}(LA_{i}R)(LA_{i}R)^{*}=\frac{I_{m}}{m}\quad\text{and}\quad\sum_{i=1}^{k}(LA_{i}R)^{*}(LA_{i}R)=\frac{I_{n}}{n},$$

$$\Delta({\cal A})=\frac{1}{m}\left\lVert s({\cal A})\cdot I_{m}-m\sum_{i=1}^{k}A_{i}A_{i}^{*}\right\rVert_{F}^{2}+\frac{1}{n}\left\lVert s({\cal A})\cdot I_{n}-n\sum_{i=1}^{k}A_{i}^{*}A_{i}\right\rVert_{F}^{2}.$$

$$\Delta(B)=\frac{1}{m}\sum_{i=1}^{m}(s-mr_{i})^{2}+\frac{1}{n}\sum_{j=1}^{n}(s-nc_{j})^{2},$$

$$\frac{d}{dt}A_{i}:=\Big(s({\cal A})\cdot I_{m}-m\sum_{j=1}^{k}A_{j}A_{j}^{*}\Big)A_{i}+A_{i}\Big(s({\cal A})\cdot I_{n}-n\sum_{j=1}^{k}A_{j}^{*}A_{j}\Big)\quad\text{for }1\leq i\leq k.$$

$$\frac{d}{dt}B_{ij}=2\big((s-mr_{i})+(s-nc_{j})\big)\cdot B_{ij}.$$

$$M_{{\cal A}}:=\sum_{i=1}^{k}A_{i}\otimes A_{i},$$

$$\sigma_{2}(M_{{\cal A}})\leq(1-\lambda)\frac{s({\cal A})}{\sqrt{mn}},$$

$$\frac{s({\cal A})}{\sqrt{mn}}\leq\sigma_{1}(M_{{\cal A}})\leq(1+{\epsilon})\frac{s({\cal A})}{\sqrt{mn}},$$

$$\sigma_{2}(B)\leq(1-\lambda)\frac{s(B)}{\sqrt{mn}}.$$

$$\Delta^{(t)}\leq\Delta^{(0)}e^{-\lambda s^{(0)}t}\quad\text{for any }t\geq 0.$$

$$\kappa(L)\leq 1+O\left(\frac{{\epsilon}\log m}{\lambda}\right)\quad\text{and}\quad\kappa(R)\leq 1+O\left(\frac{{\epsilon}\log m}{\lambda}\right).$$

$${\rm cap}({\cal A}):=\inf_{X\succ 0}\frac{m\det\left(\sum_{i=1}^{k}A_{i}XA_{i}^{*}\right)^{1/m}}{\det(X)^{1/n}}.$$

$${\rm cap}(B):=\inf_{x\in\mathbb{R}^{n}:x>0}\frac{m\Big(\prod_{i=1}^{m}(Bx)_{i}\Big)^{1/m}}{\left(\prod_{j=1}^{n}x_{j}\right)^{1/n}}.$$

$$s({\cal A})\geq{\rm cap}({\cal A})\geq(1-mn{\epsilon})\,s({\cal A}).$$

$$s({\cal A})\geq{\rm cap}({\cal A})\geq\left(1-\frac{4{\epsilon}^{2}}{\lambda}\right)s({\cal A}).$$

$$\operatorname{per}(B)\geq\exp\left(-n\left(1+\Theta\left(\frac{{\epsilon}^{2}}{\lambda}\right)\right)\right).$$

$$(1-{\epsilon})I_{d}\preceq\sum_{i=1}^{n}u_{i}u_{i}^{*}\preceq(1+{\epsilon})I_{d}\quad\text{and}\quad(1-{\epsilon})\frac{d}{n}\leq\left\lVert u_{i}\right\rVert_{2}^{2}\leq(1+{\epsilon})\frac{d}{n}\quad\text{for }1\leq i\leq n,$$

$$\max_{i\neq j}\langle v_{i},v_{j}\rangle^{2}\leq O\left(\frac{\log^{3}d}{d}\right).$$

$$\int_{x\in\mathbb{R}^{n}}\prod_{j=1}^{m}\Big(f_{j}(B_{j}x)\Big)^{p_{j}}dx\leq C\prod_{j=1}^{m}\left(\int_{x_{j}\in\mathbb{R}^{n_{j}}}f_{j}(x_{j})dx_{j}\right)^{p_{j}}.$$

$$1\leq{\rm BL}({\bf B},{\bf p})\leq\left(1-\frac{4{\epsilon}^{2}}{\lambda}\right)^{-n/2}\leq\exp\left(\Theta\left(\frac{n{\epsilon}^{2}}{\lambda}\right)\right).$$

$${\rm cap}({\cal A})=\sup_{x\in\mathbb{R}^{n}:x>0}\frac{d\left(\det\left(\sum_{j=1}^{m}x_{j}u_{j}u_{j}^{*}\right)\right)^{1/d}}{\left(\prod_{j=1}^{m}x_{j}\right)^{1/m}},$$

$$-\frac{1}{4}\frac{d}{dt}\Delta=\sum_{i=1}^{m}(s-mr_{i})^{2}r_{i}+\sum_{j=1}^{n}(s-nc_{j})^{2}c_{j}+2\sum_{i=1}^{m}\sum_{j=1}^{n}(s-mr_{i})(s-nc_{j})B_{ij},$$

$$\Phi_{{\cal A}}(Y)=\sum_{i=1}^{k}A_{i}YA_{i}^{*}\quad\text{and}\quad\Phi_{{\cal A}}^{*}(X)=\sum_{i=1}^{k}A_{i}^{*}XA_{i},$$

$$M_{{\cal A}}\cdot\operatorname{vec}(Y)=\operatorname{vec}(\Phi(Y)),$$

$$\Phi_{M}(\operatorname{mat}(y))=\operatorname{mat}(M_{{\cal A}}\cdot y),$$

$$M_{{\cal A}}=\sum_{i=1}^{k}A_{i}\otimes A_{i}.$$

$$\sigma_{1}(\Phi_{{\cal A}}):=\max_{Y\in\mathbb{R}^{n\times n}}\frac{\left\lVert\Phi(Y)\right\rVert_{F}}{\left\lVert Y\right\rVert_{F}}=\max_{y\in\mathbb{R}^{n^{2}}}\frac{\left\lVert M_{{\cal A}}\cdot y\right\rVert_{2}}{\left\lVert y\right\rVert_{2}}=\sigma_{1}(M_{{\cal A}}),$$

$$\sigma_{2}(\Phi_{{\cal A}}):=\max_{Y\in\mathbb{R}^{n\times n},\,\langle Y,Y_{1}\rangle=0}\frac{\left\lVert\Phi(Y)\right\rVert_{F}}{\left\lVert Y\right\rVert_{F}}=\max_{y\in\mathbb{R}^{n^{2}},\,y\perp y_{1}}\frac{\left\lVert M_{{\cal A}}\cdot y\right\rVert_{2}}{\left\lVert y\right\rVert_{2}}=\sigma_{2}(M_{{\cal A}}).$$

Full text

Spectral Analysis of Matrix Scaling and Operator Scaling

Tsz Chiu Kwok (Institute for Theoretical Computer Science, Shanghai University of Finance and Economics. Part of the work was done at University of Waterloo as a postdoctoral researcher. Partially supported by NSERC Discovery Grant 2950-120715 and NSERC Accelerator Supplement 2950-120719.)

Lap Chi Lau (School of Computer Science, University of Waterloo. Supported by NSERC Discovery Grant 2950-120715 and NSERC Accelerator Supplement 2950-120719.)

Akshay Ramachandran (School of Computer Science at University of Waterloo. Supported by NSERC Discovery Grant 2950-120715 and NSERC Accelerator Supplement 2950-120719.)

We present a spectral analysis for matrix scaling and operator scaling. We prove that if the input matrix or operator has a spectral gap, then a natural gradient flow has linear convergence. This implies that a simple gradient descent algorithm also has linear convergence under the same assumption. The spectral gap condition for operator scaling is closely related to the notion of quantum expander studied in quantum information theory.

The spectral analysis also provides bounds on some important quantities of the scaling problems, such as the condition number of the scaling solution and the capacity of the matrix and operator. These bounds can be used in various applications of scaling problems, including matrix scaling on expander graphs, permanent lower bounds on random matrices, the Paulsen problem on random frames, and Brascamp-Lieb constants on random operators. In some applications, the inputs of interest satisfy the spectral condition and we prove significantly stronger bounds than the worst case bounds.

1 Introduction

In the matrix scaling problem, we are given a non-negative matrix $B\in\mathbb{R}^{n\times n}$ , and the goal is to find a left diagonal scaling matrix $L\in\mathbb{R}^{n\times n}$ and a right diagonal scaling matrix $R\in\mathbb{R}^{n\times n}$ such that $LBR$ is doubly stochastic (every row sum and every column sum is one), or report that such scaling matrices do not exist. This problem has been extensively studied in different communities; see [39] for a detailed survey.

The operator scaling problem is a significant generalization of the matrix scaling problem. Given a tuple of $m\times n$ real matrices ${\cal A}=(A_{1},\ldots,A_{k})$ where $A_{i}\in\mathbb{R}^{m\times n}$ for $1\leq i\leq k$ , a linear operator $\Phi_{{\cal A}}:\mathbb{R}^{n\times n}\to\mathbb{R}^{m\times m}$ is defined as

$$\Phi_{{\cal A}}(X)=\sum_{i=1}^{k}A_{i}XA_{i}^{*},$$

where $A_{i}^{*}$ denotes the conjugate transpose of $A_{i}$ which is just the transpose when $A_{i}$ is real. We will simply refer to ${\cal A}$ as an operator. The size of an operator ${\cal A}$ is defined as $s({\cal A}):=\sum_{i=1}^{k}\left\lVert A_{i}\right\rVert_{F}^{2},$ where $\left\lVert\cdot\right\rVert_{F}$ denotes the Frobenius norm of a matrix. An operator ${\cal A}$ is called ${\epsilon}$ -nearly doubly balanced if

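$$(1-{\epsilon})\frac{s({\cal A})}{m}I_{m}\preceq\sum_{i=1}^{k}A_{i}A_{i}^{*}\preceq(1+{\epsilon})\frac{s({\cal A})}{m}I_{m}\quad\text{and}\quad(1-{\epsilon})\frac{s({\cal A})}{n}I_{n}\preceq\sum_{i=1}^{k}A_{i}^{*}A_{i}\preceq(1+{\epsilon})\frac{s({\cal A})}{n}I_{n},$$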

and is called doubly balanced when ${\epsilon}=0$ . The operator scaling problem is defined by Gurvits [29]. The objective is to scale the input operator so that it becomes doubly balanced with size one.

Definition 1.1 (Operator Scaling Problem).

Input: An operator ${\cal A}=(A_{1},\ldots,A_{k})$ where $A_{i}\in\mathbb{R}^{m\times n}$ for $1\leq i\leq k$.
Output: A left scaling matrix $L\in\mathbb{R}^{m\times m}$ and a right scaling matrix $R\in\mathbb{R}^{n\times n}$ such that

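$$\sum_{i=1}^{k}(LA_{i}R)(LA_{i}R)^{*}=\frac{I_{m}}{m}\quad\text{and}\quad\sum_{i=1}^{k}(LA_{i}R)^{*}(LA_{i}R)=\frac{I_{n}}{n},$$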

or report that such scaling matrices $L,R$ do not exist.

There is a simple reduction from the matrix scaling problem to the operator scaling problem, by having one matrix $A_{ij}\in\mathbb{R}^{n\times n}$ for each entry $B_{ij}$ with the $(i,j)$ -entry of $A_{ij}$ being $\sqrt{B_{ij}}$ and all other entries zero; see Section 4.1 for details.
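The reduction can be checked numerically. The sketch below is illustrative only: the helper names (`matrix_to_operator`, `matmul`, `transpose`, `sum_mats`) and the example matrix are assumptions, not taken from the paper. It builds one matrix per entry of $B$ and confirms that $\sum A_{ij}A_{ij}^{*}$ is the diagonal matrix of row sums of $B$ and $\sum A_{ij}^{*}A_{ij}$ is the diagonal matrix of column sums, which is why balancing the operator amounts to balancing the rows and columns of $B$.

```python
import math

def matrix_to_operator(B):
    """One matrix per entry of B: sqrt(B[i][j]) at position (i, j), zeros elsewhere."""
    m, n = len(B), len(B[0])
    ops = []
    for i in range(m):
        for j in range(n):
            A = [[0.0] * n for _ in range(m)]
            A[i][j] = math.sqrt(B[i][j])
            ops.append(A)
    return ops

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def sum_mats(mats):
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(M[i][j] for M in mats) for j in range(cols)] for i in range(rows)]

# Hypothetical example matrix (perfect squares keep the arithmetic exact).
B = [[1.0, 4.0], [9.0, 16.0]]
ops = matrix_to_operator(B)
left = sum_mats([matmul(A, transpose(A)) for A in ops])   # sum of A A^*
right = sum_mats([matmul(transpose(A), A) for A in ops])  # sum of A^* A
print(left)   # diagonal holds the row sums of B
print(right)  # diagonal holds the column sums of B
```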

The operator scaling problem generalizes matrix scaling and frame scaling and has many applications; see Section 1.4 and Section 4. Much work has been done in analyzing algorithms for these scaling problems and in understanding the scaling solutions and related quantities.

1.1 Previous Algorithms

For matrix scaling, the best-known algorithm is Sinkhorn’s algorithm [54], a simple iterative algorithm that alternately rescales the rows and the columns. This algorithm is analyzed in [18], where it is shown that the alternating algorithm finds an $\eta$-nearly doubly stochastic scaling in time polynomial in $n$ and $1/\eta$.
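A minimal sketch of the alternating row/column rescaling, assuming a strictly positive square input so that every row and column sum stays nonzero. The function name `sinkhorn`, the fixed iteration count, and the example matrix are illustrative choices, not from the paper.

```python
def sinkhorn(B, iterations=200):
    """Alternately normalise the rows and the columns of a positive square matrix."""
    M = [row[:] for row in B]
    n = len(M)
    for _ in range(iterations):
        # Row step: divide each row by its sum.
        for i in range(n):
            r = sum(M[i])
            M[i] = [x / r for x in M[i]]
        # Column step: divide each column by its sum.
        col = [sum(M[i][j] for i in range(n)) for j in range(n)]
        M = [[M[i][j] / col[j] for j in range(n)] for i in range(n)]
    return M

B = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical input
S = sinkhorn(B)
row_sums = [sum(row) for row in S]
col_sums = [sum(S[i][j] for i in range(2)) for j in range(2)]
print(row_sums, col_sums)  # both should be close to [1, 1]
```

The loop ends on a column step, so the column sums are one up to rounding while the row sums are only approximately one; the gap between them is the quantity an accuracy parameter such as $\eta$ controls.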

The alternating scaling algorithm is generalized in [29] for the operator scaling problem. In this algorithm, we alternately find a left scaling matrix $L=(\sum_{i}A_{i}A_{i}^{*})^{-1/2}$ and set $A_{i}\leftarrow LA_{i}$ so that the first condition of doubly balanced is satisfied, and a right scaling matrix $R=(\sum_{i}A_{i}^{*}A_{i})^{-1/2}$ and set $A_{i}\leftarrow A_{i}R$ so that the second condition of doubly balanced is satisfied, and repeat. This alternating algorithm is partially analyzed in [29] and is fully analyzed in [20, 19].

Theorem 1.2 ([54, 18, 20, 19]).

The alternating scaling algorithm returns an $\eta$ -nearly doubly balanced scaling in $O(\operatorname{poly}(n,m,k,1/\eta))$ iterations if such a scaling exists.

This theorem is used in [20, 19] to give the first polynomial time algorithm for computing the non-commutative rank of a symbolic matrix, as it is sufficient to set $\eta$ to be inverse polynomial in $n$ to solve that problem exactly. For some applications, however, a better dependence on $\eta$ than $\operatorname{poly}(1/\eta)$ is required.

For matrix scaling, there are several algorithms with dependency on $\eta$ being $\log(1/\eta)$ , including the ellipsoid method in [40], the interior point method in [51], and a strongly polynomial time combinatorial algorithm in [47]. The dependency on $n$ in these algorithms is at least $\Omega(n^{7/2})$ even for sparse matrices. Recently, two independent groups [13, 2] developed a fast second order method for matrix scaling, and this method is extended to geodesic convex optimization in [1] for the operator scaling problem.

Theorem 1.3 ([13, 2, 1]).

There is a second order method to return an $\eta$ -nearly doubly balanced scaling in time $O(\operatorname{poly}(n,m,k,\log(1/\eta)))$ for operator scaling, and in time $O(\left\lVert B\right\rVert_{0}\log\kappa\log^{2}(1/\eta))$ for matrix scaling where $\left\lVert B\right\rVert_{0}$ denotes the number of nonzero entries in $B$ and $\kappa$ denotes the condition number of the scaling solution.

For matrix scaling, this theorem can be used to obtain a fast deterministic $e^{-n}$ approximation algorithm for the permanent of a matrix [47]. For operator scaling, this theorem is used to obtain a polynomial time algorithm for an orbit intersection problem in invariant theory [1].

1.2 Gradient Flow

An important quantity in [29, 20, 1] to measure the progress of the algorithms is the $\ell_{2}$ -error of the current solution. Given an operator ${\mathcal{A}}=(A_{1},\ldots,A_{k})$ where $A_{i}\in\mathbb{R}^{m\times n}$ for $1\leq i\leq k$ , define

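$$\Delta({\cal A})=\frac{1}{m}\left\lVert s({\cal A})\cdot I_{m}-m\sum_{i=1}^{k}A_{i}A_{i}^{*}\right\rVert_{F}^{2}+\frac{1}{n}\left\lVert s({\cal A})\cdot I_{n}-n\sum_{i=1}^{k}A_{i}^{*}A_{i}\right\rVert_{F}^{2}.$$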

Note that $\Delta({\cal A})=0$ if and only if ${\cal A}$ is doubly balanced. In the matrix scaling problem for a general $m\times n$ matrix, where the objective is to scale the input matrix $B$ so that every row sum is the same and every column sum is the same, this definition simplifies to

$$\Delta(B)=\frac{1}{m}\sum_{i=1}^{m}(s-mr_{i})^{2}+\frac{1}{n}\sum_{j=1}^{n}(s-nc_{j})^{2},$$

where $r_{i}$ and $c_{j}$ are the $i$ -th row sum and the $j$ -th column sum of the matrix $B$ , and $s=\sum_{i=1}^{m}\sum_{j=1}^{n}B_{ij}$ is the size of the matrix $B$ .
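The matrix form of the error, $\Delta(B)=\frac{1}{m}\sum_{i}(s-mr_{i})^{2}+\frac{1}{n}\sum_{j}(s-nc_{j})^{2}$, is easy to evaluate directly. The sketch below uses a hypothetical helper `delta` and made-up inputs to show that it vanishes on a doubly balanced matrix and is positive otherwise.

```python
def delta(B):
    """l2-error of a non-negative m x n matrix, following the formula for Delta(B)."""
    m, n = len(B), len(B[0])
    r = [sum(row) for row in B]                             # row sums
    c = [sum(B[i][j] for i in range(m)) for j in range(n)]  # column sums
    s = sum(r)                                              # size of B
    return (sum((s - m * ri) ** 2 for ri in r) / m
            + sum((s - n * cj) ** 2 for cj in c) / n)

balanced = delta([[1.0, 1.0], [1.0, 1.0]])    # all row and column sums equal
unbalanced = delta([[2.0, 0.0], [0.0, 1.0]])  # row sums 2 and 1
print(balanced, unbalanced)
```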

A continuous version of the alternating algorithm for operator scaling is studied in [45], where both operations are done simultaneously and continuously. The following differential equation describes how ${\cal A}$ changes over time:

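$$\frac{d}{dt}A_{i}:=\Big(s({\cal A})\cdot I_{m}-m\sum_{j=1}^{k}A_{j}A_{j}^{*}\Big)A_{i}+A_{i}\Big(s({\cal A})\cdot I_{n}-n\sum_{j=1}^{k}A_{j}^{*}A_{j}\Big)\quad\text{for }1\leq i\leq k.$$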

In the matrix case, this continuous scaling algorithm simplifies to

$$\frac{d}{dt}B_{ij}=2\big((s-mr_{i})+(s-nc_{j})\big)\cdot B_{ij}.$$

The continuous operator scaling algorithm is developed to bound the “total movement” of the operator in order to solve the Paulsen problem in [45]. Its convergence rate is shown to be similar to that of the alternating scaling algorithm, with dependency on $\eta$ being $1/\eta$ .
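To see the matrix flow $\frac{d}{dt}B_{ij}=2\big((s-mr_{i})+(s-nc_{j})\big)B_{ij}$ in action, one can discretize it with a forward Euler step. This is a sketch under assumptions: the step size `h=0.05`, the number of steps, and the example matrix are arbitrary choices for illustration, and the example is not claimed to satisfy the hypotheses of any theorem in the paper.

```python
def row_col_sums(B):
    m, n = len(B), len(B[0])
    r = [sum(row) for row in B]
    c = [sum(B[i][j] for i in range(m)) for j in range(n)]
    return r, c, sum(r)

def delta(B):
    """l2-error Delta(B) of the current matrix."""
    m, n = len(B), len(B[0])
    r, c, s = row_col_sums(B)
    return (sum((s - m * x) ** 2 for x in r) / m
            + sum((s - n * y) ** 2 for y in c) / n)

def euler_step(B, h):
    """One forward Euler step of dB_ij/dt = 2((s - m r_i) + (s - n c_j)) B_ij."""
    m, n = len(B), len(B[0])
    r, c, s = row_col_sums(B)
    return [[B[i][j] * (1 + 2 * h * ((s - m * r[i]) + (s - n * c[j])))
             for j in range(n)] for i in range(m)]

B = [[0.4, 0.1], [0.2, 0.3]]  # hypothetical input: balanced rows, unbalanced columns
history = [delta(B)]
for _ in range(400):
    B = euler_step(B, h=0.05)
    history.append(delta(B))
print(history[0], history[1], history[-1])
```

Each entry is multiplied by a factor close to one, so entries stay positive for small steps; entries in overweight rows and columns shrink while those in underweight ones grow, which is the continuous analogue of rescaling rows and columns simultaneously.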

The continuous operator scaling algorithm can be understood as a natural first order method for the operator scaling problem. As we will show in Lemma A.1 in Appendix A, the dynamical system in continuous operator scaling is equivalent to the gradient flow (or continuous gradient descent) that always moves in the direction of minimizing $\Delta({\cal A})$ at each time. This shows a close connection between gradient descent and the alternating algorithm.

This gradient flow was studied in much greater generality in symplectic geometry and algebraic geometry (see [41, 27]). After a long line of work [3, 25, 26, 43, 42], Kirwan proved that the image of the moment map of a Hamiltonian group action on a symplectic manifold is a convex polytope. To prove this, Kirwan uses the norm-square of the moment map (which in our setting is exactly $\Delta({\cal A})$ ), and studies critical points of this function in order to understand the image of the moment map (where a point is critical for $\Delta({\cal A})$ exactly when it is a fixed point of the gradient flow). The current result as well as the result in [45] can be seen as quantitative convergence analyses in the neighborhoods of fixed points of this natural gradient flow in the operator scaling setting. It is an interesting direction to extend our result to the above general setting.

1.3 Contributions

In this paper, we analyze this gradient flow for the operator scaling problem. We identify a natural spectral condition under which the gradient flow converges in time $t=O(\log(1/\eta))$ (corresponding to the number of iterations in the alternating algorithm) where $\eta$ is the output accuracy. The spectral condition is closely related to the notion of “quantum expander” and is satisfied in many random instances. A key feature of our approach is that it also provides bounds on some important mathematical quantities such as the condition number of the scaling solution and the capacity of the matrix and operator. These bounds can be used in various applications of the operator scaling problem to show significantly stronger results for inputs that satisfy the spectral condition such as random matrices and random frames. We remark that the new results in various applications cannot be obtained through previous work (e.g. the fast algorithm for operator scaling in [1]), as the analyses of previous algorithms do not provide mathematical bounds for the condition number of the scaling solution and the operator capacity.

Spectral Condition

We first state the spectral condition in the general operator setting.

Definition 1.4 (Spectral Gap Condition).

Given an operator ${\mathcal{A}}=(A_{1},\ldots,A_{k})$ where $A_{i}\in\mathbb{R}^{m\times n}$ for $1\leq i\leq k$ , define the $m^{2}\times n^{2}$ matrix

$$M_{{\cal A}}:=\sum_{i=1}^{k}A_{i}\otimes A_{i},$$

where $\otimes$ denotes the tensor product. The operator ${\cal A}$ is said to have a $\lambda$ -spectral gap if

$$\sigma_{2}(M_{{\cal A}})\leq(1-\lambda)\frac{s({\cal A})}{\sqrt{mn}},$$

where $\sigma_{2}(M_{{\cal A}})$ is the second largest singular value of $M_{{\cal A}}$ .

Note that the spectral condition can be checked in polynomial time through standard eigenvalue computation.
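The matrix $M_{{\cal A}}=\sum_{i}A_{i}\otimes A_{i}$ represents the map $Y\mapsto\sum_{i}A_{i}YA_{i}^{*}$ once matrices are flattened into vectors. The sketch below verifies this on a small integer example; the helper names, the row-major flattening convention in `vec`, and the example operator are assumptions made for illustration.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def vec(Y):
    """Row-major flattening (an assumed convention)."""
    return [x for row in Y for x in row]

def matrix_of_operator(ops):
    """M_A = sum_i A_i (x) A_i for real matrices A_i."""
    M = kron(ops[0], ops[0])
    for A in ops[1:]:
        M = add(M, kron(A, A))
    return M

def phi(ops, Y):
    """Phi(Y) = sum_i A_i Y A_i^T for real matrices A_i."""
    out = matmul(matmul(ops[0], Y), transpose(ops[0]))
    for A in ops[1:]:
        out = add(out, matmul(matmul(A, Y), transpose(A)))
    return out

ops = [[[1, 2], [0, 1]], [[0, 1], [1, 0]]]  # hypothetical operator with k = 2
Y = [[1, 2], [3, 4]]
M = matrix_of_operator(ops)
y = vec(Y)
lhs = [sum(M[r][c] * y[c] for c in range(len(y))) for r in range(len(M))]
rhs = vec(phi(ops, Y))
print(lhs, rhs)
```

With $M_{{\cal A}}$ in hand, checking the spectral gap reduces to computing its two largest singular values with any standard numerical routine.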

The matrix $M_{{\cal A}}$ associated with ${\cal A}$ is studied in the quantum information theory literature (e.g. [61]), as the natural matrix representation of the completely positive map $\Phi(X):=\sum_{i}A_{i}XA_{i}^{*}$ defined by ${\cal A}$ . It can be shown that the largest singular value of $M_{{\cal A}}$ satisfies

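$$\frac{s({\cal A})}{\sqrt{mn}}\leq\sigma_{1}(M_{{\cal A}})\leq(1+{\epsilon})\frac{s({\cal A})}{\sqrt{mn}},$$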

when ${\cal A}$ is ${\epsilon}$ -nearly doubly balanced (Lemma 3.6). The spectral gap condition is also studied under the name of “quantum expander” in [7, 35]. We will discuss more about this spectral gap condition in Section 2.1 after some background on quantum information theory is reviewed.

For matrix scaling, given the input matrix $B\in\mathbb{R}^{m\times n}$ , the spectral gap condition is simply

$$\sigma_{2}(B)\leq(1-\lambda)\frac{s(B)}{\sqrt{mn}}.$$

If we interpret the input matrix $B$ as a weighted undirected bipartite graph, then the spectral gap condition is closely related to the expansion/conductance of the graph. We will explain more about these in Section 1.4.1 and in Section 4.1.
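As a small worked example (not from the paper), take the symmetric $2\times 2$ matrix with diagonal entries $a$ and off-diagonal entries $b$, and use the matrix form of the condition, $\sigma_{2}(B)\leq(1-\lambda)s(B)/\sqrt{mn}$, as quoted in Section 1.4.1. The calculation assumes $a,b\geq 0$ and $a+b>0$.

```latex
\begin{align*}
B &= \begin{pmatrix} a & b \\ b & a \end{pmatrix}, \qquad m=n=2, \qquad s(B)=2(a+b),\\
\sigma_{1}(B) &= a+b=\frac{s(B)}{\sqrt{mn}}, \qquad \sigma_{2}(B)=|a-b|,\\
\sigma_{2}(B)\leq(1-\lambda)\frac{s(B)}{\sqrt{mn}}
  &\iff \lambda\leq 1-\frac{|a-b|}{a+b}=\frac{2\min(a,b)}{a+b}.
\end{align*}
```

So $a=b$ gives the largest possible gap $\lambda=1$, while $b=0$ (where the bipartite graph splits into two disconnected edges) gives no gap at all, in line with the expansion interpretation.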

Linear Convergence

We prove that the gradient flow has linear convergence when the input satisfies the spectral gap condition.

Theorem 1.5 (Linear Convergence).

Given an operator ${\cal A}=(A_{1},\ldots,A_{k})$ where each $A_{i}\in\mathbb{R}^{m\times n}$ with $m\leq n$ , if ${\cal A}$ is ${\epsilon}$ -nearly doubly balanced and ${\cal A}$ satisfies the $\lambda$ -spectral gap condition in Definition 1.4 with $\lambda^{2}\geq C{\epsilon}\log m$ for a sufficiently large constant $C$ , then in the gradient flow,

$$\Delta^{(t)}\leq\Delta^{(0)}e^{-\lambda s^{(0)}t}\quad\text{for any }t\geq 0.$$

In particular, the gradient flow converges to a $\eta$ -nearly doubly balanced scaling in time $t=O\left(\frac{1}{\lambda}\log(\frac{m}{\eta})\right)$ , and such a scaling always exists under our assumptions.
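The logarithmic dependence on the accuracy can be read off from the exponential decay bound $\Delta^{(t)}\leq\Delta^{(0)}e^{-\lambda s^{(0)}t}$ of Theorem 1.5. The short calculation below is a sketch: $\delta$ is a hypothetical target value for $\Delta$, and the translation between $\delta$ and the accuracy parameter $\eta$ is part of the paper's analysis and is not reproduced here.

```latex
\begin{align*}
\Delta^{(0)}e^{-\lambda s^{(0)}t}\leq\delta
\quad\Longleftrightarrow\quad
t\geq\frac{1}{\lambda s^{(0)}}\ln\frac{\Delta^{(0)}}{\delta}.
\end{align*}
```

In particular, under this bound every additional $\ln(2)/(\lambda s^{(0)})$ units of time halve the guaranteed error, which is what linear convergence means here.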

By discretizing the gradient flow with step size $\Theta((m+n)^{-2})$, it follows that a natural gradient descent algorithm returns an $\eta$-nearly doubly stochastic scaling in time polynomial in the input size and logarithmic in $1/\eta$, when the input satisfies the spectral gap condition.

Corollary 1.6 (Gradient Descent).

Under the assumptions in Theorem 1.5, there is a gradient descent algorithm to return an $\eta$ -nearly doubly balanced scaling in $O\left(\frac{(n+m)^{2}}{\lambda}\log(\frac{m+n}{\eta})\right)$ iterations.

It is an interesting open question whether the alternating algorithm also has the same convergence rate as the gradient flow under the same assumptions. We believe that the answer is positive but we could not prove it yet.

Condition Number

The condition number of a scaling solution $L$ is defined as $\kappa(L):=\sigma_{\max}(L)/\sigma_{\min}(L)$, where $\sigma_{\max}(L)$ and $\sigma_{\min}(L)$ denote the largest and smallest singular values of $L$ respectively; $\kappa(R)$ is defined in the same way. For matrix scaling, $\kappa(L)$ is simply the ratio between the largest and the smallest diagonal entries of the diagonal matrix $L$.

In general, the condition numbers could be exponential in the input size. It is of interest to identify instances with small condition numbers as these are closely related to the performance of matrix/operator scaling algorithms (e.g. Theorem 1.3), but not much is known even in the simpler matrix scaling setting. Kalantari and Khachiyan [40] proved a bound for strictly positive matrices in terms of the ratio of the sum of the entries and the minimum entry. We show that the condition numbers are bounded by a small constant when the input satisfies the spectral gap condition (not necessarily strictly positive).

Theorem 1.7 (Condition Number).

Under the assumptions in Theorem 1.5, the condition number of the scaling solutions $L\in\mathbb{R}^{m\times m}$ and $R\in\mathbb{R}^{n\times n}$ satisfy

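$$\kappa(L)\leq 1+O\left(\frac{{\epsilon}\log m}{\lambda}\right)\quad\text{and}\quad\kappa(R)\leq 1+O\left(\frac{{\epsilon}\log m}{\lambda}\right).$$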

The condition number of the scaling solutions is used in bounding the time complexity of the scaling algorithms using the second order method [1, 13], in analyzing an approximation algorithm for permanent [53], and in bounding the optimal transport cost [14, 52]. We will discuss the implications of Theorem 1.7 to these applications in Section 4.

Operator Capacity

The capacity of an operator ${\cal A}$ is defined by Gurvits [29] as

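$${\rm cap}({\cal A}):=\inf_{X\succ 0}\frac{m\det\left(\sum_{i=1}^{k}A_{i}XA_{i}^{*}\right)^{1/m}}{\det(X)^{1/n}}.$$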

The capacity of a matrix $B\in\mathbb{R}^{m\times n}$ has a simpler form (Section 4.1.6) where

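$${\rm cap}(B):=\inf_{x\in\mathbb{R}^{n}:x>0}\frac{m\Big(\prod_{i=1}^{m}(Bx)_{i}\Big)^{1/m}}{\left(\prod_{j=1}^{n}x_{j}\right)^{1/n}}.$$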
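For a feel of the matrix capacity, whose objective is $m\big(\prod_{i}(Bx)_{i}\big)^{1/m}/\big(\prod_{j}x_{j}\big)^{1/n}$ over positive vectors $x$, the sketch below evaluates that objective at a few points. The function name `capacity_objective` and the inputs are illustrative assumptions; the code evaluates the objective only and does not compute the infimum.

```python
import math

def capacity_objective(B, x):
    """m * (prod_i (Bx)_i)^(1/m) / (prod_j x_j)^(1/n) for a positive vector x."""
    m, n = len(B), len(B[0])
    Bx = [sum(B[i][j] * x[j] for j in range(n)) for i in range(m)]
    return m * math.prod(Bx) ** (1.0 / m) / math.prod(x) ** (1.0 / n)

B = [[1.0, 1.0], [1.0, 1.0]]  # doubly balanced, size s(B) = 4
value_at_ones = capacity_objective(B, [1.0, 1.0])
value_scaled = capacity_objective(B, [3.0, 3.0])
value_skewed = capacity_objective(B, [1.0, 4.0])
print(value_at_ones, value_scaled, value_skewed)
```

Any single evaluation is an upper bound on the capacity. At the all-ones vector the objective equals $m\big(\prod_{i}r_{i}\big)^{1/m}$, which is exactly $s(B)$ when all row sums are equal, and the objective is unchanged when $x$ is multiplied by a positive constant.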

Optimization problems of this form are also studied in functional analysis [5] and in approximation algorithms [50].

In general, when ${\cal A}$ is ${\epsilon}$ -nearly doubly balanced [29, 20, 45], it is proved that

$$s({\cal A})\geq{\rm cap}({\cal A})\geq(1-mn{\epsilon})\,s({\cal A}).$$

Using a connection between the convergence rate of the gradient flow and the operator capacity developed in [45], we show a much stronger bound for operators that also satisfy the spectral gap condition.

Theorem 1.8 (Capacity).

Under the assumptions in Theorem 1.5,

$$s({\cal A})\geq{\rm cap}({\cal A})\geq\left(1-\frac{4{\epsilon}^{2}}{\lambda}\right)s({\cal A}).$$

The capacity of an operator is used in bounding the permanent of a matrix [47], the Brascamp-Lieb constant of an operator [21], and the total movement to a nearby doubly balanced operator [45]. We will discuss the implications of Theorem 1.8 to these applications in Section 1.4.

1.4 Applications of Matrix Scaling and Operator Scaling

The matrix scaling and operator scaling problems have many applications, and we discuss some implications of our results in this section.

1.4.1 Matrix Scaling

In the matrix scaling problem, we are given a non-negative matrix $B\in\mathbb{R}^{m\times n}$ , and the goal is to find a left diagonal scaling matrix $L\in\mathbb{R}^{m\times m}$ and a right diagonal scaling matrix $R\in\mathbb{R}^{n\times n}$ such that $LBR$ is doubly balanced (i.e. every row sum is the same and every column sum is the same; see Section 4.1 for definition), or report that such scaling matrices do not exist.

The matrix scaling problem is a special case of the operator scaling problem (Section 4.1.1) and so the spectral analysis also applies. In the case of matrix scaling, the spectral condition in Definition 1.4 is simply $\sigma_{2}(B)\leq(1-\lambda)s(B)/\sqrt{mn}$ (Section 4.1.2). Using Cheeger’s inequality, we show that this spectral gap condition is closely related to the conductance of the weighted bipartite graph associated to $B$ (Section 4.1.3). These imply that many random matrices will satisfy the condition in Theorem 1.5 (Section 4.1.4).

Our results have implications for the matrix scaling problem, e.g. stronger results for random matrices. For bipartite matching, we show that the gradient flow converges quickly to a fractional perfect matching in an almost regular bipartite expander graph (Section 4.1.5).

Corollary 1.9.

Suppose $G=(X,Y;E)$ is a bipartite graph with $|X|=|Y|$ where each vertex $v$ satisfies $(1-{\epsilon})|E|/|X|\leq\deg(v)\leq(1+{\epsilon})|E|/|X|$ for some ${\epsilon}$ . If the graph conductance $\phi(G)$ satisfies $\phi(G)^{4}\geq C{\epsilon}\log|X|$ for some sufficiently large constant $C$ , then the gradient flow converges to an $\eta$ -nearly perfect fractional matching in time $t=O\left(\frac{1}{\phi^{2}(G)}\log\left(\frac{|X|}{\eta}\right)\right)$ .

For the permanent, Van der Waerden’s conjecture states that the permanent of a doubly stochastic $n\times n$ matrix is at least $n!/n^{n}\geq e^{-n}$; this was proven in [15, 16, 28]. The capacity lower bound in Theorem 1.8 can be used to prove a Van der Waerden-type lower bound on the permanent of matrices satisfying the spectral gap condition (not necessarily doubly stochastic).
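The bound $n!/n^{n}$ is attained by the doubly stochastic matrix with all entries $1/n$, which can be confirmed by brute force for small $n$. The sketch below is illustrative only: the `permanent` helper enumerates all $n!$ permutations, so it is usable only for very small matrices.

```python
import math
from itertools import permutations

def permanent(B):
    """Brute-force permanent: sum over permutations of products of entries."""
    n = len(B)
    return sum(math.prod(B[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

n = 4
J = [[1.0 / n] * n for _ in range(n)]    # doubly stochastic, all entries 1/n
per_J = permanent(J)
lower = math.factorial(n) / n ** n       # Van der Waerden bound n!/n^n
print(per_J, lower, math.exp(-n))
```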

Corollary 1.10.

If a non-negative matrix $B\in\mathbb{R}^{n\times n}$ is ${\epsilon}$ -nearly doubly balanced with $s(B)=n$ , and $\sigma_{2}(B)\leq 1-\lambda$ with $\lambda^{2}\geq C{\epsilon}\log n$ for some sufficiently large constant $C$ , then

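$$\operatorname{per}(B)\geq\exp\left(-n\left(1+\Theta\left(\frac{{\epsilon}^{2}}{\lambda}\right)\right)\right).$$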

For example, consider a random matrix $A$ with each entry an independent random variable $A_{ij}=g_{ij}^{2}$ where $g_{ij}$ is sampled from the Gaussian distribution $N(0,\frac{1}{n})$ . The corollary implies that $\operatorname{per}(A)\geq e^{-n}/\operatorname{poly}(n)$ with high probability. This implies a sub-exponential approximation of the permanent for this class of matrices [6]. See Section 4.1.6 for details.

For optimal transportation distance, we can use the condition number result in Theorem 1.7 to bound the Sinkhorn distance [14, 52], which is receiving increasing attention in computer vision and machine learning (Section 4.1.7).

The condition number result in Theorem 1.7 can also be used to show that the second-order method for matrix scaling [13, 2], as stated in Theorem 1.3, runs in near-linear time on instances satisfying the spectral gap assumption.

1.4.2 Frame Scaling

In the frame scaling problem, we are given $n$ vectors $u_{1},\ldots,u_{n}\in\mathbb{R}^{d}$ , and the goal is to find a matrix (a linear transformation) $M\in\mathbb{R}^{d\times d}$ such that if we set $v_{i}=Mu_{i}/\left\lVert Mu_{i}\right\rVert_{2}$ then $\sum_{i=1}^{n}v_{i}v_{i}^{*}=I_{d}$ . This problem was studied in communication complexity [17], machine learning [33], and in frame theory [45, 32].

The frame scaling problem is a special case of the operator scaling problem (Section 4.2.1) and so the spectral analysis also applies. In the case of frame scaling, the spectral condition in Definition 1.4 has a nice form (Section 4.2.2): Let $G\in\mathbb{R}^{n\times n}$ be the squared Gram matrix where $G_{ij}=\langle u_{i},u_{j}\rangle^{2}$ . Then the spectral condition is equivalent to $\lambda_{2}(G)\leq(1-\lambda)^{2}s^{2}/(dn)$ where $\lambda_{2}(G)$ is the second largest eigenvalue of $G$ and $s$ is the size of the frame defined as $\sum_{i=1}^{n}\left\lVert u_{i}\right\rVert^{2}$ . We will prove in Section 5 that this condition is satisfied for random frames with high probability.

Theorem 1.11.

If we generate $n$ random unit vectors $u_{1},\ldots,u_{n}\in\mathbb{R}^{d}$ with $n=\Omega(d^{4/3})$ , then the resulting frame is ${\epsilon}$ -nearly doubly balanced for ${\epsilon}\ll 1/\log d$ and satisfies the spectral gap condition with constant $\lambda$ with probability at least $0.99$ .

For intuition, if each $u_{i}$ is a random unit vector, then the expected value of $G_{ij}=\langle u_{i},u_{j}\rangle^{2}$ for $i\neq j$ is $1/d$, and so the expected matrix $G$ is $J_{n}/d+(d-1)I_{n}/d$, where $J_{n}$ is the $n$-by-$n$ all-one matrix. The matrix $J_{n}$ has the largest spectral gap, and we expect that a random frame will have its squared Gram matrix $G$ close to $J_{n}/d+(d-1)I_{n}/d$ and thus a large spectral gap. The proof is by a low moment analysis of the trace method commonly used in random matrix theory (Section 5).
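The value $1/d$ can be derived in two lines. The sketch below assumes that "random unit vector" means independent and uniformly distributed on the unit sphere in $\mathbb{R}^{d}$ (the distribution is an assumption here), and it concerns only the expected matrix, not a random sample.

```latex
\begin{align*}
&\text{By rotation invariance, condition on } u_{j} \text{ and rotate so that } u_{j}=e_{1}:
  \quad \langle u_{i},u_{j}\rangle^{2}=u_{i,1}^{2}.\\
&\sum_{k=1}^{d}u_{i,k}^{2}=1 \ \text{ and the coordinates are exchangeable}
  \ \Longrightarrow\ \mathbb{E}\,u_{i,1}^{2}=\frac{1}{d}.\\
&\mathbb{E}\,G=\frac{1}{d}(J_{n}-I_{n})+I_{n}=\frac{J_{n}}{d}+\frac{d-1}{d}I_{n},
  \qquad \lambda_{1}(\mathbb{E}\,G)=\frac{n+d-1}{d},
  \qquad \lambda_{2}(\mathbb{E}\,G)=\frac{d-1}{d}.
\end{align*}
```

For unit vectors the size is $s=n$, so the condition $\lambda_{2}(G)\leq(1-\lambda)^{2}s^{2}/(dn)$ applied to the expected matrix reads $(d-1)/d\leq(1-\lambda)^{2}n/d$, which holds for a constant $\lambda$ once $n$ is a sufficiently large multiple of $d$.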

One significant implication of our result is the Paulsen problem on random frames. Given a frame $U=(u_{1},\ldots,u_{n})$ where each $u_{i}\in\mathbb{R}^{d}$ satisfying

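$$(1-{\epsilon})I_{d}\preceq\sum_{i=1}^{n}u_{i}u_{i}^{*}\preceq(1+{\epsilon})I_{d}\quad\text{and}\quad(1-{\epsilon})\frac{d}{n}\leq\left\lVert u_{i}\right\rVert_{2}^{2}\leq(1+{\epsilon})\frac{d}{n}\quad\text{for }1\leq i\leq n,$$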

the Paulsen problem asks whether there always exists a frame $V=(v_{1},\ldots,v_{n})$ where each $v_{i}\in\mathbb{R}^{d}$ satisfying $\sum_{i=1}^{n}v_{i}v_{i}^{*}=I_{d}$ , $\left\lVert v_{i}\right\rVert_{2}^{2}=d/n$ for $1\leq i\leq n$ , and $\operatorname{{\rm dist}^{2}}(U,V):=\sum_{i=1}^{n}\left\lVert u_{i}-v_{i}\right\rVert_{2}^{2}$ small. It was an open problem whether $\operatorname{{\rm dist}^{2}}(U,V)$ can be bounded by a function independent of the number of vectors $n$ . Recently, this question was answered positively in [45], showing that $\operatorname{{\rm dist}^{2}}(U,V)\leq O(d^{13/2}{\epsilon})$ . This bound is improved to $O(d^{2}{\epsilon})$ by Hamilton and Moitra [32] with a much simpler proof. There are examples showing that $\operatorname{{\rm dist}^{2}}(U,V)\geq\Omega(d{\epsilon})$ , so the upper bound and the lower bound almost match in the worst case.

The Paulsen problem was posed [36] because it is difficult to generate $V$ that satisfies the conditions exactly but easier to generate $U$ that almost satisfies the conditions. In fact, few ways are known to generate $U$ that almost satisfies the conditions with small ${\epsilon}$, and almost all known constructions are random frames [36, 59]. Even for the few constructions that are deterministic (such as equiangular lines), it is likely that they satisfy the spectral gap assumption. So, for the Paulsen problem, the inputs of interest satisfy the spectral gap assumption, and we can prove a much stronger bound $O(d{\epsilon}^{2})$ that goes beyond the worst case lower bound.

Theorem 1.12.

Let $U=(u_{1},\ldots,u_{n})$ be a random frame with $n=\Omega(d^{4/3})$ , where each $u_{i}\in\mathbb{R}^{d}$ is an independent random vector with $\left\lVert u_{i}\right\rVert_{2}^{2}=d/n$ . Suppose $(1-{\epsilon})I_{d}\preceq\sum_{i=1}^{n}u_{i}u_{i}^{*}\preceq(1+{\epsilon})I_{d}$ . Then, with probability at least $0.99$ , there exists a frame $V=(v_{1},\ldots,v_{n})$ with $\sum_{i=1}^{n}v_{i}v_{i}^{*}=I_{d}$ , $\left\lVert v_{i}\right\rVert_{2}^{2}=d/n$ for $1\leq i\leq n$ , and $\operatorname{{\rm dist}^{2}}(U,V)\leq O(d{\epsilon}^{2})$ .

We also demonstrate how the results in spectral analysis can be used to construct $V$ with the additional property that $|\langle v_{i},v_{j}\rangle|$ is small for $1\leq i\neq j\leq n$ , which is an original motivation for the Paulsen problem (Section 4.2.4).

Theorem 1.13.

For $n=d^{2}$ , there exists a doubly balanced frame $V=(v_{1},\ldots,v_{n})$ where each $v_{i}\in\mathbb{R}^{d}$ with $\left\lVert v_{i}\right\rVert=1$ and

[TABLE]

1.4.3 Operator Scaling

The operator scaling problem was used to compute the Brascamp-Lieb constant [21]. A Brascamp-Lieb datum is specified by an $m$ -tuple ${\bf B}=\{B_{j}:\mathbb{R}^{n}\to\mathbb{R}^{n_{j}}\mid 1\leq j\leq m\}$ of linear transformations and an $m$ -tuple of exponents ${\bf p}=\{p_{1},\ldots,p_{m}\}$ . The Brascamp-Lieb constant ${\rm BL}({\bf B},{\bf p})$ of this datum is defined as the smallest $C$ such that for every $m$ -tuple $\{f_{j}:\mathbb{R}^{n_{j}}\to\mathbb{R}_{\geq 0}\mid 1\leq j\leq m\}$ of non-negative functions which are integrable, we have

[TABLE]

This is a common generalization of many useful inequalities; see [8, 21]. It turns out that the functions $f_{j}$ for which the inequality is tight are density functions of Gaussians [46], and this implies the Brascamp-Lieb constant can be written in a form very similar to the capacity of an operator (see Section 4.3.1). This is used in [21] to compute the Brascamp-Lieb constant through operator scaling.

Using this connection, we can derive upper bounds on the Brascamp-Lieb constant using the capacity lower bound in Theorem 1.8.

Corollary 1.14.

Given a datum $({\bf B},{\bf p})$ with $B_{j}:\mathbb{R}^{n}\to\mathbb{R}^{n_{j}}$ for $1\leq j\leq m$ and $\sum_{j=1}^{m}p_{j}n_{j}=n$ , if $({\bf B},{\bf p})$ is ${\epsilon}$ -nearly geometric and satisfies the $\lambda$ -spectral gap condition with $\lambda^{2}\geq C{\epsilon}\log n$ for some sufficiently large constant $C$ and $\sum_{j=1}^{m}p_{j}\left\lVert B_{j}\right\rVert_{F}^{2}=n$ , then

[TABLE]

An interesting special case of the Brascamp-Lieb inequality is the rank one case $B_{j}=u_{j}^{*}$ where $u_{j}\in\mathbb{R}^{d}$ and $n_{j}=1$ and $p_{j}=d/m$ for $1\leq j\leq m$ which was studied in [5]. In this case, the capacity of the operator ${\cal A}$ from the reduction (Section 4.3.1) is

[TABLE]

which is a form that is also studied in approximation algorithms [50]. Using the results in Section 5 and the above corollary, we can show that if each $u_{i}$ is an independent random unit vector and $m\geq\Omega(d^{4/3})$ , then $m\geq{\rm cap}({\cal A})\geq m\left(1-4d\log d/m\right)$ and $1\leq{\rm BL}({\bf B},{\bf p})\leq d^{\Theta(d)}$ ; see Example 4.27. Note that this is independent of the number of vectors.

The operator scaling algorithm is used in [20, 19] to compute the non-commutative rank of a symbolic matrix. We show in Section 4.3.2 that an operator satisfying the spectral gap condition has full non-commutative rank.

In solving the orbit intersection problem [1], the result of a generalization of the Paulsen problem to the operator setting in [45] was used. As in Theorem 1.12, we prove a much stronger bound in Section 4.3.3 on the squared distance when the operator satisfies the spectral gap condition.

1.5 Techniques

We are not aware of previous work on spectral analysis of matrix scaling and operator scaling. To our knowledge, the results are new even in the well-studied special case of matrix scaling. The closest work in this direction that we are aware of is a recent work by Rudelson, Samorodnitsky and Zeitouni [53], who analyze the condition number of the matrix scaling solution when the matrix satisfies some strong (vertex) expansion property using a combinatorial argument.

In the following, we discuss the previous techniques used in analyzing the continuous operator scaling algorithm, and then discuss the techniques used in this paper.

1.5.1 Comparisons with Previous Techniques

The operator capacity defined by Gurvits [29] was used crucially as a potential function to analyze the discrete operator scaling algorithms in [29, 20] as well as the continuous operator scaling algorithm in [45].

A smoothed analysis of matrix scaling was presented in [45] for solving the Paulsen problem. It was shown that if most of the entries of an $m\times n$ matrix with $m\leq n$ are at least $\sigma^{2}$ for a large enough $\sigma$ , then the continuous matrix scaling algorithm has linear convergence with rate at least $\sigma^{2}n$ . This combinatorial assumption is restrictive and only applies in the matrix scaling setting. Note that the combinatorial assumption implies the spectral gap assumption in Definition 1.4 with $\lambda\geq\Omega(\sigma^{2})$ but not vice versa. Through a reduction from operator capacity to matrix capacity, the smoothed analysis can be extended to the frame setting but the proof was complicated, and it was not known whether the smoothed analysis can be extended to the general operator setting. The main difficulty is that there is no analogous combinatorial condition in the frame setting and in the operator setting to guarantee the linear convergence. This is an illustration of the difference between the matrix case and the noncommutative operator case, in which there is no natural basis to consider. In this paper, we have found a natural spectral condition to guarantee linear convergence directly in the general operator setting. As a consequence, we do not need to go through the operator capacity to analyze the convergence rate of the operator scaling algorithm, which is different from previous analyses. Nonetheless, we can use the linear convergence to prove a lower bound on the operator capacity as was done in [45].

1.5.2 Outline of Spectral Analysis

We illustrate the main ideas of the spectral analysis in the simpler matrix scaling setting and mention how these ideas can be generalized to the operator setting. For gradient descent, a common approach to prove linear convergence is to show that the Hessian matrix has small condition number. Instead, our approach is to directly analyze the change of $\Delta$ . In the matrix scaling setting, it follows from Lemma 4.2.9 in [45] that

[TABLE]

where $B\in\mathbb{R}^{m\times n}$ is the current non-negative matrix, and $s,r_{i},c_{j}$ are the size, the $i$ -th row sum and the $j$ -th column sum of $B$ respectively. We call the first two terms in the right hand side the quadratic terms and the last term the cross term. Our goal is to lower bound their sum by $\lambda s\Delta$ . To do so, we will prove a lower bound on the sum of the quadratic terms, and an upper bound on the absolute value of the cross term.

First, we prove a structural result that the maximum violation of a row and a column will not increase much throughout the continuous matrix scaling algorithm, and then we use this to show that the sum of the quadratic terms is at least $(1-{\epsilon})s\Delta$ for an ${\epsilon}$ -nearly doubly balanced matrix $B$ . Then, we write the cross term as a quadratic form of the matrix $B$ as $\vec{r}B\vec{c}$ , where $\vec{r}\in\mathbb{R}^{m}$ is the vector with the $i$ -th entry being $s-mr_{i}$ and $\vec{c}\in\mathbb{R}^{n}$ is the vector with the $j$ -th entry being $s-nc_{j}$ . The observation is that $\vec{r}\perp\vec{1_{m}}$ and $\vec{c}\perp\vec{1_{n}}$ while $\vec{1_{m}},\vec{1_{n}}$ are close to the first singular vectors of $B$ , so the cross term would be small if there is a spectral gap of the matrix $B$ . By a spectral argument, we can show that the absolute value of the cross term is at most $(1+{\epsilon}-\lambda)s\Delta$ . Combining these two bounds, we can lower bound the convergence rate to be at least $4(\lambda-4{\epsilon})s\Delta$ initially.
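The orthogonality observation behind the cross-term bound can be checked on a small example. The sketch below builds the vectors $\vec{r}$ and $\vec{c}$ from a random non-negative matrix, confirms that they are orthogonal to the all-ones vectors, and compares the quadratic form with the generic bound $\sigma_{1}(B)\lVert\vec{r}\rVert_{2}\lVert\vec{c}\rVert_{2}$, which holds without any spectral assumption. The matrix, its dimensions, and the variable names are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 6
B = rng.uniform(0.5, 1.5, size=(m, n))        # non-negative matrix

s = B.sum()                                    # size of B
row, col = B.sum(axis=1), B.sum(axis=0)        # row sums r_i and column sums c_j
r_vec = s - m * row                            # i-th entry s - m r_i
c_vec = s - n * col                            # j-th entry s - n c_j

cross = r_vec @ B @ c_vec                      # quadratic form of B
sigma = np.linalg.svd(B, compute_uv=False)
generic_bound = sigma[0] * np.linalg.norm(r_vec) * np.linalg.norm(c_vec)
```

The spectral argument improves on `generic_bound` by exploiting that both vectors avoid the (approximate) top singular directions, so the relevant singular value is closer to $\sigma_{2}(B)$ than to $\sigma_{1}(B)$.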

To prove that the convergence rate is at least $\lambda s\Delta$ for all time, we need to prove that the spectral gap condition is maintained throughout the continuous matrix scaling algorithm. To do so, we argue through the condition number of the scaling solutions. We use the structural result and the linear convergence to show that the condition number of the scaling solution is small, and then we show that the singular values of the matrix would not change much if we scale the matrix $B$ by diagonal matrices of small condition numbers. Finally, we use an inductive argument to prove that the linear convergence is maintained for all time. The results for condition numbers and capacity follow from the arguments developed and the linear convergence.
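The claim that singular values are stable under well-conditioned diagonal scalings can be illustrated numerically. The sketch below uses the standard multiplicative inequality $\sigma_{i}(D_{1}BD_{2})\leq\lVert D_{1}\rVert_{\rm op}\lVert D_{2}\rVert_{\rm op}\,\sigma_{i}(B)$ together with its counterpart for the inverse scalings; the specific matrices and the range of the diagonal entries are assumptions chosen for the example, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.uniform(0.0, 1.0, size=(4, 6))
d1 = rng.uniform(0.9, 1.1, size=4)             # diagonal of the left scaling
d2 = rng.uniform(0.9, 1.1, size=6)             # diagonal of the right scaling

sv = np.linalg.svd(B, compute_uv=False)
sv_scaled = np.linalg.svd(d1[:, None] * B * d2[None, :], compute_uv=False)

lo = d1.min() * d2.min()                       # smallest possible stretch factor
hi = d1.max() * d2.max()                       # largest possible stretch factor
```

Every scaled singular value lies in the interval $[\texttt{lo}\cdot\sigma_{i}(B),\ \texttt{hi}\cdot\sigma_{i}(B)]$, so a condition number close to one changes the spectrum by a factor close to one.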

The proof for the general operator setting has the same structure, with more involved technical details in some steps. To prove the structural result that the operator norm of the error matrices would not increase much throughout the continuous operator scaling algorithm, we need to use the envelope theorem to bound the maximum eigenvalue and the minimum eigenvalue. To bound the condition number of the scaling solutions, we need to use results from the theory of product integration to analyze the scaling solutions. For readers who are more interested in matrix scaling and/or who would like to understand the spectral analysis in a simpler setting first, we include a self-contained proof for the matrix scaling case in Appendix B even though the matrix scaling result is completely generalized by the operator scaling result.

1.6 Organization

We first review some background about completely positive linear operators and the continuous operator scaling algorithm in Section 2. We then prove the main technical results in Section 3 and show various applications in Section 4. We provide a proof in Section 5 that a random frame satisfies the spectral condition with high probability. In Appendix B, we provide a self-contained proof of Theorem 1.5 in the special case of matrix scaling.

2 Preliminaries

We first review in Section 2.1 some background in quantum information theory about completely positive maps and discuss the spectral gap condition stated in Definition 1.4. Then, we review the known results about the continuous operator scaling algorithm in Section 2.2.

2.1 Positive Linear Maps, Matrix Representations, Quantum Expanders

First, we define completely positive linear maps and their natural matrix representation in Section 2.1.1. Then, in Section 2.1.2, we present the spectral gap condition in Definition 1.4 using this language, and compare to the notion of quantum expanders studied in the literature. Finally, we introduce the Choi matrix in Section 2.1.3 and state some facts about tensors and completely positive maps that we will use in our proof.

2.1.1 Completely Positive Linear Map

Given ${\mathcal{A}}=(A_{1},\ldots,A_{k})$ where $A_{i}\in\mathbb{R}^{m\times n}$ for $1\leq i\leq k$ , it can be used to define a linear map $\Phi:\mathbb{R}^{n\times n}\to\mathbb{R}^{m\times m}$ as

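$$\Phi(X)=\sum_{i=1}^{k}A_{i}XA_{i}^{*}\quad\text{and}\quad\Phi^{*}(Y)=\sum_{i=1}^{k}A_{i}^{*}YA_{i},\qquad(2.1)$$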

where $\Phi^{*}:\mathbb{R}^{m\times m}\to\mathbb{R}^{n\times n}$ is the adjoint map so that $\langle X,\Phi(Y)\rangle=\langle\Phi^{*}(X),Y\rangle$ for any $X\in\mathbb{R}^{m\times m}$ and $Y\in\mathbb{R}^{n\times n}$ , where $\langle P,Q\rangle:=\operatorname{tr}(P^{*}Q)=\sum_{i,j}P_{ij}^{*}Q_{ij}$ is the Hilbert-Schmidt inner product.
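A minimal sketch of the map and its adjoint for real Kraus operators, where $A_{i}^{*}$ is the transpose. It checks the defining adjoint identity $\langle X,\Phi(Y)\rangle=\langle\Phi^{*}(X),Y\rangle$ on random inputs. The function names `phi`, `phi_adj`, `hs_inner` and the dimensions are hypothetical choices for the example.

```python
import numpy as np

def phi(kraus, Y):
    """Phi(Y) = sum_i A_i Y A_i^T, mapping n x n matrices to m x m matrices."""
    return sum(A @ Y @ A.T for A in kraus)

def phi_adj(kraus, X):
    """Phi^*(X) = sum_i A_i^T X A_i, mapping m x m matrices to n x n matrices."""
    return sum(A.T @ X @ A for A in kraus)

def hs_inner(P, Q):
    """Hilbert-Schmidt inner product tr(P^T Q) for real matrices."""
    return float(np.sum(P * Q))

rng = np.random.default_rng(3)
m, n, k = 3, 4, 2
kraus = [rng.standard_normal((m, n)) for _ in range(k)]
X = rng.standard_normal((m, m))
Y = rng.standard_normal((n, n))

lhs = hs_inner(X, phi(kraus, Y))
rhs = hs_inner(phi_adj(kraus, X), Y)
```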

Definition 2.1 (Completely Positive Map).

A linear map $\Phi$ is positive if $\Phi(Y)\succeq 0$ for every $Y\succeq 0$ , where $Y\succeq 0$ denotes that $Y$ is a positive semidefinite matrix. A linear map $\Phi$ is completely positive if $\Phi\otimes I_{l}$ is positive for every natural number $l\geq 1$ (see [61] for more details).

Theorem 2.2 (Choi [12]).

A linear map $\Phi$ is completely positive if and only if it can be written as the form described in (2.1).

The matrices $A_{1},\ldots,A_{k}$ are called the Kraus operators of $\Phi$ . Note that the Kraus operators are not uniquely defined for a linear map $\Phi$ .

Definition 2.3 (Doubly Balanced Map).

A linear map $\Phi$ is called unital if $\Phi(I_{n})=I_{m}$ . A linear map $\Phi$ is called trace preserving if $\Phi^{*}(I_{m})=I_{n}$ (which implies that $\operatorname{tr}(\Phi(Y))=\operatorname{tr}(Y)$ for any $Y\in\mathbb{R}^{n\times n}$ ). A linear map $\Phi$ is called doubly balanced if there exists $c>0$ such that $cm\Phi$ is unital and $cn\Phi$ is trace preserving.

Using this terminology, the operator scaling problem can be rephrased as given the Kraus operators $(A_{1},\ldots,A_{k})$ of a completely positive map, find a left scaling matrix $L$ and a right scaling matrix $R$ so that the completely positive map defined by the Kraus operators $(LA_{1}R,\ldots,LA_{k}R)$ is non-zero doubly balanced.

For each completely positive linear map $\Phi$ , we can associate a matrix representation describing the same linear transformation.

Definition 2.4 (Natural Matrix Representation of Linear Map).

Given a linear map $\Phi:\mathbb{R}^{n\times n}\to\mathbb{R}^{m\times m}$ , we can interpret it as a matrix $M_{\Phi}:\mathbb{R}^{n^{2}}\to\mathbb{R}^{m^{2}}$ by vectorizing the input and output matrices such that

$$M_{\Phi}\cdot{\rm vec}(X)={\rm vec}(\Phi(X))\quad\text{for all }X\in\mathbb{R}^{n\times n},$$

where ${\rm vec}:\mathbb{R}^{n\times n}\to\mathbb{R}^{n^{2}}$ is the linear map satisfying ${\rm vec}(E_{i,j})=e_{i}\otimes e_{j}$ for all $1\leq i,j\leq n$ , where $E_{i,j}$ is the $n\times n$ matrix with one in the $(i,j)$ -th entry and zero otherwise and $e_{i}\in\mathbb{R}^{n}$ is the vector with one in the $i$ -th entry and zero otherwise.

There is a one-to-one correspondence between the matrix representations and the linear maps. Given a matrix $M:\mathbb{R}^{n^{2}}\to\mathbb{R}^{m^{2}}$ , we can also interpret it as a map $\Phi_{M}:\mathbb{R}^{n\times n}\to\mathbb{R}^{m\times m}$ by matrixizing the input and output vectors such that

$$\Phi_{M}(X)={\rm mat}(M\cdot{\rm vec}(X))\quad\text{for all }X\in\mathbb{R}^{n\times n},$$

where ${\rm mat}:\mathbb{R}^{n^{2}}\to\mathbb{R}^{n\times n}$ is the linear map satisfying ${\rm mat}(e_{i}\otimes e_{j})=E_{i,j}$ .

The matrix representation of a completely positive map has a nice form in terms of its Kraus operators.

Fact 2.5 (Proposition 2.20 in [61]).

Given a completely positive map $\Phi_{{\cal A}}$ with Kraus operators ${\cal A}$ , the matrix representation $M_{{\cal A}}$ can be written in the form described in Definition 1.4 such that

$$M_{{\cal A}}=\sum_{i=1}^{k}A_{i}\otimes A_{i}.$$
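The correspondence between $\Phi_{{\cal A}}$ and its matrix representation can be verified directly. The sketch below assumes real Kraus operators and takes $M_{{\cal A}}=\sum_{i}A_{i}\otimes A_{i}$; with row-major flattening, ${\rm vec}(E_{i,j})=e_{i}\otimes e_{j}$ as in Definition 2.4. The dimensions and variable names are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k = 3, 2, 3
kraus = [rng.standard_normal((m, n)) for _ in range(k)]

# Candidate matrix representation M = sum_i A_i (x) A_i, of shape m^2 x n^2.
M = sum(np.kron(A, A) for A in kraus)

X = rng.standard_normal((n, n))
phi_X = sum(A @ X @ A.T for A in kraus)

def vec(Z):
    """Row-major flattening, which sends E_{i,j} to e_i (x) e_j."""
    return Z.reshape(-1)

residual = np.linalg.norm(M @ vec(X) - vec(phi_X))
```

The check relies on the identity $(A\otimes B)\,{\rm vec}(X)={\rm vec}(AXB^{*})$ for row-major vectorization; a column-major convention would swap the two tensor factors.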

2.1.2 Spectral Gap Condition and Quantum Expanders

Given the correspondence between the completely positive linear map $\Phi_{{\cal A}}$ and the natural matrix representation $M_{{\cal A}}$ , the spectral gap condition in Definition 1.4 can be presented as follows.

Definition 2.6 (Spectral Gap Condition of $\Phi$ ).

Given an operator ${\cal A}=(A_{1},\ldots,A_{k})$ where $A_{i}\in\mathbb{R}^{m\times n}$ for $1\leq i\leq k$ , let

[TABLE]

and $Y_{1},y_{1}$ as maximizers to the optimization problems with $y_{1}={\rm vec}(Y_{1})$ . Let

[TABLE]

The spectral gap condition in Definition 1.4 is equivalent to $\sigma_{2}(\Phi_{{\cal A}})\leq(1-\lambda)s({\cal A})/\sqrt{mn}$ .
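Given the equivalence above, the largest admissible $\lambda$ can be read off from the second singular value as $\lambda=1-\sigma_{2}(\Phi_{{\cal A}})\sqrt{mn}/s({\cal A})$. The following sketch computes it, taking $M_{{\cal A}}=\sum_{i}A_{i}\otimes A_{i}$ as the matrix representation for real Kraus operators (an assumption of this sketch); the helper name `spectral_gap` and the test operators are hypothetical.

```python
import numpy as np

def spectral_gap(kraus):
    """Largest lambda with sigma_2(M) <= (1 - lambda) * s / sqrt(m n)."""
    m, n = kraus[0].shape
    M = sum(np.kron(A, A) for A in kraus)            # matrix representation
    s = sum(np.sum(A * A) for A in kraus)            # size of the operator
    sigma = np.linalg.svd(M, compute_uv=False)       # sorted in decreasing order
    return 1.0 - sigma[1] * np.sqrt(m * n) / s

rng = np.random.default_rng(5)
gap_identity = spectral_gap([np.eye(3)])             # Phi(Y) = Y has no gap
gap_random = spectral_gap([rng.standard_normal((3, 3)) for _ in range(6)])
```

For the identity map all singular values coincide, so the gap is zero; a value returned for an operator that is far from doubly balanced may be negative, in which case the condition fails for every $\lambda\geq 0$.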

The concept of quantum expander was studied by Hastings [35] and Ben-Aroya, Schwartz, and Ta-Shma [7], which was stated using the above language with $m=n$ .

Definition 2.7 (Quantum Expander [35, 7]).

An operator ${\cal A}=(A_{1},\ldots,A_{k})$ where each $A_{i}\in\mathbb{R}^{n\times n}$ is called a $(1-\lambda)$ -quantum expander if

1. The largest singular value is $s({\cal A})/n$ and the identity matrix $I_{n}$ is the largest left and right singular vector, i.e.

[TABLE]

2. For any $Y$ orthogonal to $I_{n}$ , it holds that

[TABLE]

In [7, 35], the map $\Phi$ is defined as $\frac{1}{k}\sum_{i=1}^{k}U_{i}YU_{i}^{*}$ , where $U_{i}\in\mathbb{R}^{n\times n}$ is a unitary matrix. Then, the size of this operator is equal to $n$ , and the largest singular value is $1$ achieved at the identity matrix.
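The two claims in the preceding paragraph can be checked for a mixture of random orthogonal matrices, the real analogue of the unitary case. In the sketch below the Kraus operators are $A_{i}=U_{i}/\sqrt{k}$, so that $\Phi(Y)=\frac{1}{k}\sum_{i}U_{i}YU_{i}^{*}$. Sampling $U_{i}$ through a QR factorization, the helper name `random_orthogonal`, and the values of $n$ and $k$ are assumptions of the example.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))                    # fix the sign of each column

rng = np.random.default_rng(6)
n, k = 4, 3
kraus = [random_orthogonal(n, rng) / np.sqrt(k) for _ in range(k)]   # A_i = U_i / sqrt(k)

size = sum(np.sum(A * A) for A in kraus)              # should equal n
image_of_identity = sum(A @ A.T for A in kraus)       # Phi(I_n), should equal I_n
M = sum(np.kron(A, A) for A in kraus)                 # matrix representation
sigma = np.linalg.svd(M, compute_uv=False)
```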

When the operator ${\cal A}$ is ${\epsilon}$ -nearly doubly balanced, we will show in Lemma 3.6 that $\sigma_{1}(\Phi_{{\cal A}})\leq(1+{\epsilon})s({\cal A})/\sqrt{mn}$ and $I_{n}$ is an approximate optimizer. Therefore, in the case $m=n$ , the spectral gap condition in Definition 1.4 is a more relaxed version of the quantum expander definition in [7], where we do not require $I_{n}$ to be the optimizer (but only an approximate optimizer).

From random matrix theory [58], almost all random non-negative matrices (from reasonable distributions) have a constant spectral gap, i.e. $\lambda$ is a constant. For random operators, Hastings [35] proved that the operator ${\cal A}$ has an almost Ramanujan spectral gap with $\lambda=1-2\sqrt{k-1}/k$ if each $A_{i}$ is a random unitary matrix. This result has been extended recently by González-Guillén, Junge and Nechita to more general distributions [24]. It is reasonable to expect that most random operators have a constant spectral gap. There are also deterministic constructions of quantum expanders [7]. See [7, 35] for some applications of quantum expanders.

2.1.3 Choi Matrix and Useful Facts

There is another matrix representation that is useful in studying completely positive linear maps.

Definition 2.8 (Choi Matrix).

Given a completely positive linear map $\Phi_{{\cal A}}:\mathbb{R}^{n\times n}\to\mathbb{R}^{m\times m}$ , the Choi matrix $Q_{{\cal A}}\in\mathbb{R}^{mn\times mn}$ is defined as

[TABLE]

Using the Choi matrix, we can rephrase the operator scaling problem as finding left scaling matrix $L\in\mathbb{R}^{m\times m}$ and right scaling matrix $R\in\mathbb{R}^{n\times n}$ so that the scaled Choi matrix $P:=(L\otimes R)Q(L\otimes R)^{*}$ satisfies

[TABLE]

where the partial trace operations $\operatorname{tr}_{n}$ and $\operatorname{tr}_{m}$ are linear functions that satisfy $\operatorname{tr}_{n}(X\otimes Y):=\operatorname{tr}(Y)\cdot X$ and $\operatorname{tr}_{m}(X\otimes Y)=\operatorname{tr}(X)\cdot Y$ for $X\in\mathbb{R}^{m\times m}$ and $Y\in\mathbb{R}^{n\times n}$ . This phrasing of the operator scaling problem is in line with the more general quantum marginal problem [11].
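The partial trace operations are determined by their action on product matrices and linearity. A small sketch, assuming the $mn\times mn$ matrix is indexed so that the $\mathbb{R}^{m}$ factor comes first (the same ordering as `np.kron(X, Y)`); the function name `partial_traces` is hypothetical.

```python
import numpy as np

def partial_traces(P, m, n):
    """Return (tr_n(P), tr_m(P)) for an mn x mn matrix P acting on R^m (x) R^n."""
    T = P.reshape(m, n, m, n)                # indices (i, a, j, b)
    tr_n = np.einsum('iaja->ij', T)          # trace out the R^n factor: m x m result
    tr_m = np.einsum('iaib->ab', T)          # trace out the R^m factor: n x n result
    return tr_n, tr_m

rng = np.random.default_rng(13)
m, n = 2, 3
X = rng.standard_normal((m, m))
Y = rng.standard_normal((n, n))
tr_n, tr_m = partial_traces(np.kron(X, Y), m, n)
```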

The following facts will be useful in our proofs. All but (4) are relatively straightforward.

Fact 2.9.

In the following, $\Phi_{{\cal A}}$ is the completely positive map with Kraus operators ${\cal A}=(A_{1},\ldots,A_{k})$ where each $A_{i}\in\mathbb{R}^{m\times n}$ .

1. For any matrices $A,X\in\mathbb{R}^{m\times m}$ and $B,Y\in\mathbb{R}^{n\times n}$ ,

[TABLE]

2. $\Phi_{{\cal A}}(Y)\succeq 0$ for any $Y\succeq 0$ .

3. For any $X\in\mathbb{R}^{m\times m}$ and $Y\in\mathbb{R}^{n\times n}$ ,

[TABLE]

4. Let $L\in\mathbb{R}^{m\times m}$ and $R\in\mathbb{R}^{n\times n}$ and define the scaled operator $L{\cal A}R:=\{LA_{1}R,\ldots,LA_{k}R\}$ . Then,

[TABLE]

2.2 Continuous Operator Scaling

The continuous operator scaling algorithm was studied in [45]. We collect the definitions and the results that we will use in this subsection. We start with some definitions about operator scaling that we have already stated in the introduction.

2.2.1 Operator Scaling

Definition 2.10 (Operator).

An operator ${\cal A}$ is defined by a tuple of $m\times n$ matrices ${\cal A}=(A_{1},\ldots,A_{k})$ where $A_{i}\in\mathbb{R}^{m\times n}$ for $1\leq i\leq k$ .

Definition 2.11 (Size of an Operator).

The size of an operator ${\cal A}$ is defined as

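$$s({\cal A}):=\sum_{i=1}^{k}\left\lVert A_{i}\right\rVert_{F}^{2}=\operatorname{tr}\Big(\sum_{i=1}^{k}A_{i}A_{i}^{*}\Big)=\operatorname{tr}\Big(\sum_{i=1}^{k}A_{i}^{*}A_{i}\Big).$$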

Definition 2.12 ( ${\epsilon}$ -nearly Doubly Balanced Operator).

An operator ${\cal A}$ is called ${\epsilon}$ -nearly doubly balanced if

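$$(1-{\epsilon})\frac{s({\cal A})}{m}I_{m}\preceq\sum_{i=1}^{k}A_{i}A_{i}^{*}\preceq(1+{\epsilon})\frac{s({\cal A})}{m}I_{m}\quad\text{and}\quad(1-{\epsilon})\frac{s({\cal A})}{n}I_{n}\preceq\sum_{i=1}^{k}A_{i}^{*}A_{i}\preceq(1+{\epsilon})\frac{s({\cal A})}{n}I_{n}.$$

${\cal A}$ is called doubly balanced when ${\epsilon}=0$ .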

Definition 2.13 ( $\ell_{2}$ -error).

Given an operator ${\cal A}$ , define

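$$\Delta({\cal A}):=\frac{1}{m}\Big\lVert sI_{m}-m\sum_{i=1}^{k}A_{i}A_{i}^{*}\Big\rVert_{F}^{2}+\frac{1}{n}\Big\lVert sI_{n}-n\sum_{i=1}^{k}A_{i}^{*}A_{i}\Big\rVert_{F}^{2}.$$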

Definition 2.14 (Error Matrices).

We define the error matrices as

$$E:=sI_{m}-m\sum_{i=1}^{k}A_{i}A_{i}^{*}\quad\text{and}\quad F:=sI_{n}-n\sum_{i=1}^{k}A_{i}^{*}A_{i}.$$

Note that $\operatorname{tr}(E)=\operatorname{tr}(F)=0$ , as

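$$\operatorname{tr}(E)=sm-m\sum_{i=1}^{k}\operatorname{tr}(A_{i}A_{i}^{*})=sm-m\sum_{i=1}^{k}\left\lVert A_{i}\right\rVert_{F}^{2}=0\quad\text{and similarly}\quad\operatorname{tr}(F)=sn-n\sum_{i=1}^{k}\left\lVert A_{i}\right\rVert_{F}^{2}=0,$$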

where the last equality is by Definition 2.11. Also, we write

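$$\Delta_{E}:=\frac{1}{m}\operatorname{tr}(E^{2})\quad\text{and}\quad\Delta_{F}:=\frac{1}{n}\operatorname{tr}(F^{2}),$$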

so that $\Delta=\Delta_{E}+\Delta_{F}$ .
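The quantities of Definitions 2.11 to 2.14 are straightforward to compute. The sketch below assumes real Kraus operators, uses $E=sI_{m}-m\Phi(I_{n})$ and $F=sI_{n}-n\Phi^{*}(I_{m})$, and takes $\Delta_{E}=\operatorname{tr}(E^{2})/m$ and $\Delta_{F}=\operatorname{tr}(F^{2})/n$; the function name `scaling_error` and the random test operator are assumptions of the example.

```python
import numpy as np

def scaling_error(kraus):
    """Return (s, E, F, Delta_E, Delta_F) for real Kraus operators of shape m x n."""
    m, n = kraus[0].shape
    s = sum(np.sum(A * A) for A in kraus)                 # size: sum of ||A_i||_F^2
    E = s * np.eye(m) - m * sum(A @ A.T for A in kraus)   # E = s I_m - m Phi(I_n)
    F = s * np.eye(n) - n * sum(A.T @ A for A in kraus)   # F = s I_n - n Phi^*(I_m)
    return s, E, F, np.trace(E @ E) / m, np.trace(F @ F) / n

rng = np.random.default_rng(14)
kraus = [rng.standard_normal((2, 3)) for _ in range(4)]
s, E, F, delta_E, delta_F = scaling_error(kraus)
delta = delta_E + delta_F                                 # the l2-error
```

Both error matrices are traceless for every operator, and $\Delta$ vanishes exactly when the operator is doubly balanced.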

The $\ell_{2}$ -error is bounded for an ${\epsilon}$ -nearly doubly balanced operator.

Lemma 2.15 (Lemma 3.6.1 in [45]).

For an ${\epsilon}$ -nearly doubly balanced operator ${\cal A}$ ,

[TABLE]

2.2.2 Dynamical System

Definition 2.16 (Dynamical System).

The following dynamical system describes how ${\cal A}$ changes over time in the continuous operator scaling algorithm:

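$$\frac{d}{dt}A_{i}:=\Big(sI_{m}-m\sum_{j=1}^{k}A_{j}A_{j}^{*}\Big)A_{i}+A_{i}\Big(sI_{n}-n\sum_{j=1}^{k}A_{j}^{*}A_{j}\Big)=EA_{i}+A_{i}F\quad\text{for }1\leq i\leq k.$$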

We show in Lemma A.1 in Appendix A that the dynamical system is equivalent to the gradient flow with potential function $\Delta({\cal A})$ .
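To build intuition for the dynamical system, one can discretize it with forward-Euler steps. The sketch below takes the flow to be $\frac{d}{dt}A_{i}=EA_{i}+A_{i}F$ (an assumption of this sketch, chosen to be consistent with $\frac{d}{dt}s=-2\Delta$ in Lemma 2.17) and tracks $\Delta$ along the trajectory. The step size, the number of steps, the random initial operator, and the function names are all hypothetical choices; this is not the discrete algorithm analyzed in Section 3.9.

```python
import numpy as np

def error_terms(A):
    """A has shape (k, m, n). Return (s, E, F, Delta)."""
    k, m, n = A.shape
    s = np.sum(A * A)
    E = s * np.eye(m) - m * np.einsum('kia,kja->ij', A, A)   # s I_m - m sum A_i A_i^T
    F = s * np.eye(n) - n * np.einsum('kia,kib->ab', A, A)   # s I_n - n sum A_i^T A_i
    delta = np.sum(E * E) / m + np.sum(F * F) / n
    return s, E, F, delta

def euler_flow(A, dt=0.01, steps=2000):
    """Forward-Euler steps of dA_i/dt = E A_i + A_i F (assumed form of the flow)."""
    history = []
    for _ in range(steps):
        s, E, F, delta = error_terms(A)
        history.append(delta)
        A = A + dt * (E @ A + A @ F)      # matmul broadcasts over the k Kraus operators
    history.append(error_terms(A)[3])
    return A, np.array(history)

rng = np.random.default_rng(7)
A0 = rng.standard_normal((5, 3, 3))
A0 /= np.sqrt(np.sum(A0 * A0))            # normalize so that s = 1 initially
A_final, deltas = euler_flow(A0)
```

Plotting `np.log(deltas)` against the step index gives a quick visual check of whether the decay is roughly linear on a log scale for a given input.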

It is shown in [45] that the dynamical system will converge to a solution ${\cal A}^{(\infty)}$ with $\Delta({\cal A}^{(\infty)})=0$ . The following lemmas describe how the different quantities evolve in the dynamical system. We use the superscript (t) to represent the quantity of interest at time $t$ in the dynamical system, and omit it when the time $t$ is clear from context.

Lemma 2.17 (Lemma 3.4.2 in [45]).

The change of the size of the operator ${\cal A}^{(t)}$ at time $t$ is

$$\frac{d}{dt}s^{(t)}=-2\Delta^{(t)}.$$

The following lemma was proved directly in [45]. It can also be seen as a consequence of the fact that the dynamical system is the gradient flow on $\Delta$ .

Lemma 2.18 (Lemma 3.4.3 in [45]).

The change of $\Delta^{(t)}$ at time $t$ is

[TABLE]

The following result was used in [45] for the smoothed analysis when the dynamical system has linear convergence.

Lemma 2.19 (Proposition 4.3.1 in [45]).

Suppose there exists $\mu>0$ such that for all $0\leq t\leq T$ ,

[TABLE]

Then

[TABLE]

2.2.3 Operator Capacity

Definition 2.20 (Capacity).

The capacity of an operator ${\cal A}$ is defined as

[TABLE]

It was shown in [45] that the convergence rate of $\Delta$ can be used to derive a lower bound on operator capacity.

Proposition 2.21 (Proposition 4.3.1 in [45]).

Suppose there exists $\mu>0$ such that for all $t\geq 0$ , it holds that

[TABLE]

Then, it follows that

[TABLE]

3 Spectral Analysis of Operator Scaling

We prove the main technical results in this section.

3.1 Overview

The main goal is to show that the dynamical system in Definition 2.16 has linear convergence. Let ${\cal A}$ be an ${\epsilon}$ -nearly doubly balanced operator with $\lambda$ -spectral gap. Assuming $\lambda^{2}\geq C{\epsilon}\ln m$ for a sufficiently large constant $C$ , we will prove that for all time $t\geq 0$ ,

[TABLE]

We start by looking more closely at the expression for the change of $\Delta$ .

Lemma 3.1.

The change of $\Delta$ is

[TABLE]

Proof.

By Lemma 2.18 and Definition 2.16,

[TABLE]

and the lemma follows from Fact 2.9(3) that $\langle E,\Phi(F)\rangle=\langle Q,E\otimes F\rangle$ . ∎

We call the terms $\langle E^{2},\Phi(I_{n})\rangle$ and $\langle F^{2},\Phi^{*}(I_{m})\rangle$ the quadratic terms as they are always non-negative, and we call the term $2\langle Q,E\otimes F\rangle$ the cross term. The proof outline is the following:

1. In Section 3.2, we prove a structural result that bounds the operator norms of $E^{(t)}$ and $F^{(t)}$ throughout the dynamical system using the envelope theorem. This implies a bound on the operator norm of $\Phi^{(t)}(I_{n})$ and ${\Phi^{(t)}}^{*}(I_{m})$ , which is used to show that the sum of the quadratic terms is at least $(1-{\epsilon})s\Delta$ .

2. In Section 3.3, we bound the largest singular value of the matrix $M_{{\cal A}}$ and show that $I$ is an approximate largest singular vector, and then we use a spectral argument to upper bound the absolute value of the cross term to be at most $(1+{\epsilon}-\lambda)s\Delta$ .

3. These two parts combine to show that $-\Delta^{\prime}\geq\lambda s\Delta$ when the spectral gap condition holds. To prove the linear convergence for all time $t\geq 0$ , we need to prove that the spectral gap condition is maintained throughout the dynamical system. To do this, we bound the condition number of the scaling solutions in Section 3.5, and use it to conclude that the spectral gap condition and the linear convergence hold throughout in Section 3.6.

In Section 3.7 and Section 3.8, we use the results to prove Theorem 1.7 and Theorem 1.8 about condition number and operator capacity respectively.

Finally, in Section 3.9, we explain how to discretize the gradient flow to obtain a discrete algorithm with linear convergence under the spectral assumption.
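The split into quadratic terms and a cross term in Lemma 3.1 can be seen as the expansion of a sum of squares: writing the flow direction as $EA_{i}+A_{i}F$ (an assumption of this sketch), one has $\sum_{i}\lVert EA_{i}+A_{i}F\rVert_{F}^{2}=\langle E^{2},\Phi(I_{n})\rangle+\langle F^{2},\Phi^{*}(I_{m})\rangle+2\langle E,\Phi(F)\rangle$, where the last inner product equals $\langle Q,E\otimes F\rangle$ by Fact 2.9(3). The following numerical check of this identity uses a random real operator; all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(9)
k, m, n = 3, 3, 4
A = rng.standard_normal((k, m, n))
A /= np.sqrt(np.sum(A * A))                                # normalize the size to s = 1
s = np.sum(A * A)

def phi(Y):
    """Phi(Y) = sum_i A_i Y A_i^T."""
    return np.einsum('kia,ab,kjb->ij', A, Y, A)

def phi_adj(X):
    """Phi^*(X) = sum_i A_i^T X A_i."""
    return np.einsum('kia,ij,kjb->ab', A, X, A)

E = s * np.eye(m) - m * phi(np.eye(n))
F = s * np.eye(n) - n * phi_adj(np.eye(m))

quadratic = np.sum((E @ E) * phi(np.eye(n))) + np.sum((F @ F) * phi_adj(np.eye(m)))
cross = 2 * np.sum(E * phi(F))                             # 2 <E, Phi(F)>
sum_of_squares = np.sum((E @ A + A @ F) ** 2)              # sum_i ||E A_i + A_i F||_F^2
```

Since the left-hand side is a sum of squares, the cross term can never be more negative than the quadratic terms; the spectral gap is what makes it strictly smaller in absolute value.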

3.2 Lower Bounding the Quadratic Terms

First, we prove a structural result bounding the operator norm of the error matrices $E^{(t)}$ and $F^{(t)}$ for all $t\geq 0$ in Proposition 3.2, which will also be useful in bounding the condition number of the scaling solution in Section 3.5. Then we will use this proposition to lower bound the quadratic terms.

Proposition 3.2.

If ${\cal A}^{(0)}$ is ${\epsilon}$ -nearly doubly balanced, then for any $t\geq 0$ ,

[TABLE]

Proof.

The main idea is to show that the change of the quadratic form ${\frac{d}{dt}}u^{*}E^{(t)}u$ in the direction $u$ achieving $\left\lVert E^{(t)}\right\rVert_{\rm op}$ is at most $2\Delta^{(t)}$ , and then to use it to conclude that $\left\lVert E^{(t)}\right\rVert_{\rm op}\leq\left\lVert E^{(0)}\right\rVert_{\rm op}+\int_{0}^{t}2\Delta^{(\tau)}d\tau$ to complete the proof using Lemma 2.17. Note that the direction $u$ achieving $\left\lVert E^{(t)}\right\rVert_{\rm op}$ varies over time $t$ . To turn this idea into a formal proof, we use the generalized envelope theorem proven by Milgrom and Segal [49].

Theorem 3.3 (Corollary 4 in Milgrom and Segal [49]).

Suppose that $X$ is a nonempty compact space, $f(x,t)$ is continuous in $x$ and $f_{t}(x,t)={\frac{\partial}{\partial t}}f(x,t)$ is continuous in $(x,t)$ . Then the function $g(t)=\max_{x\in X}f(x,t)$ is differentiable almost everywhere and satisfies

[TABLE]

where $x^{*}(\tau)$ is any optimizer at time $\tau$ satisfying $g(\tau)=f(x^{*}(\tau),\tau)$ .

To apply the theorem, we define the space $X$ to be $\{0,1\}\times\{0,1\}\times\mathbb{S}^{m-1}\times\mathbb{S}^{n-1}$ , which is clearly nonempty and compact. The first coordinate indicates whether we are considering the error matrix $E$ or $F$ . The second coordinate indicates whether we are considering the largest or smallest eigenvalue of the error matrix. The third and fourth coordinates indicate the unit test vectors we are applying to $E$ and $F$ . The function $f$ is defined as follows:

[TABLE]

It is clear that $f(x,t)$ is continuous in $x\in X$ and ${\frac{\partial}{\partial t}}f(x,t)$ is continuous in $(x,t)$ . Hence, by Theorem 3.3, the function $g(t)=\max_{x\in X}f(x,t)$ satisfies

[TABLE]

Since $E^{(t)}$ and $F^{(t)}$ are Hermitian matrices,

[TABLE]

and so $g(0)\leq\epsilon s^{(0)}$ by the assumption that ${\cal A}^{(0)}$ is ${\epsilon}$ -nearly doubly balanced. To compute the partial derivative, we consider the four cases of the optimizer $x^{*}(t)$ at time $t$ one by one.

1. $x^{*}(t)=(0,0,u,v)$ . As $E^{(t)}$ and $F^{(t)}$ are Hermitian matrices, the optimizer $u$ of $\left\lVert E^{(t)}\right\rVert_{\rm op}$ is a maximum eigenvector of $E^{(t)}$ satisfying $E^{(t)}u=g(t)\cdot u$ , and $F^{(t)}\succeq-g(t)\cdot I_{n}$ as $\left\lVert E^{(t)}\right\rVert_{\rm op}\geq\left\lVert F^{(t)}\right\rVert_{\rm op}$ in this case. Then, by the definition of ${\frac{d}{dt}}A_{i}^{(t)}$ in Definition 2.16 and ${\frac{d}{dt}}s^{(t)}=-2\Delta^{(t)}$ from Lemma 2.17, it follows that

[TABLE]

where the inequality follows from $\sum_{i=1}^{k}A_{i}FA_{i}^{*}=\Phi_{{\cal A}}(F)\succeq\Phi_{{\cal A}}(-g(t)\cdot I_{n})=-g(t)\sum_{i=1}^{k}A_{i}A_{i}^{*}$ by Fact 2.9(2).

2. $x^{*}(t)=(0,1,u,v)$ . In this case, $E^{(t)}u=-g(t)\cdot u$ , $F^{(t)}\preceq g(t)\cdot I_{n}$ and by similar calculations of the first case, we have

[TABLE]

3. $x^{*}(t)=(1,0,u,v)$ . By symmetry of $E^{(t)}$ and $F^{(t)}$ , we get the same bound as the first case:

[TABLE]

4. $x^{*}(t)=(1,1,u,v)$ . By symmetry of $E^{(t)}$ and $F^{(t)}$ , we get the same bound as the second case:

[TABLE]

Therefore, in any case we have $f_{t}(x^{*}(t),t)\leq 2\Delta^{(t)}$ , and we conclude that

[TABLE]

where the first equality is by Lemma 2.17 that ${\frac{d}{dt}}s^{(t)}=-2\Delta^{(t)}$ . ∎

We have the following corollary by rewriting the conclusions of Proposition 3.2 using the definitions that $E^{(t)}=sI_{m}-m\Phi^{(t)}(I_{n})$ and $F^{(t)}=sI_{n}-n{\Phi^{(t)}}^{*}(I_{m})$ .

Proposition 3.4.

If ${\cal A}^{(0)}$ is ${\epsilon}$ -nearly doubly balanced, then for any $t\geq 0$ ,

[TABLE]

and

[TABLE]

We can use Proposition 3.4 to lower bound the quadratic terms in Lemma 3.1.

Lemma 3.5.

If ${\cal A}^{(0)}$ is ${\epsilon}$ -nearly doubly balanced, then for any $t\geq 0$ ,

[TABLE]

Proof.

By Proposition 3.4 and the fact that $\langle X,Y\rangle\geq 0$ for positive semidefinite matrices $X,Y$ ,

[TABLE]

∎
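At time $t=0$ the lower bound on the quadratic terms follows directly from Definition 2.12: if $\Phi(I_{n})\succeq(1-{\epsilon})\frac{s}{m}I_{m}$ then $\langle E^{2},\Phi(I_{n})\rangle\geq(1-{\epsilon})\frac{s}{m}\operatorname{tr}(E^{2})=(1-{\epsilon})s\Delta_{E}$, and similarly for $F$. The sketch below checks the resulting inequality $\langle E^{2},\Phi(I_{n})\rangle+\langle F^{2},\Phi^{*}(I_{m})\rangle\geq(1-{\epsilon})s\Delta$ on a random real operator, using the smallest admissible ${\epsilon}$; the operator and the variable names are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(10)
k, m, n = 8, 3, 3
A = rng.standard_normal((k, m, n))
s = np.sum(A * A)
P = np.einsum('kia,kja->ij', A, A)           # Phi(I_n)
Pt = np.einsum('kia,kib->ab', A, A)          # Phi^*(I_m)
E = s * np.eye(m) - m * P
F = s * np.eye(n) - n * Pt

# Smallest eps for which the operator is eps-nearly doubly balanced:
# the condition in Definition 2.12 is equivalent to ||E||_op, ||F||_op <= eps * s.
eps = max(np.linalg.norm(E, 2), np.linalg.norm(F, 2)) / s
delta = np.sum(E * E) / m + np.sum(F * F) / n
quadratic = np.sum((E @ E) * P) + np.sum((F @ F) * Pt)
```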

3.3 Upper Bounding the Cross Term

We will first bound the largest singular value of the matrix $M_{{\cal A}}$ for any ${\epsilon}$ -nearly doubly balanced operator ${\cal A}$ . Then, we will use a spectral argument to upper bound the absolute value of the cross term in Lemma 3.1.

Given a non-negative matrix, it is known that the square of the largest singular value is bounded by the product of the maximum row sum and the maximum column sum (see [38]). The proof of this bound is generalized to prove the following lemma.

John Watrous provided a different proof of Lemma 3.6 by generalizing the proof of Theorem 4.27 in his book [61]. We include his proof in Lemma A.2 in Appendix A.

Lemma 3.6.

If ${\cal A}$ is an ${\epsilon}$ -nearly doubly balanced operator, then the largest singular value of its matrix representation $M_{{\cal A}}$ in Definition 1.4 is

[TABLE]

Proof.

Given a vector norm $\left\lVert\cdot\right\rVert$ , we can define an induced matrix norm ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|M\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}:=\sup_{x}\left\lVert Mx\right\rVert/\left\lVert x\right\rVert$ . To prove the lemma, we define the vector norm for vectors in $\mathbb{R}^{n^{2}}$ for any $n$ and its induced matrix norm for matrices in $\mathbb{R}^{m^{2}\times n^{2}}$ for any $m$ as

[TABLE]

where ${\rm mat}(\cdot)$ is the matrixizing operation in Definition 2.4 and $\left\lVert x\right\rVert_{\rm op}$ is the standard operator norm of a matrix.

For a positive semidefinite matrix $H\in\mathbb{R}^{m^{2}\times m^{2}}$ for some $m$ , we can bound its largest eigenvalue by this matrix norm, i.e. $\lambda_{1}(H)\leq{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|H\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\rm op}$ . To see this, let $v\in\mathbb{R}^{m^{2}}$ be an eigenvector with $Hv=\lambda_{1}v$ , then

[TABLE]

We apply this inequality to bound the largest singular value of $M_{{\cal A}}$ , by considering the square matrix $M_{{\cal A}}M_{{\cal A}}^{*}$ and its largest eigenvalue:

[TABLE]

As $M_{{\cal A}}\in\mathbb{R}^{m^{2}\times n^{2}}$ is the natural matrix representation of the completely positive map $\Phi_{{\cal A}}$ defined by the operator ${\cal A}$ ,

[TABLE]

where the second equality is from Definition 2.4 and the last equality is by the theorem [9] that

[TABLE]

By a similar argument, ${\left|\kern-1.07639pt\left|\kern-1.07639pt\left|M_{{\cal A}}^{*}\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}_{\rm op}=\left\lVert\Phi_{{\cal A}}^{*}(I_{m})\right\rVert_{\rm op}$ . Therefore,

[TABLE]

where the last inequality follows from the assumption that ${\cal A}$ is ${\epsilon}$ -nearly doubly balanced in Definition 2.12. Taking the square root on both sides gives the lemma. ∎
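Both the intermediate inequality $\sigma_{1}(M_{{\cal A}})^{2}\leq\left\lVert\Phi_{{\cal A}}(I_{n})\right\rVert_{\rm op}\left\lVert\Phi_{{\cal A}}^{*}(I_{m})\right\rVert_{\rm op}$ and the final bound $\sigma_{1}(M_{{\cal A}})\leq(1+{\epsilon})s({\cal A})/\sqrt{mn}$ can be checked numerically. The sketch below assumes real Kraus operators with $M_{{\cal A}}=\sum_{i}A_{i}\otimes A_{i}$ and uses the smallest ${\epsilon}$ for which the sampled operator is ${\epsilon}$-nearly doubly balanced; the dimensions and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(12)
k, m, n = 6, 3, 4
A = rng.standard_normal((k, m, n))
s = np.sum(A * A)
P = np.einsum('kia,kja->ij', A, A)            # Phi(I_n), m x m
Pt = np.einsum('kia,kib->ab', A, A)           # Phi^*(I_m), n x n

M = sum(np.kron(Ai, Ai) for Ai in A)          # assumed form of the matrix representation
sigma1 = np.linalg.svd(M, compute_uv=False)[0]

norm_product = np.linalg.norm(P, 2) * np.linalg.norm(Pt, 2)
eps = max(np.linalg.norm(s * np.eye(m) - m * P, 2),
          np.linalg.norm(s * np.eye(n) - n * Pt, 2)) / s
lemma_bound = (1 + eps) * s / np.sqrt(m * n)
```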

Lemma 3.6 implies that ${\rm vec}(I_{n})$ is an “approximate” first singular vector of $M_{{\cal A}}$ . By the spectral gap condition in Definition 1.4, it will follow that any vector perpendicular to ${\rm vec}(I_{n})$ has a “small” quadratic form of $M_{{\cal A}}$ , and this can be used to bound the cross term in Lemma 3.1. The following lemma summarizes the spectral argument, which will be used to bound the cross term in the next lemma.

Lemma 3.7.

Let $A\in\mathbb{R}^{m\times n}$ . Let $p\in\mathbb{R}^{m}$ and $q\in\mathbb{R}^{n}$ be unit vectors. Suppose the following assumptions hold:

[TABLE]

Then, for any unit vectors $x\perp p$ and $y\perp q$ , it holds that $|x^{*}Ay|\leq 1+\delta_{1}-\delta_{2}.$

Proof.

First, we show that $p$ and $q$ are highly correlated with the first singular vectors of $A$ . Let $A=\sum_{i}\sigma_{i}u_{i}v_{i}^{*}$ be its singular value decomposition with $\sigma_{1}\geq\sigma_{2}\geq\dots\geq 0$ , where $\{u_{i}\}$ and $\{v_{i}\}$ are orthonormal bases. Write $p$ and $q$ as linear combinations of singular vectors as $p=\sum_{i}c_{i}u_{i}$ and $q=\sum_{i}d_{i}v_{i}$ . We will show that $c_{1}$ and $d_{1}$ are large. Observe that, since $I_{m}\succeq pp^{*}$ ,

[TABLE]

and similarly $\|A^{*}p\|_{2}^{2}\geq 1$ . So we have

[TABLE]

where the last inequality is because $\sum_{i}c_{i}^{2}=\left\lVert p\right\rVert^{2}_{2}=1$ and $\sigma_{2}^{2}\geq\sigma_{j}^{2}$ for $j\geq 2$ . Using our assumptions about $\sigma_{1}$ and $\sigma_{2}$ , it follows that

[TABLE]

which implies that

[TABLE]

By the same calculation, we have $d_{1}^{2}\geq\delta_{2}/(\delta_{1}+\delta_{2})$ .

Next, we show that $x\perp p$ and $y\perp q$ are not highly correlated with the first singular vectors. Write $x=\sum_{i}\alpha_{i}u_{i}$ and $y=\sum_{i}\beta_{i}v_{i}$ with $\sum_{i}\alpha_{i}^{2}=\left\lVert x\right\rVert^{2}_{2}=1$ and $\sum_{i}\beta_{i}^{2}=\left\lVert y\right\rVert^{2}_{2}=1$ . We will show that $\alpha_{1}$ and $\beta_{1}$ are small. Since $\langle x,p\rangle=0$ by our assumption,

[TABLE]

By the same calculation, we have $\beta_{1}^{2}\leq\delta_{1}/(\delta_{1}+\delta_{2}).$

Finally, we bound the absolute value of the quadratic form

[TABLE]

Using our assumptions on $\sigma_{1}$ and $\sigma_{2}$ and Cauchy-Schwarz inequality,

[TABLE]

Putting in the upper bounds on $\alpha_{1}^{2}$ and $\beta_{1}^{2}$ derived above, we conclude

[TABLE]

∎

We use Lemma 3.7 to bound the cross term in Lemma 3.1.

Lemma 3.8.

If ${\cal A}$ satisfies the spectral gap condition in Definition 1.4 with the additional assumption that $\sigma_{1}(M_{{\cal A}})\leq(1+\delta)s({\cal A})/\sqrt{mn}$ for $\delta\leq 1$ , then

[TABLE]

Proof.

Note that the cross term

[TABLE]

where the first equality is by Fact 2.9(3) and the second equality is by the definition of matrix representation in Definition 2.4.

To prove the lemma, we apply Lemma 3.7 with

[TABLE]

and

[TABLE]

Clearly, $p,q$ are unit vectors, and $x,y$ are also unit vectors as $\left\lVert x\right\rVert_{2}=\left\lVert E\right\rVert_{F}/\sqrt{m\Delta_{E}}=1$ by Definition 2.14 and similarly $\left\lVert y\right\rVert_{2}=1$ . Note that $x\perp p$ as $\langle x,p\rangle=\langle E,I_{m}\rangle/(m\sqrt{\Delta_{E}})$ and $\langle E,I_{m}\rangle=\operatorname{tr}(E)=0$ from Definition 2.14, and similarly $y\perp q$ .

We check the assumptions of Lemma 3.7. By the additional assumption,

[TABLE]

and so we can set $\delta_{1}:=2\delta+\delta^{2}$ . By the spectral gap condition in Definition 1.4,

[TABLE]

and so we can set $\delta_{2}:=2\lambda-\lambda^{2}$ . Also, we check that

[TABLE]

where the second equality is from Definition 2.4 and the last equality is from Definition 2.11.

Therefore, we can conclude from Lemma 3.7 that

[TABLE]

Finally, we complete the proof using the inequality $\sqrt{\Delta_{E}\Delta_{F}}\leq(\Delta_{E}+\Delta_{F})/2=\Delta/2$ , and $\delta\leq 1$ by our assumption, and $\lambda\leq 1$ by definition. ∎

3.4 Lower Bounding the Convergence Rate

Putting the bounds in Lemma 3.5 and Lemma 3.8 into Lemma 3.1, we obtain the following lower bound on the convergence rate of $\Delta$ at any time $t$ .

Proposition 3.9.

If ${\cal A}^{(0)}$ is ${\epsilon}$ -nearly doubly balanced and the matrix representation $M_{{\cal A}^{(t)}}$ of ${\cal A}^{(t)}$ satisfies the spectral conditions that

[TABLE]

then

[TABLE]

Note that Proposition 3.9 implies that the dynamical system has linear convergence at time $t=0$ . To see this, note that $\delta^{(0)}\leq{\epsilon}$ by Lemma 3.6, and $\lambda^{(0)}=\lambda$ from Definition 1.4, and therefore

[TABLE]

Under our assumption that $\lambda\gg{\epsilon}$ , the dynamical system has linear convergence at time $t=0$ with rate $\lambda s^{(0)}$ .

To prove that the dynamical system has linear convergence with rate $\lambda s^{(0)}$ for all time $t\geq 0$ , we will prove that the quantities in Proposition 3.9 do not change much when we move from ${\cal A}^{(0)}$ to ${\cal A}^{(t)}$ , i.e. $s^{(t)}\approx s^{(0)}$ , $\delta^{(t)}\approx\delta^{(0)}$ , and $\lambda^{(t)}\approx\lambda$ .

To bound the change of the singular values of $M_{{\cal A}^{(t)}}$ , we will bound the condition number of the scaling solutions in the dynamical system in Section 3.5, and then use these bounds to argue about the change of the singular values and establish Theorem 1.5 in Section 3.6.

3.5 Scaling Solutions and Condition Numbers

We first present the results in product integration in Slavik’s book [55] in Section 3.5.1, and then use these results to bound the condition number of the scaling solutions in Section 3.5.2.

3.5.1 Scaling Solutions

The dynamical system in Definition 2.16 describes the change of ${\cal A}$ by a differential equation. The solution to the differential equation can be analyzed using the theory of product integration in [55].

Definition 3.10.

Let $A:[a,b]\to\mathbb{R}^{n\times n}$ be a matrix valued function. A partition $t$ of the interval $[a,b]$ is a sequence of numbers $a=t_{0}<t_{1}<t_{2}<\cdots<t_{m}=b$. Let $\Delta t_{i}=t_{i}-t_{i-1}$ for $i=1,\cdots,m$ and $\Delta t=\max_{i=1,\cdots,m}\Delta t_{i}$. When the limits over all partitions with $\Delta t\to 0$ exist, the left product integral is defined as

[TABLE]

and the right product integral is defined as

[TABLE]
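To make the notion of a product integral concrete, here is a minimal numerical sketch. It assumes the common form in which the integral is the limit of ordered products of factors $I+A(t_{i})\Delta t_{i}$ over a partition; the displayed definition is not reproduced in this text, so this form, the ordering convention, and the function name `product_integral` are assumptions. For a constant matrix $S$ the limit is the matrix exponential $e^{(b-a)S}$, which the sketch uses as a reference.

```python
import numpy as np

def product_integral(A, a, b, steps):
    """Approximate a product integral of A over [a, b] by multiplying the
    factors (I + A(t_i) * dt) on a uniform partition, newest factor on the left.
    (Assumed convention; the other ordering gives the other one-sided integral.)"""
    n = A(a).shape[0]
    dt = (b - a) / steps
    P = np.eye(n)
    for i in range(steps):
        t = a + i * dt
        P = (np.eye(n) + A(t) * dt) @ P
    return P

# Constant symmetric integrand: the product integral equals exp((b - a) S).
S = np.array([[0.3, -0.1], [-0.1, 0.2]])
w, V = np.linalg.eigh(S)
exact = V @ np.diag(np.exp(w)) @ V.T        # exp(S) via eigendecomposition

approx = product_integral(lambda t: S, 0.0, 1.0, 20000)
err = np.linalg.norm(approx - exact, 2)
print(err)
```

The error shrinks as the partition is refined, mirroring the limit $\Delta t\to 0$ in the definition.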

Theorem 3.11 (Theorem 2.5.1 in [55]).

If $P,Q:[a,b]\to\mathbb{R}^{n\times n}$ are continuous matrix functions, then the product integrals

[TABLE]

exist and satisfy the equations

[TABLE]

for every $x\in[a,b]$.

Applying Theorem 3.11 with $P(t)=E^{(t)}$ , $Q(t)=F^{(t)}$ , $Y(x)=L^{(T)}$ and $Z(x)=R^{(T)}$ , we can explicitly describe the scaling matrices of the dynamical system.

Corollary 3.12.

The solution to the dynamical system in Definition 2.16 is $A_{i}^{(T)}=L^{(T)}A_{i}^{(0)}R^{(T)}$ where

[TABLE]

We are interested in bounding the condition number of $L^{(T)}$ and $R^{(T)}$ .

Definition 3.13 (Condition Number).

The condition number of a matrix $A$ is defined as

$$\kappa(A):=\frac{\sigma_{\max}(A)}{\sigma_{\min}(A)},$$

where $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ are the maximum singular value and the minimum singular value of $A$ respectively.
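As a small illustration of this definition, the condition number can be computed from the singular values. The matrix below is a hypothetical scaling matrix close to the identity, chosen only for the example.

```python
import numpy as np

def condition_number(A):
    """kappa(A) = sigma_max(A) / sigma_min(A), computed from the SVD."""
    s = np.linalg.svd(A, compute_uv=False)   # singular values, in descending order
    return s[0] / s[-1]

# Hypothetical scaling matrix with ||L - I||_op = 0.1.
L = np.eye(3) + 0.1 * np.diag([1.0, 0.0, -1.0])
kappa = condition_number(L)
print(kappa)
```

When $\left\lVert L-I\right\rVert_{\rm op}\leq\ell<1$, the singular values lie in $[1-\ell,1+\ell]$, so $\kappa(L)\leq(1+\ell)/(1-\ell)$; the example attains this with $\ell=0.1$.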

The following theorem in Slavik [55] will be used to bound $\kappa(L^{(T)})$ and $\kappa(R^{(T)})$ .

Theorem 3.14 (Corollary 3.4.3 in [55]).

If $P,Q:[a,b]\to\mathbb{R}^{n\times n}$ are Riemann integrable functions, then

[TABLE]

Applying Theorem 3.14 with $Q(x)=E^{(t)}$ and $P(x)=0$ , we have the following bound of the maximum and minimum eigenvalues of $L^{(T)}$ .

Corollary 3.15.

For any $T\geq 0$ ,

[TABLE]

This corollary will be used to bound the condition number of $L^{(T)}$ in Lemma 3.16, which will then be used to bound the condition number of $R^{(T)}$ in Lemma 3.18.

3.5.2 Bounding the Condition Number

To bound the condition number, we use Corollary 3.15 and bound the integral in the exponent. To bound the integral, we divide the time into two phases. In the first phase, we use Proposition 3.2 to argue that $\left\lVert E^{(t)}\right\rVert_{\rm op}\approx\left\lVert E^{(0)}\right\rVert_{\rm op}$ . In the second phase, we use that $\Delta^{(t)}$ is converging linearly to argue that $\left\lVert E^{(t)}\right\rVert_{\rm op}\leq\left\lVert E^{(t)}\right\rVert_{F}\leq\sqrt{m\Delta^{(t)}}$ is converging linearly. In the following lemma, we should think of $g$ as the spectral gap parameter in Definition 1.4.

Lemma 3.16.

Suppose there exists $g>0$ such that for all $0\leq t\leq T$ , it holds that

[TABLE]

If ${\cal A}^{(0)}$ is ${\epsilon}$ -nearly doubly balanced for ${\epsilon}\leq g$ , then

[TABLE]

Proof.

To bound the condition number, we use Corollary 3.15 and bound the integral

[TABLE]

We split the integral into two terms. For the first term, we use Proposition 3.2 to bound

[TABLE]

where the second inequality is by the fact that $s^{(t)}$ is non-increasing from Lemma 2.17. Applying Lemma 2.19 with our assumption that $\mu=gs^{(0)}$ , it follows that

[TABLE]

where the second inequality is by Lemma 2.15, and the last inequality is by our assumption that $g\geq{\epsilon}$ .

For the second term,

[TABLE]

where the second inequality is from the inequality that $\|E^{(t)}\|_{F}^{2}\leq m\Delta^{(t)}$ from Definition 2.14, and the third inequality follows from Lemma 2.19 using the assumption that $\Delta$ is converging linearly with $\mu=gs^{(0)}$ .

We choose

[TABLE]

This implies that

[TABLE]

and so the second term is at most $3{\epsilon}/g$ . The first term is at most $5\tau{\epsilon}s^{(0)}\leq 5{\epsilon}\ln m/g$ , and so Corollary 3.15 implies that

[TABLE]

∎

Remark 3.17.

We have some examples indicating that the $\log m$ term in the condition number is necessary, but we do not have a formal proof for this lower bound at the time of writing.

We cannot use the same argument to bound $\left\lVert R^{(T)}-I\right\rVert_{\rm op}$ , as it will only give us a bound with dependency on $n$ (where we assumed $m\leq n$ ). Instead, we use the bound on $\left\lVert L^{(T)}-I\right\rVert_{\rm op}$ to derive a similar bound on $\left\lVert R^{(T)}-I\right\rVert_{\rm op}$ .

Lemma 3.18.

Suppose there exists $g>0$ such that for all $0\leq t\leq T$ , it holds that

[TABLE]

If ${\cal A}^{(0)}$ is ${\epsilon}$ -nearly doubly balanced for ${\epsilon}\leq g$ , and also ${\epsilon},\ell\leq 1/2$ , then

[TABLE]

Proof.

We would like to bound

[TABLE]

First, we bound $r_{\max}$ . Let $u\in\mathbb{R}^{n}$ be a maximizer such that $\left\lVert R^{(T)}u\right\rVert_{2}=r_{\max}$ and $\left\lVert u\right\rVert_{2}=1$ .

Consider $\langle{\Phi^{(T)}}^{*}(I_{m}),uu^{*}\rangle$ . On one hand, we use Proposition 3.4 to upper bound

[TABLE]

On the other hand, by Fact 2.9(4),

[TABLE]

Since $\left\lVert L^{(T)}-I_{m}\right\rVert_{\rm op}\leq\ell$, all singular values of $L^{(T)}$ are at least $1-\ell$, and thus all eigenvalues of ${L^{(T)}}^{*}L^{(T)}$ are at least $(1-\ell)^{2}$, i.e. ${L^{(T)}}^{*}L^{(T)}\succeq(1-\ell)^{2}I_{m}$. It follows from Fact 2.9(2) that ${\Phi^{(0)}}^{*}\left({L^{(T)}}^{*}L^{(T)}\right)\succeq{\Phi^{(0)}}^{*}\Big{(}(1-\ell)^{2}I_{m}\Big{)}$, and using it in the above equation gives

[TABLE]

where the second inequality uses that ${\cal A}^{(0)}$ is ${\epsilon}$ -nearly doubly balanced. Combining the upper bound and lower bound gives

[TABLE]

where we use the assumptions that ${\epsilon},\ell\leq 1/2$ .

Next, we bound $r_{\min}$ using a similar argument. Let $v\in\mathbb{R}^{n}$ be a minimizer such that $\left\lVert R^{(T)}v\right\rVert_{2}=r_{\min}$ and $\left\lVert v\right\rVert_{2}=1$ . Consider $\langle{\Phi^{(T)}}^{*}(I_{m}),vv^{*}\rangle$ . On one hand, we use Proposition 3.4 to lower bound

[TABLE]

where the second inequality uses the assumption that $\Delta^{(t)}$ is converging linearly for $0\leq t\leq T$ to apply Lemma 2.19 with $\mu=gs^{(0)}$ to obtain

[TABLE]

where the second inequality is by Lemma 2.15 and the last inequality is from the assumption that ${\epsilon}\leq g$ .

On the other hand, by a similar calculation as above with ${L^{(T)}}^{*}L^{(T)}\preceq(1+\ell)^{2}I_{m}$, we obtain

[TABLE]

Combining the upper bound and lower bound gives

[TABLE]

where we used the assumptions that ${\epsilon}$ and $\ell$ are sufficiently small. Therefore, we conclude that

[TABLE]

∎

3.6 Invariance of Linear Convergence

We will first use Lemma 3.16 and Lemma 3.18 to bound the change of the singular values of $M_{{\cal A}^{(t)}}$ . Then, we will combine the previous results to prove Theorem 1.5 that $\Delta^{(t)}$ is converging linearly for all $t\geq 0$ .

To bound the change of the singular values, we use the following inequality.

Lemma 3.19 (Theorem 3.3.16 in [37]).

Let $A$ and $B$ be two $m\times n$ matrices. For any $1\leq k\leq m$ ,

$$|\sigma_{k}(A)-\sigma_{k}(B)|\leq\left\lVert A-B\right\rVert_{\rm op}.$$

The following lemma bounds the change of the singular values after scaling the operator.

Lemma 3.20.

For any $t\geq 0$ , suppose $\left\lVert L^{(t)}-I_{m}\right\rVert_{\rm op}\leq\zeta$ and $\left\lVert R^{(t)}-I_{n}\right\rVert_{\rm op}\leq\zeta$ for some $\zeta\leq 1$ , then

[TABLE]

Proof.

The operator at time $t$ is ${\cal A}^{(t)}=\left(L^{(t)}A_{1}^{(0)}R^{(t)},\ldots L^{(t)}A_{k}^{(0)}R^{(t)}\right)$ . By Fact 2.5, the matrix representation of the operator at time $t$ is

[TABLE]

where the second equality is by Fact 2.9(1). By Lemma 3.19,

[TABLE]

To bound the right hand side, we expand $L\otimes L$ as $(L-I)\otimes(L-I)+(L-I)\otimes I+I\otimes(L-I)+I\otimes I$ and expand $R\otimes R$ similarly. Then $(L^{(t)}\otimes L^{(t)})\cdot M_{{\cal A}^{(0)}}\cdot(R^{(t)}\otimes R^{(t)})-M_{{\cal A}^{(0)}}$ can be written as the sum of fifteen terms, with $M_{{\cal A}^{(0)}}$ cancelled with $(I\otimes I)M_{{\cal A}^{(0)}}(I\otimes I)$ . To bound the operator norm, we use the triangle inequality and bound the sum of the fifteen operator norms. For each term, we use the facts that $\left\lVert A\otimes B\right\rVert_{\rm op}\leq\left\lVert A\right\rVert_{\rm op}\left\lVert B\right\rVert_{\rm op}$ and $\left\lVert ABC\right\rVert_{\rm op}\leq\left\lVert A\right\rVert_{\rm op}\left\lVert B\right\rVert_{\rm op}\left\lVert C\right\rVert_{\rm op}$ to bound its norm. For example,

[TABLE]

Since we assumed that $\left\lVert L^{(t)}-I_{m}\right\rVert_{\rm op}\leq\zeta$ and $\left\lVert R^{(t)}-I_{n}\right\rVert_{\rm op}\leq\zeta$ for some $\zeta\leq 1$, each of these terms is at most $\zeta\left\lVert M_{{\cal A}^{(0)}}\right\rVert_{\rm op}$ and thus we conclude that $\left\lVert M_{{\cal A}^{(t)}}-M_{{\cal A}^{(0)}}\right\rVert_{\rm op}\leq 15\zeta\cdot\left\lVert M_{{\cal A}^{(0)}}\right\rVert_{\rm op}$. ∎
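The perturbation argument in this proof can be sanity-checked numerically. The sketch below assumes real matrices and $M_{{\cal A}}=\sum_{i}A_{i}\otimes A_{i}$; the random operator, the helper `near_identity`, and the value of $\zeta$ are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 3, 4, 5
A0 = rng.standard_normal((k, m, n))                 # hypothetical operator A^(0)
M0 = sum(np.kron(A_i, A_i) for A_i in A0)           # M_{A^(0)}

def near_identity(d, zeta):
    """Return I + P with ||P||_op = zeta (up to rounding)."""
    P = rng.standard_normal((d, d))
    P *= zeta / np.linalg.norm(P, 2)
    return np.eye(d) + P

zeta = 0.05
L = near_identity(m, zeta)
R = near_identity(n, zeta)

# Scaled operator and its matrix representation, computed in two ways.
Mt_direct = sum(np.kron(L @ A_i @ R, L @ A_i @ R) for A_i in A0)
Mt = np.kron(L, L) @ M0 @ np.kron(R, R)

change = np.linalg.norm(Mt - M0, 2)
bound = 15 * zeta * np.linalg.norm(M0, 2)

# Singular values move by at most the operator norm of the perturbation.
s0 = np.linalg.svd(M0, compute_uv=False)
st = np.linalg.svd(Mt, compute_uv=False)
max_shift = np.max(np.abs(st - s0))
print(change, bound, max_shift)
```

The factor $15$ is a worst-case count of terms; in this example the actual change is noticeably smaller than the bound.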

We are ready to put together the results to prove the following theorem which implies Theorem 1.5.

Theorem 3.21.

If ${\cal A}^{(0)}$ is ${\epsilon}$ -nearly doubly balanced and ${\cal A}^{(0)}$ satisfies the $\lambda$ -spectral gap condition in Definition 1.4 with $\lambda^{2}\geq C{\epsilon}\ln m$ for a sufficiently large constant $C$ , then for all $t\geq 0$ it holds that

[TABLE]

Proof.

Recall from Proposition 3.9 the definitions of $\delta^{(t)}$ and $\lambda^{(t)}$ , and $\delta^{(0)}\leq{\epsilon}$ by Lemma 3.6 and $\lambda^{(0)}=\lambda$ from Definition 1.4. Let $T$ be the supremum such that $s^{(t)}\geq(1-{\epsilon})s^{(0)}$ and $\lambda^{(t)}-3\delta^{(t)}\geq\frac{1}{2}(\lambda^{(0)}-3\delta^{(0)})$ . Our goal is to prove that $\Delta^{(t)}$ is converging linearly for $0\leq t\leq T$ and $T$ is unbounded.

First, we show that $\Delta^{(t)}$ is converging linearly for $0\leq t\leq T$ . By Proposition 3.9,

[TABLE]

where in the second inequality we used that $s^{(t)}\geq(1-{\epsilon})s^{(0)}$ and $\lambda^{(t)}-3\delta^{(t)}\geq\frac{1}{2}(\lambda^{(0)}-3\delta^{(0)})$ for $0\leq t\leq T$ . Note that our assumption implies that $\lambda^{(0)}=\lambda\geq C{\epsilon}$ for a sufficiently large constant $C$ as $\lambda\leq 1$ . Since $\delta^{(0)}\leq{\epsilon}$ from Lemma 3.6, it follows that for any $0\leq t\leq T$ ,

[TABLE]

Next, we argue that the size condition and the spectral gap condition will still be maintained beyond time $T$ . For the size change, by Lemma 2.19 with $\mu=\lambda s^{(0)}$ ,

[TABLE]

where the second inequality is by Lemma 2.15 and the last inequality is by $\lambda\geq C{\epsilon}$ for a sufficiently large constant $C$ .

For the change of the second largest singular value, by definition,

[TABLE]

On the other hand, we can upper bound $\sigma_{2}(M_{{\cal A}^{(T)}})-\sigma_{2}(M_{{\cal A}^{(0)}})$ using condition numbers. Using Lemma 3.16 with $g=\lambda$ , $\left\lVert L^{(T)}-I\right\rVert_{\rm op}\leq\exp\left(O({\epsilon}\ln m/\lambda)\right)-1$ . Note that our assumption implies that

[TABLE]

where the implication is by the inequality $e^{x}-1\leq O(x)$ for $x$ close to zero. Then, by Lemma 3.18, we also have $\left\lVert R^{(T)}-I\right\rVert_{\rm op}\leq O\left(\lambda/C\right)$ . Putting these bounds into $\zeta$ of Lemma 3.20, we obtain

[TABLE]

Combining the upper bound and lower bound and using $\delta^{(0)}\leq{\epsilon}$ from Lemma 3.6, it follows that

[TABLE]

where the last inequality is by the assumption that $\lambda\geq C{\epsilon}$ .

For the change of the largest singular value, by Proposition 3.4,

[TABLE]

where the first and last inequalities use that $s^{(T)}\geq(1-{\epsilon})s^{(0)}$ . The same holds for ${\Phi^{(T)}}^{*}$ and these imply that ${\cal A}^{(T)}$ is $3{\epsilon}$ -nearly doubly balanced. By Lemma 3.6, this implies that $\delta^{(T)}\leq 3{\epsilon}$ . Therefore,

[TABLE]

where the second last inequality uses that $C$ is a sufficiently large constant.

Since our dynamical system is continuous, we still have both conditions satisfied at time $T+\eta$ for some $\eta>0$, which contradicts the choice of $T$ as the supremum time up to which both conditions are satisfied. Therefore, $T$ is unbounded and the linear convergence of $\Delta$ is maintained throughout the execution of the dynamical system. ∎

3.7 Condition Number

With the invariance of the linear convergence, we can apply Lemma 3.16 and Lemma 3.18 to bound the condition number of the scaling solutions and prove Theorem 1.7.

Theorem 3.22.

If ${\cal A}^{(0)}$ is ${\epsilon}$ -nearly doubly balanced and ${\cal A}^{(0)}$ satisfies the $\lambda$ -spectral gap condition in Definition 1.4 with $\lambda^{2}\geq C{\epsilon}\log m$ for a sufficiently large constant $C$ , then for any $t\geq 0$ ,

[TABLE]

In particular, these bounds hold for the final scaling solutions $L^{(\infty)}$ and $R^{(\infty)}$ .

Proof.

By Theorem 3.21, $\Delta^{(t)}$ is linearly converging for all time $t$ with rate at least $\lambda s^{(0)}$ . By Lemma 3.16, this implies that

[TABLE]

where we used the assumption that $\lambda^{2}\geq C{\epsilon}\ln m$ and $e^{x}-1\leq O(x)$ for $x$ close to zero. By Lemma 3.18, this implies the same bound on

[TABLE]

Therefore, $\lambda_{\min}(L^{(t)})\geq 1-O({\epsilon}\log m/\lambda)$ and $\lambda_{\max}(L^{(t)})\leq 1+O({\epsilon}\log m/\lambda)$ , and hence

[TABLE]

where we used that ${\epsilon}\log m/\lambda\ll 1$ . The same argument applies to give the same bound for $\kappa(R^{(t)})$ . ∎

3.8 Operator Capacity

Theorem 1.8 follows easily from Theorem 3.21.

Theorem 3.23.

If ${\cal A}^{(0)}$ is ${\epsilon}$ -nearly doubly balanced and ${\cal A}^{(0)}$ satisfies the $\lambda$ -spectral gap condition in Definition 1.4 with $\lambda^{2}\geq C{\epsilon}\ln m$ for a sufficiently large constant $C$ , then

[TABLE]

Proof.

By Theorem 3.21, $\Delta^{(t)}$ is linearly converging for all time $t$ with rate $\lambda s^{(0)}$ . Apply Proposition 2.21 with $\mu=\lambda s^{(0)}$ ,

[TABLE]

where the second inequality is by Lemma 2.15. ∎

3.9 Discrete Gradient Flow

The gradient flow can be discretized to give a polynomial time algorithm with linear convergence when the input has a spectral gap. The analysis follows closely the continuous case, so we will just provide a sketch.

Recall that the gradient flow is defined as

[TABLE]

where $E$ and $F$ are the error matrices (Definition 2.14) of the current operator ${\cal A}$ .

In the discrete case, a natural update step is

[TABLE]

for some small step size $\alpha$, but the problem with this update step is that $\widetilde{{\cal A}}$ may not be a scaling of ${\cal A}$. So we modify the discrete algorithm slightly as follows. In each step, we update

[TABLE]

where $\alpha$ is the step size. This update is to maintain that the current operator is a scaling of the original operator.

We assume that $s=1$ and $\Delta\leq 1$ initially. We will set the step size to be $\alpha=O((m+n)^{-2})$ for the same analysis in the continuous case to go through. With this choice of the step size, we can show that

[TABLE]

by expanding the change of the size $s$ and using the small step size $\alpha$ to argue that the higher order terms are negligible. By a similar but more tedious calculation (since the degree is higher), we can also show that

[TABLE]

where ${\frac{d}{dt}}\Delta$ is the change of $\Delta$ in the continuous case. This is also the step that we need $\alpha=O((m+n)^{-2})$ to hold. Since we know $-{\frac{d}{dt}}\Delta\geq\lambda s\Delta$ , this implies that

[TABLE]

i.e. $\Delta$ is decreasing geometrically with rate $\lambda s$ when the current operator ${\cal A}$ satisfies the spectral condition.

As in the continuous case, we use an inductive argument to prove that the spectral gap condition is maintained to establish that the convergence rate $\lambda s$ is maintained throughout the algorithm. Again, we go through the condition number of the error matrices, and use the arguments in Lemma 3.20 to show that the change of the singular value is

[TABLE]

and it follows that the $\lambda$ -spectral gap condition holds throughout as

[TABLE]

which is negligible when the spectral assumption $(\lambda^{(0)})^{2}\gg{\epsilon}\log(m+n)$ holds initially.

In the discrete algorithm, we will set the step size to be $\alpha=\Theta((m+n)^{-2})$. If the continuous algorithm converges to an $\eta$-approximate solution in time $T$, the discrete algorithm will converge to an $\eta$-approximate solution in $T\cdot\Theta((m+n)^{2})$ iterations, and the dependency on $\eta$ is $\log(1/\eta)$ by Theorem 1.5.

Remark 3.24.

The step size $\alpha=O((m+n)^{-2})$ is chosen for the same analysis as in the continuous case to hold. It is an interesting open question whether the analysis can be extended to constant step size, in particular whether Sinkhorn’s alternating algorithm has the same convergence rate as in the gradient flow.

4 Applications of Matrix Scaling and Operator Scaling

In this section, we show some implications of our results in various applications of the operator scaling problem.

4.1 Matrix Scaling

Given a non-negative matrix $B\in\mathbb{R}^{m\times n}$ , let $s(B):=\sum_{i=1}^{m}\sum_{j=1}^{n}B_{i,j}$ be the size of the matrix, $r_{i}(B):=\sum_{j=1}^{n}B_{i,j}$ be the $i$ -th row sum of $B$ , and $c_{j}(B):=\sum_{i=1}^{m}B_{i,j}$ be the $j$ -th column sum of $B$ . A non-negative matrix is called ${\epsilon}$ -nearly doubly balanced if for every $1\leq i\leq m$ and for every $1\leq j\leq n$ ,

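$$(1-{\epsilon})\frac{s(B)}{m}\leq r_{i}(B)\leq(1+{\epsilon})\frac{s(B)}{m}\quad\text{and}\quad(1-{\epsilon})\frac{s(B)}{n}\leq c_{j}(B)\leq(1+{\epsilon})\frac{s(B)}{n},$$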

and is called doubly balanced when ${\epsilon}=0$. A common setting is when $B$ is an $n\times n$ matrix whose average row sum is equal to one, in which case $s(B)=n$ and the matrix is called “doubly stochastic” when every row sum and every column sum are equal to one.

Definition 4.1 (Matrix Scaling Problem).

We are given a non-negative matrix $B\in\mathbb{R}^{m\times n}$ , and the goal is to find a left diagonal scaling matrix $L\in\mathbb{R}^{m\times m}$ and a right diagonal scaling matrix $R\in\mathbb{R}^{n\times n}$ such that $LBR$ is doubly balanced, or report that such scaling matrices do not exist.
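For readers who want to experiment with the matrix scaling problem, the following is a minimal sketch of Sinkhorn's alternating normalization. This is not the gradient flow analyzed in this paper; it is the classical alternating algorithm, shown only to illustrate what a diagonal scaling solution looks like. The function name `sinkhorn`, the iteration count, the choice to keep the total size equal to $s(B)$, and the strictly positive test matrix are assumptions made for the example.

```python
import numpy as np

def sinkhorn(B, iters=500):
    """Alternately rescale rows and columns of a positive m x n matrix B so that
    diag(l) @ B @ diag(r) has row sums s/m and column sums s/n, with s = B.sum()."""
    m, n = B.shape
    s = B.sum()
    l = np.ones(m)
    r = np.ones(n)
    for _ in range(iters):
        l = (s / m) / (B @ r)      # fix the row sums for the current r
        r = (s / n) / (B.T @ l)    # fix the column sums for the current l
    return l, r

rng = np.random.default_rng(0)
B = rng.uniform(0.5, 1.5, size=(4, 6))   # strictly positive, so a scaling exists
l, r = sinkhorn(B)
scaled = l[:, None] * B * r[None, :]      # same as diag(l) @ B @ diag(r)
s = B.sum()
print(scaled.sum(axis=1), scaled.sum(axis=0))
```

Here `l` and `r` hold the diagonals of the scaling matrices $L$ and $R$ in Definition 4.1. For matrices with zero entries the iteration may fail to converge, or may divide by zero, when no scaling exists.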

Outline: In the following, we will show that the matrix scaling problem can be reduced to the operator scaling problem in Section 4.1.1. Then, we will see that the spectral condition has a simple form in Section 4.1.2, and there is a natural combinatorial condition that implies the spectral condition in Section 4.1.3. We then argue that many random matrices will satisfy our condition in Section 4.1.4. Finally, we see the implications of our results in several applications of matrix scaling, including bipartite matching in Section 4.1.5, permanent lower bound in Section 4.1.6, and optimal transportation in Section 4.1.7.

4.1.1 Reduction to Operator Scaling

The matrix scaling problem is a special case of the operator scaling problem.

Lemma 4.2.

Given a non-negative matrix $B\in\mathbb{R}^{m\times n}$ , let ${\cal A}=(A_{11},\ldots,A_{mn})$ be the operator where each $A_{ij}\in\mathbb{R}^{m\times n}$ for $1\leq i\leq m$ and $1\leq j\leq n$ is the matrix with the $(i,j)$ -th entry equal to $\sqrt{B_{i,j}}$ and all other entries equal to zero. Then, $B$ is ${\epsilon}$ -nearly doubly balanced if and only if ${\cal A}$ is ${\epsilon}$ -nearly doubly balanced. Furthermore, there is a solution to the matrix scaling problem for $B$ if and only if there is a solution to the operator scaling problem for ${\cal A}$ .

Proof.

By construction, $A_{ij}A_{ij}^{*}$ is the $m\times m$ matrix with $B_{ij}$ in the $(i,i)$ -th entry and zero otherwise, and $A_{ij}^{*}A_{ij}$ is the $n\times n$ matrix with $B_{ij}$ in the $(j,j)$ -th entry and zero otherwise. So, $\sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}A_{ij}^{*}$ is the $m\times m$ diagonal matrix where the $i$ -th diagonal entry is the $i$ -th row sum of $B$ , and $\sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}^{*}A_{ij}$ is the $n\times n$ diagonal matrix where the $j$ -th diagonal entry is the $j$ -th column sum of $B$ . Therefore, ${\cal A}$ is ${\epsilon}$ -nearly doubly balanced if and only if $B$ is ${\epsilon}$ -nearly doubly balanced. It should be clear that the square root of a scaling solution $L,R$ to $B$ is also a (diagonal) scaling solution to ${\cal A}$ .

Because of the special structure that each $A_{ij}$ has only one non-zero entry, there is always a scaling solution with $L,R$ being diagonal matrices if a scaling solution exists. To see this, let $L,R$ be a scaling solution to ${\cal A}$ with $\sum_{i,j}(LA_{ij}R)(LA_{ij}R)^{*}=\sum_{i,j}LA_{ij}RR^{*}A_{ij}^{*}L^{*}=sI_{m}/m$ and $\sum_{i,j}(LA_{ij}R)^{*}(LA_{ij}R)=\sum_{i,j}R^{*}A_{ij}^{*}L^{*}LA_{ij}R=sI_{n}/n.$ Define $D_{L}=(L^{*}L)^{1/2}$. We claim that $D_{L},R$ is also a scaling solution to ${\cal A}$ and $D_{L}$ is a diagonal matrix. First, $\sum_{i,j}(D_{L}A_{ij}R)^{*}(D_{L}A_{ij}R)=\sum_{i,j}R^{*}A_{ij}^{*}D_{L}^{*}D_{L}A_{ij}R=\sum_{i,j}R^{*}A_{ij}^{*}L^{*}LA_{ij}R=sI_{n}/n$. Next, it follows from $\sum_{i,j}LA_{ij}RR^{*}A_{ij}^{*}L^{*}=sI_{m}/m$ that $(s/m)(L^{*}L)^{-1}=\sum_{i,j}A_{ij}RR^{*}A_{ij}^{*}$, and this implies that $L^{*}L$ is a diagonal matrix as $\sum_{i,j}A_{ij}RR^{*}A_{ij}^{*}$ is a diagonal matrix because each $A_{ij}$ has only one non-zero entry. Finally, we check that

$\sum_{i,j}(D_{L}A_{ij}R)(D_{L}A_{ij}R)^{*}=D_{L}(\sum_{i,j}A_{ij}RR^{*}A_{ij}^{*})D_{L}^{*}=sD_{L}(L^{*}L)^{-1}D_{L}^{*}/m=sI_{m}/m$ . By the same argument, we can define $D_{R}=(RR^{*})^{1/2}$ so that $D_{L},D_{R}$ is also a scaling solution to ${\cal A}$ and both $D_{L}$ and $D_{R}$ are diagonal matrices. Therefore, we conclude that the matrix scaling problem can be reduced to the operator scaling problem. ∎
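The construction in Lemma 4.2 is easy to verify on a small example. The sketch below builds the operator ${\cal A}$ from a random non-negative matrix $B$ and checks the two diagonal identities used in the proof; the helper name `operator_from_matrix` and the random test matrix are illustrative choices.

```python
import numpy as np

def operator_from_matrix(B):
    """Return the list of matrices A_ij, each with the single non-zero entry
    sqrt(B[i, j]) at position (i, j), as in the reduction of Lemma 4.2."""
    m, n = B.shape
    ops = []
    for i in range(m):
        for j in range(n):
            A_ij = np.zeros((m, n))
            A_ij[i, j] = np.sqrt(B[i, j])
            ops.append(A_ij)
    return ops

rng = np.random.default_rng(2)
m, n = 3, 4
B = rng.uniform(0.0, 1.0, size=(m, n))
ops = operator_from_matrix(B)

row_side = sum(A @ A.T for A in ops)   # sum_ij A_ij A_ij^*  (m x m)
col_side = sum(A.T @ A for A in ops)   # sum_ij A_ij^* A_ij  (n x n)

print(np.allclose(row_side, np.diag(B.sum(axis=1))))   # diagonal of row sums
print(np.allclose(col_side, np.diag(B.sum(axis=0))))   # diagonal of column sums
```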

4.1.2 Spectral Condition

The spectral condition for operator scaling has a simple form for matrix scaling.

Lemma 4.3.

Using the reduction from Lemma 4.2, the spectral condition for operator scaling in Definition 1.4 becomes

$$\sigma_{2}(B)\leq(1-\lambda)\frac{s(B)}{\sqrt{mn}}.$$

Proof.

Note that each $A_{l}\otimes A_{l}\in\mathbb{R}^{m^{2}\times n^{2}}$ has only one non-zero entry $B_{ij}$ , and $M_{{\cal A}}=\sum_{l}A_{l}\otimes A_{l}$ in Definition 1.4 has only an $m\times n$ submatrix with nonzero entries and this submatrix is exactly $B$ . So, the condition that $\sigma_{2}(M_{{\cal A}})\leq(1-\lambda)s(B)/\sqrt{mn}$ becomes $\sigma_{2}(B)\leq(1-\lambda)s(B)/\sqrt{mn}$ . ∎
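The claim that $M_{{\cal A}}$ contains $B$ as its only non-zero submatrix can be checked directly. The sketch below assumes the Kronecker-product indexing used by NumPy, under which the entry $B_{ij}$ lands in row $i\cdot m+i$ and column $j\cdot n+j$ (zero-based); the random matrix is a hypothetical input.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
B = rng.uniform(0.1, 1.0, size=(m, n))

# M_A = sum over (i, j) of A_ij (x) A_ij, where A_ij = sqrt(B_ij) * e_i f_j^T.
M = np.zeros((m * m, n * n))
for i in range(m):
    for j in range(n):
        A_ij = np.zeros((m, n))
        A_ij[i, j] = np.sqrt(B[i, j])
        M += np.kron(A_ij, A_ij)

# Rows i*m + i and columns j*n + j carry the non-zero entries.
rows = [i * m + i for i in range(m)]
cols = [j * n + j for j in range(n)]
sub = M[np.ix_(rows, cols)]

sv_M = np.linalg.svd(M, compute_uv=False)
sv_B = np.linalg.svd(B, compute_uv=False)
print(np.allclose(sub, B), np.allclose(sv_M[: len(sv_B)], sv_B))
```

Since the remaining rows and columns of $M_{{\cal A}}$ are zero, its non-zero singular values coincide with those of $B$, which is why the spectral condition reduces to a condition on $\sigma_{2}(B)$.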

4.1.3 Combinatorial Condition

To better understand the spectral gap condition in the matrix case, we present a natural combinatorial condition that implies the spectral condition.

Definition 4.4 (Edge-Weighted Bipartite Graph and Conductance).

Given a non-negative matrix $B\in\mathbb{R}^{m\times n}$ , we define its edge-weighted bipartite graph $G_{B}$ as follows. In $G_{B}$ , there is one vertex $u_{i}$ for each row $i$ , one vertex $v_{j}$ for each column $j$ , and an edge $ij$ with weight $w_{ij}=B_{ij}$ between $u_{i}$ and $v_{j}$ .

The conductance of an edge-weighted graph $G=(V,E)$ with $w:E\to\mathbb{R}_{\geq 0}$ is defined as

[TABLE]

Using Cheeger’s inequality from spectral graph theory, we can show that $B$ satisfies the spectral gap condition if its edge-weighted bipartite graph has large conductance.

Lemma 4.5.

If $B\in\mathbb{R}^{m\times n}$ is ${\epsilon}$ -nearly doubly balanced for ${\epsilon}\leq 1/2$ , then

[TABLE]

where $G_{B}$ is the edge-weighted bipartite graph of $B$ .

Proof.

The adjacency matrix $A_{G}$ of the edge-weighted bipartite graph $G_{B}$ is $\begin{bmatrix}0&B\\ B^{*}&0\end{bmatrix}$ . Note that if $\sum_{i}\sigma_{i}x_{i}y_{i}^{*}$ is the singular value decomposition of $B$ , then $A_{G}$ has eigenvalues $\{\pm\sigma_{i}\}$ and eigenvectors $\{(x_{i},\pm y_{i})\}$ . Therefore, $\sigma_{2}(B)=\lambda_{2}(A_{G})$ where $\lambda_{2}(A_{G})$ is the second largest eigenvalue of $A_{G}$ .

To relate $\sigma_{2}(B)$ to the conductance $\phi(G_{B})$ , we will consider the normalized adjacency matrix of $A_{G}$ and apply Cheeger’s inequality. The normalized adjacency matrix ${\mathbb{A}}$ of a matrix $A$ is defined as ${\mathbb{A}}:=D^{-1/2}AD^{-1/2}$ where $D$ is the diagonal degree matrix with $D_{i,i}:=\sum_{j}A_{i,j}$ . For $A_{G}$ , note that $D_{G}:=\begin{bmatrix}R&0\\ 0&C\end{bmatrix}$ , where $R\in\mathbb{R}^{m\times m}$ is the diagonal matrix with the $(i,i)$ -th entry being the $i$ -th row sum $r_{i}(B)$ of $B$ and $C\in\mathbb{R}^{n\times n}$ is the diagonal matrix with the $(j,j)$ -th entry being the $j$ -th column sum $c_{j}(B)$ of $B$ . Then,

[TABLE]

Let ${\mathbb{B}}=R^{-1/2}BC^{-1/2}$ . Note that $\sigma_{2}({\mathbb{B}})=\lambda_{2}({\mathbb{A}}_{G})$ by the argument in the first paragraph. Each entry of ${\mathbb{B}}$ is

[TABLE]

where we used the assumptions that $B$ is ${\epsilon}$ -nearly doubly balanced and ${\epsilon}\leq 1/2$ . Hence, we can write ${\mathbb{B}}=(\sqrt{mn}/s)B+{\mathcal{E}}$ , where ${\mathcal{E}}$ is the “error” matrix with $|{\mathcal{E}}_{ij}|\leq 2{\epsilon}\sqrt{mn}B_{ij}/s$ for all $i,j$ . By Lemma 3.19, $(\sqrt{mn}/s)\cdot\sigma_{2}(B)\leq\sigma_{2}({\mathbb{B}})+\left\lVert{\mathcal{E}}\right\rVert_{\rm op}$ . By the fact that the square of the largest singular value is at most the maximum row sum times the maximum column sum,

[TABLE]

where the last inequality uses that $r_{i}(B)\leq(1+{\epsilon})s/m$ for $1\leq i\leq m$ and $c_{j}(B)\leq(1+{\epsilon})s/n$ for $1\leq j\leq n$ . Finally, Cheeger’s inequality states that $\phi(G)\leq\sqrt{2(1-\lambda_{2}({\mathbb{A}}_{G}))}$ . Therefore, we conclude that

[TABLE]

∎
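Two facts used in this proof can be verified numerically: the top eigenvalues of the bipartite adjacency matrix equal the top singular values of $B$, and the normalized matrix ${\mathbb{B}}=R^{-1/2}BC^{-1/2}$ has largest singular value one. The following sketch uses a hypothetical random positive matrix; variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 4, 5
B = rng.uniform(0.1, 1.0, size=(m, n))

# Adjacency matrix of the edge-weighted bipartite graph G_B.
A_G = np.block([[np.zeros((m, m)), B], [B.T, np.zeros((n, n))]])

eig = np.sort(np.linalg.eigvalsh(A_G))[::-1]     # eigenvalues, descending
sv = np.linalg.svd(B, compute_uv=False)          # singular values, descending
print(eig[:2], sv[:2])                            # lambda_1, lambda_2 vs sigma_1, sigma_2

# Normalized matrix: divide entry (i, j) by sqrt(r_i(B) * c_j(B)).
r = B.sum(axis=1)
c = B.sum(axis=0)
B_norm = B / np.sqrt(np.outer(r, c))
sv_norm = np.linalg.svd(B_norm, compute_uv=False)
print(sv_norm[0])                                 # equals 1 up to rounding

# Spectral gap of the normalized adjacency matrix, as used in Cheeger's inequality.
gap = 1.0 - sv_norm[1]
```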

4.1.4 Random Matrices

One source of matrices satisfying the spectral condition is random matrices. If we generate $B\in\mathbb{R}_{\geq 0}^{m\times n}$ as a random bipartite graph (e.g. each entry is one with probability $p$ independently), then the resulting graph has $\phi(G_{B})=\Omega(1)$ with high probability by standard probabilistic method. Also, $B$ is ${\epsilon}$ -nearly doubly balanced for small ${\epsilon}$ by standard concentration inequality (e.g. ${\epsilon}=O(\sqrt{\log m/(pm)})$ in the above example). So, by Lemma 4.5, the $\lambda$ in Lemma 4.3 is $\Omega(1)$ , which implies that the assumption $\lambda^{2}\geq C{\epsilon}\ln m$ in Theorem 1.5 is satisfied with high probability. We can then apply our results to conclude that for those matrices:

1. The continuous operator scaling algorithm converges to an $\eta$-nearly doubly balanced solution in time $t=O(\log(m/\eta))$.

2. The condition number of the scaling solution is $O(1)$ from Theorem 1.7.

3. The capacity of the matrix is close to $s$ from Theorem 1.8.
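The quantities in this discussion can be measured on a sampled matrix. The sketch below draws a Bernoulli random matrix and reports the balance parameter ${\epsilon}$ and the spectral gap parameter $\lambda$ from Lemma 4.3; the dimensions, the probability $p$, and the seed are arbitrary choices. The constant $C$ is unspecified, and at small sizes such as this one the inequality $\lambda^{2}\geq C{\epsilon}\ln m$ need not hold, so the sketch only reports the quantities.

```python
import numpy as np

rng = np.random.default_rng(5)
m = n = 200
p = 0.5
B = (rng.random((m, n)) < p).astype(float)   # each entry is 1 with probability p

s = B.sum()
row_dev = np.abs(B.sum(axis=1) * m / s - 1.0).max()
col_dev = np.abs(B.sum(axis=0) * n / s - 1.0).max()
eps = max(row_dev, col_dev)                  # smallest eps with B eps-nearly doubly balanced

sv = np.linalg.svd(B, compute_uv=False)
lam = 1.0 - sv[1] * np.sqrt(m * n) / s       # largest lambda with sigma_2 <= (1 - lambda) s / sqrt(mn)

print(f"eps = {eps:.3f}, lambda = {lam:.3f}, eps * ln(m) = {eps * np.log(m):.3f}")
```

Increasing $m$ and $n$ with $p$ fixed should make ${\epsilon}$ shrink roughly like $\sqrt{\log m/(pm)}$ while $\lambda$ stays bounded away from zero, which is the regime the text describes.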

Indeed, the assumption $\lambda^{2}\geq C{\epsilon}\ln m$ in Theorem 1.5 should hold for a large class of random non-negative matrices where each entry is an independent random variable with reasonable distribution such as the chi-squared distribution [58], and even for some limited dependent random matrices such as $k$-wise independent random graphs. One can verify the assumption either by using the combinatorial condition in Lemma 4.5, or by bounding the second largest singular value directly using the trace method as in Section 5.

4.1.5 Bipartite Matching

It is known that a matrix $B\in\mathbb{R}^{n\times n}$ can be scaled to arbitrarily close to doubly stochastic if and only if the underlying bipartite graph has a perfect matching [47], and so the decision version of the bipartite perfect matching problem can be reduced to the matrix scaling problem. Moreover, the doubly stochastic scaling solution provides a fractional solution to the perfect matching problem, which can be converted to an integral solution to the perfect matching problem very efficiently using the random walks technique in [23] (see also [48]).

Our results imply that the continuous operator scaling algorithm can be used to find a fractional perfect matching in an almost regular bipartite expander graph.

Corollary 4.6.

Suppose $G=(X,Y;E)$ is a bipartite graph with $|X|=|Y|$ where each vertex $v$ satisfies $(1-{\epsilon})|E|/|X|\leq\deg(v)\leq(1+{\epsilon})|E|/|X|$ for some ${\epsilon}$ . If $\phi(G)^{4}\geq C{\epsilon}\ln|X|$ for some sufficiently large constant $C$ , then the gradient flow converges to an $\eta$ -nearly doubly balanced scaling (i.e. $\eta$ -nearly perfect fractional matching) in time $t=O(\log|X|\log(1/\eta)/\phi^{2}(G))$ .

We remark that our results also imply that the second-order methods for matrix scaling in [13, 2] are near linear time algorithms for the instances in Corollary 4.6. This is because the condition number $\kappa$ of the scaling solution for those instances is a constant by Theorem 1.7 and the algorithms in [13, 2] have time complexity $\widetilde{O}(|E|\log\kappa)$. We also note that classical combinatorial algorithms can achieve a similar running time on the instances in Corollary 4.6.

4.1.6 Permanent Lower Bound

Given a matrix $A\in\mathbb{R}^{n\times n}$ , the permanent is defined as

$$\operatorname{per}(A)=\sum_{\sigma\in S_{n}}\prod_{i=1}^{n}A_{i,\sigma(i)},$$

where $S_{n}$ is the set of all permutations of $n$ elements. Linial, Samorodnitsky, and Wigderson [47] used the matrix scaling algorithm to design a deterministic $e^{n}$-approximation algorithm for computing the permanent of a non-negative $n\times n$ matrix. The algorithm works by scaling the input matrix to a doubly stochastic matrix while keeping track of the change in the permanent, and then using the results on Van der Waerden’s conjecture, that any doubly stochastic matrix has permanent at least $n!/n^{n}$ and at most one, to conclude the $e^{n}$-approximation.
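
The definition of the permanent and the Van der Waerden bounds can be checked directly on tiny instances. The sketch below is a brute-force evaluation over all permutations (it is not the scaling-based algorithm of [47], and the function name is an illustrative assumption); it confirms that the uniform doubly stochastic matrix attains the lower bound $n!/n^{n}$.

```python
from itertools import permutations
from math import factorial, prod

def permanent(A):
    """Permanent by direct summation over all n! permutations (tiny n only)."""
    n = len(A)
    return sum(prod(A[i][sigma[i]] for i in range(n))
               for sigma in permutations(range(n)))

# A doubly stochastic example: every entry of the uniform matrix is 1/n.
n = 4
J = [[1.0 / n] * n for _ in range(n)]
lower = factorial(n) / n ** n   # Van der Waerden lower bound n!/n^n
print(permanent(J), lower)
```

The identity matrix, which is also doubly stochastic, has permanent one and attains the upper bound.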

For matrices satisfying the spectral gap condition in Lemma 4.3 (e.g. random matrices in Section 4.1.4), we can use the capacity lower bound in Theorem 1.8 to argue that the continuous operator scaling algorithm changes the matrix very little, and thus establish a permanent lower bound for those matrices similar to that of Van der Waerden’s conjecture.

To see the proof, we first define the capacity of a matrix.

Definition 4.7 (Matrix Capacity).

Given a matrix $B\in\mathbb{R}^{m\times n}$ , define

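$${\rm cap}(B):=\inf_{x\in\mathbb{R}^{n},\,x>0}\frac{m\left(\prod_{i=1}^{m}(Bx)_{i}\right)^{1/m}}{\left(\prod_{j=1}^{n}x_{j}\right)^{1/n}}.$$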

The following lemma is probably known but it was not stated in the literature.

Lemma 4.8.

Following the reduction in Lemma 4.2 from matrix scaling of $B$ to operator scaling of ${\cal A}$ , we have that ${\rm cap}(B)$ in Definition 4.7 is equivalent to ${\rm cap}({\cal A})$ in Definition 2.20.

Proof.

Recall that the capacity of an operator ${\cal A}$ is defined as

[TABLE]

Using the reduction from Lemma 4.2, given a non-negative matrix $B\in\mathbb{R}^{m\times n}$, we define ${\cal A}=(A_{11},\ldots,A_{mn})$ where each $A_{ij}$ is the matrix with the $(i,j)$-th entry equal to $\sqrt{B_{i,j}}$ and all other entries zero. Then, $\sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}XA_{ij}^{*}$ is the $m\times m$ diagonal matrix with the $(i,i)$-th entry equal to $\sum_{j=1}^{n}B_{i,j}X_{j,j}$. If we let $x\in\mathbb{R}^{n}$ be the vector of the diagonal entries of $X$, then this $(i,i)$-th entry is simply $(Bx)_{i}$, and so the determinant of $\sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}XA_{ij}^{*}$ is $\prod_{i=1}^{m}(Bx)_{i}$. Finally, by Hadamard’s inequality, $\det(X)\leq\prod_{j=1}^{n}X_{j,j}$ for any positive definite matrix $X$. Replacing $X$ by its diagonal part therefore leaves $\det(\sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}XA_{ij}^{*})$ unchanged and does not decrease $\det(X)$, so we can assume the optimizer of ${\rm cap}({\cal A})$ is a diagonal matrix, and thus ${\rm cap}({\cal A})$ simplifies to ${\rm cap}(B)$ in Definition 4.7. ∎
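
The key computation in this proof, that the operator built from $B$ maps a diagonal $X$ to the diagonal matrix with entries $(Bx)_{i}$, can be verified numerically. The following is a minimal sketch with an arbitrary $2\times 3$ example; the helper names and the test matrix are illustrative assumptions.

```python
from math import sqrt

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(col) for col in zip(*P)]

def kraus_operators(B):
    """A_ij has sqrt(B[i][j]) in position (i, j) and zeros elsewhere."""
    m, n = len(B), len(B[0])
    ops = []
    for i in range(m):
        for j in range(n):
            A = [[0.0] * n for _ in range(m)]
            A[i][j] = sqrt(B[i][j])
            ops.append(A)
    return ops

def apply_map(ops, X):
    """Phi(X) = sum over all Kraus operators A of A X A^T."""
    m = len(ops[0])
    total = [[0.0] * m for _ in range(m)]
    for A in ops:
        term = matmul(matmul(A, X), transpose(A))
        for r in range(m):
            for c in range(m):
                total[r][c] += term[r][c]
    return total

B = [[1.0, 2.0, 0.0], [0.5, 0.0, 3.0]]   # illustrative non-negative matrix
x = [2.0, 1.0, 4.0]                       # diagonal entries of X
X = [[x[j] if j == k else 0.0 for k in range(3)] for j in range(3)]
Phi = apply_map(kraus_operators(B), X)
Bx = [sum(B[i][j] * x[j] for j in range(3)) for i in range(2)]
print(Phi, Bx)   # Phi is diagonal with diagonal (Bx)_1, (Bx)_2 up to rounding
```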

We are ready to prove the main result in this subsubsection.

Corollary 4.9.

If a non-negative matrix $B\in\mathbb{R}^{n\times n}$ is ${\epsilon}$ -nearly doubly balanced with $s(B)=n$ and it satisfies the $\lambda$ -spectral gap condition in Definition B.1 with $\lambda^{2}\geq C{\epsilon}\log n$ for some sufficiently large constant $C$ , then

[TABLE]

Proof.

Let $B\in\mathbb{R}^{n\times n}$ be the input non-negative matrix with $s(B)=n$. Find the scaling solution $L,R$ such that $LBR$ is doubly stochastic (i.e. every row sum and every column sum equals one), which is guaranteed to exist under our assumptions. Gurvits [31, 29] defined the (unnormalized) capacity of $B\in\mathbb{R}^{n\times n}$ as

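$$\overline{{\rm cap}}(B):=\inf_{x\in\mathbb{R}^{n},\,x>0}\frac{\prod_{i=1}^{n}(Bx)_{i}}{\prod_{i=1}^{n}x_{i}}.$$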

Note that $\overline{{\rm cap}}(LBR)=\det(L)\cdot\det(R)\cdot\overline{{\rm cap}}(B)$ and also $\operatorname{per}(LBR)=\det(L)\cdot\det(R)\cdot\operatorname{per}(B)$ . Using the fact that $\overline{{\rm cap}}(A)=1$ for a doubly stochastic matrix $A$ [29, 20],

[TABLE]

Note that $\overline{{\rm cap}}(B)=({\rm cap}(B)/n)^{n}$ , and so the results on Van der Waerden’s conjecture imply that

[TABLE]

If $B$ is ${\epsilon}$ -nearly doubly balanced with $s(B)=n$ and $B$ satisfies the spectral gap condition in Definition B.1, then Theorem 1.8 and Lemma 4.8 imply that

[TABLE]

where ${\cal A}$ is the operator in the reduction from Lemma 4.2. Therefore, we conclude that

[TABLE]

∎

Example 4.10.

If $B$ is a random matrix where each entry $B_{ij}$ is an independent random variable $g_{ij}^{2}$ , where $g_{ij}$ is sampled from the normal distribution $N(0,1/n)$ , then $\lambda=\Omega(1)$ and ${\epsilon}=\sqrt{\log n/n}$ with high probability. Hence, the conditions in Corollary 4.9 are satisfied and it follows that

[TABLE]

So, the permanent of a random matrix from this distribution has a Van der Waerden-type lower bound even though the matrix is not doubly stochastic.

Barvinok and Samorodnitsky [6] proved an upper bound on the permanent of these matrices, and together with the lower bound above this implies a subexponential approximation of the permanent for these matrices.

4.1.7 Optimal Transport Distance

Given two probability distributions and a cost function $C$, the optimal transport distance (also known as the earth mover distance) is the minimum cost of moving one distribution to the other under the cost function. When the two probability distributions are discrete, the cost function can be represented as a cost matrix $C$, and the problem of computing the optimal transport distance can be formulated as the assignment problem (i.e. a generalization of the minimum cost perfect matching problem). So the problem can be solved in polynomial time and there is a linear programming formulation for the problem. In large scale data analysis, however, the polynomial time algorithms are not fast enough.

Using the maximum entropy principle, Cuturi [14] proposed to add an entropic regularizer to the linear program, and showed that the optimal solution is the matrix scaling solution to a matrix $K$ associated to $C$ (more precisely $K_{i,j}=\exp(-C_{i,j}/\beta)$ where $\beta$ is a parameter in the regularizer). Cuturi showed that Sinkhorn’s algorithm for matrix scaling is very efficient in computing the optimal solution to the regularized linear program, and he even mentioned that Sinkhorn’s algorithm exhibits linear convergence in practice [14]. Since then the “Sinkhorn distance” has become a popular alternative/approximation to the earth mover distance and is used in computer vision and machine learning research; see the book [52] and the references therein. Theorem 1.5 provides a condition to establish the linear convergence observed, which is satisfied by many random matrices as discussed in Section 4.1.4.
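
To make the scaling step concrete, here is a minimal sketch of Sinkhorn’s alternating normalization applied to $K_{i,j}=\exp(-C_{i,j}/\beta)$. The cost matrix, the value of $\beta$, the uniform marginals and the iteration count are illustrative assumptions; this is the classical discrete iteration, not the continuous gradient flow analyzed in this paper.

```python
from math import exp

def sinkhorn(K, r, c, iters=500):
    """Alternately rescale rows and columns of K to match marginals r and c.
    Returns scaling vectors f, g with P[i][j] = f[i] * K[i][j] * g[j]."""
    m, n = len(K), len(K[0])
    f, g = [1.0] * m, [1.0] * n
    for _ in range(iters):
        f = [r[i] / sum(K[i][j] * g[j] for j in range(n)) for i in range(m)]
        g = [c[j] / sum(K[i][j] * f[i] for i in range(m)) for j in range(n)]
    return f, g

C = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]  # illustrative costs
beta = 0.5                                                # illustrative parameter
K = [[exp(-C[i][j] / beta) for j in range(3)] for i in range(3)]
r = c = [1.0 / 3] * 3                                     # uniform marginals
f, g = sinkhorn(K, r, c)
P = [[f[i] * K[i][j] * g[j] for j in range(3)] for i in range(3)]
print([sum(row) for row in P])   # row sums of the scaled matrix
```

The vectors `f` and `g` play the role of the scaling solutions, and the scaled matrix `P` is the regularized transport plan.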

Also, it is of interest to bound the Sinkhorn distance, which is shown in [14, 52] to be at most

[TABLE]

where $f^{*}$ and $g^{*}$ are the scaling solutions to $K$ and $\beta$ is the regularizer parameter. This result states that the distance is small if the condition number of the scaling solution is small. Theorem 1.7 provides a condition under which the condition number, and hence the Sinkhorn distance, can be bounded.

4.2 Frame Scaling

A frame is a collection of vectors $U=(u_{1},\ldots,u_{n})$ where each $u_{i}\in\mathbb{R}^{d}$ for $1\leq i\leq n$ . The size of a frame $U$ is defined as $s(U):=\sum_{i=1}^{n}\left\lVert u_{i}\right\rVert_{2}^{2}$ . A frame $U$ is called ${\epsilon}$ -nearly doubly balanced if

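$$(1-{\epsilon})\frac{s(U)}{d}I_{d}\preceq\sum_{i=1}^{n}u_{i}u_{i}^{*}\preceq(1+{\epsilon})\frac{s(U)}{d}I_{d}\quad\text{and}\quad(1-{\epsilon})\frac{s(U)}{n}\leq\left\lVert u_{i}\right\rVert_{2}^{2}\leq(1+{\epsilon})\frac{s(U)}{n}\text{ for }1\leq i\leq n,$$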

and is called doubly balanced when ${\epsilon}=0$ .

Definition 4.11 (Frame Scaling Problem).

Given a frame $U=(u_{1},\ldots,u_{n})$ where each $u_{i}\in\mathbb{R}^{d}$, the goal is to find a matrix $M\in\mathbb{R}^{d\times d}$ such that $v_{i}=Mu_{i}/\left\lVert Mu_{i}\right\rVert$ satisfies $\sum_{i=1}^{n}v_{i}v_{i}^{*}=\frac{n}{d}I_{d}$.

Outline: In the following, we will show in Section 4.2.1 that the frame scaling problem can be reduced to the operator scaling problem. Then, we will see in Section 4.2.2 that the spectral condition has a nice form, and explain in Section 4.2.3 that random frames satisfy our condition. Finally, we show a significant implication of our results for the Paulsen problem in Section 4.2.4 and a construction of doubly balanced frames with small inner products in Section 4.2.5.

4.2.1 Reduction to Operator Scaling

The frame scaling problem is a special case of the operator scaling problem.

Lemma 4.12.

Given a frame $U=(u_{1},\ldots,u_{n})$ where each $u_{i}\in\mathbb{R}^{d}$, let ${\cal A}=(A_{1},\ldots,A_{n})$ where each $A_{i}\in\mathbb{R}^{d\times n}$ for $1\leq i\leq n$ is the matrix with the $i$-th column being $u_{i}$ and all other columns equal to zero. Then, $U$ is ${\epsilon}$-nearly doubly balanced if and only if ${\cal A}$ is ${\epsilon}$-nearly doubly balanced. Furthermore, there is a solution to the frame scaling problem for $U$ if and only if there is a solution to the operator scaling problem for ${\cal A}$.

Proof.

By construction, $\sum_{i=1}^{n}A_{i}A_{i}^{*}=\sum_{i=1}^{n}u_{i}u_{i}^{*}\in\mathbb{R}^{d\times d}$ and $\sum_{i=1}^{n}A_{i}^{*}A_{i}=\operatorname{diag}(\{\left\lVert u_{i}\right\rVert_{2}^{2}\}_{i=1}^{n})\in\mathbb{R}^{n\times n}$, and so $U$ is ${\epsilon}$-nearly doubly balanced if and only if ${\cal A}$ is ${\epsilon}$-nearly doubly balanced. If $M\in\mathbb{R}^{d\times d}$ is a solution to the frame scaling problem for $U$, then we can set $L:=M$ and $R:=\operatorname{diag}(\{\left\lVert Mu_{i}\right\rVert^{-1}_{2}\}_{i=1}^{n})$ and see that it is a solution to the operator scaling problem for ${\cal A}$.

If $L$ and $R$ form a solution to the operator scaling problem for ${\cal A}$, then we can use a similar argument as in Lemma 4.2 to show that $L$ and $(RR^{*})^{1/2}$ also form a solution and that $(RR^{*})^{1/2}$ is a diagonal matrix, as ${\cal A}$ has the special structure that each $A_{i}$ has only one non-zero column. This is also proved in Lemma 3.7.4 in [45] so we omit the details. Since we may thus assume that $R$ is diagonal, its $(i,i)$-th entry must necessarily be $\left\lVert Lu_{i}\right\rVert^{-1}_{2}$ for the doubly balanced conditions to be satisfied, and so $M:=L$ is a solution to the frame scaling problem for $U$. ∎
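
The two identities at the start of this proof can be checked on a small frame. The sketch below builds the operators $A_{i}$ from a frame and compares $\sum_{i}A_{i}A_{i}^{*}$ and $\sum_{i}A_{i}^{*}A_{i}$ with the frame operator and the diagonal matrix of squared norms; the helper names and the example frame are illustrative assumptions.

```python
def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(col) for col in zip(*P)]

def add(P, Q):
    return [[a + b for a, b in zip(pr, qr)] for pr, qr in zip(P, Q)]

def frame_to_operator(U):
    """A_i is the d x n matrix whose i-th column is u_i, other columns zero."""
    n, d = len(U), len(U[0])
    return [[[U[i][row] if col == i else 0.0 for col in range(n)]
             for row in range(d)] for i in range(n)]

U = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # illustrative frame: n = 3, d = 2
d, n = 2, 3
left = [[0.0] * d for _ in range(d)]       # accumulates sum_i A_i A_i^T
right = [[0.0] * n for _ in range(n)]      # accumulates sum_i A_i^T A_i
for A in frame_to_operator(U):
    left = add(left, matmul(A, transpose(A)))
    right = add(right, matmul(transpose(A), A))

frame_op = [[sum(u[a] * u[b] for u in U) for b in range(d)] for a in range(d)]
sq_norms = [sum(x * x for x in u) for u in U]
print(left, frame_op)    # both equal sum_i u_i u_i^T
print(right, sq_norms)   # right is diagonal with the squared norms
```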

4.2.2 Spectral Condition

The spectral condition for operator scaling is related to the following Hermitian matrix.

Definition 4.13 (Entrywise Squared Gram Matrix).

Given a frame $U=(u_{1},\ldots,u_{n})$ where each $u_{i}\in\mathbb{R}^{d}$ , the squared Gram matrix $G\in\mathbb{R}^{n\times n}$ is defined as $G_{i,j}=\langle u_{i},u_{j}\rangle^{2}$ for $1\leq i,j\leq n$ .

Note that $G$ is a positive semidefinite matrix. To see this, let $V$ be the $d\times n$ matrix with the $i$ -th column being $u_{i}$ . Then, we can write $G=(V^{*}V)\circ(V^{*}V)$ where $\circ$ denotes the Hadamard (or entrywise) product of two matrices. As $V^{*}V$ is a positive semidefinite matrix, $G$ is a positive semidefinite matrix by the Schur product theorem. The spectral condition in Definition 1.4 translates to the following spectral condition for the squared Gram matrix in the frame scaling case.

Lemma 4.14.

Using the reduction from Lemma 4.12, the spectral condition for operator scaling for ${\cal A}$ in Definition 1.4 becomes

$$\lambda_{2}(G)\leq(1-\lambda)^{2}\frac{s(U)^{2}}{dn},$$

where $\lambda_{2}(G)$ is the second largest eigenvalue of $G$ .

Proof.

Since each $A_{i}$ has only one non-zero column, each $A_{i}\otimes A_{i}$ has only one non-zero column which is $u_{i}\otimes u_{i}\in\mathbb{R}^{d^{2}}$. The matrix $M_{{\cal A}}\in\mathbb{R}^{d^{2}\times n^{2}}$ has only $n$ non-zero columns $(u_{1}\otimes u_{1},\ldots,u_{n}\otimes u_{n})$. Hence, $M_{{\cal A}}^{*}M_{{\cal A}}$ has only an $n\times n$ non-zero submatrix, where the $(i,j)$-th entry is $\langle u_{i}\otimes u_{i},u_{j}\otimes u_{j}\rangle=\langle u_{i},u_{j}\rangle^{2}$. So, the $n\times n$ non-zero submatrix of $M_{{\cal A}}^{*}M_{{\cal A}}$ is exactly $G$. Therefore, $\lambda_{2}(G)=\lambda_{2}(M_{{\cal A}}^{*}M_{{\cal A}})=\sigma_{2}(M_{{\cal A}})^{2}$ and the spectral condition $\sigma_{2}(M_{{\cal A}})\leq(1-\lambda)s({\cal A})/\sqrt{mn}$ is equivalent to $\lambda_{2}(G)\leq(1-\lambda)^{2}s(U)^{2}/(dn)$ as $s({\cal A})=s(U)$ and $m=d$ in the reduction from Lemma 4.12. ∎
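
The identity $\langle u_{i}\otimes u_{i},u_{j}\otimes u_{j}\rangle=\langle u_{i},u_{j}\rangle^{2}$ used in this proof is easy to confirm numerically. The following minimal sketch builds the squared Gram matrix both ways for a small frame; the function names and the example frame are illustrative assumptions.

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def tensor(u, v):
    """Kronecker product of two vectors, of length len(u) * len(v)."""
    return [a * b for a in u for b in v]

def squared_gram(U):
    """Entrywise squared Gram matrix G[i][j] = <u_i, u_j>^2."""
    return [[inner(ui, uj) ** 2 for uj in U] for ui in U]

U = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # illustrative frame: n = 3, d = 2
G = squared_gram(U)
T = [tensor(u, u) for u in U]               # the non-zero columns u_i (x) u_i
G_from_tensors = [[inner(ti, tj) for tj in T] for ti in T]
print(G)
print(G_from_tensors)
```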

4.2.3 Random Frames

In Section 5, we will prove that if we generate $\Omega(d^{4/3})$ random unit vectors, then the resulting frame is ${\epsilon}$ -nearly doubly balanced for ${\epsilon}=O(1/\operatorname{poly}(d))$ and the $\lambda$ in Lemma 4.14 satisfies $\lambda=\Omega(1)$ with high probability. Hence, a random frame generated in this way will satisfy the condition $\lambda^{2}\geq C{\epsilon}\ln d$ and our results apply to these random frames. The proof is by a trace method. We believe that the trace method can be improved to prove that generating $\Omega(d\operatorname{polylog}d)$ random unit vectors will satisfy our condition.

4.2.4 The Paulsen Problem in Random Frames

Given an ${\epsilon}$ -nearly doubly balanced frame $U=(u_{1},\ldots,u_{n})$ with size $s(U)=d$ where each $u_{i}\in\mathbb{R}^{d}$ , the Paulsen problem asks to find a doubly balanced frame $V=(v_{1},\ldots,v_{n})$ that is “close” to $U$ . Given two frames $U,V$ , the squared distance between them is defined as $\operatorname{{\rm dist}^{2}}(U,V)=\sum_{i=1}^{n}\left\lVert u_{i}-v_{i}\right\rVert_{2}^{2}$ . It was an open question whether for every ${\epsilon}$ -nearly doubly balanced frame $U$ with $s(U)=d$ , there is always a doubly balanced frame $V$ with $\operatorname{{\rm dist}^{2}}(U,V)$ bounded by a function only dependent on $d$ and ${\epsilon}$ but independent of $n$ . Recently, this question was answered affirmatively in [45], showing that for any ${\epsilon}$ -nearly doubly balanced frame $U$ with $s(U)=d$ , there is always a doubly balanced frame $V$ with $\operatorname{{\rm dist}^{2}}(U,V)=O(d^{13/2}{\epsilon})$ . Very recently, Hamilton and Moitra [32] proved a stronger bound $O(d^{2}{\epsilon})$ with a much simpler proof. On the other hand, there are examples showing that the best bound is at least $\Omega(d{\epsilon})$ , so the upper bound and the lower bound are within a factor of $d$ .

The Paulsen problem was asked because it is difficult to generate doubly balanced frames and easier to generate nearly doubly balanced frames, but actually not many ways are known to even generate ${\epsilon}$ -nearly doubly balanced frames for small ${\epsilon}$ . Most nearly doubly balanced frames that we know are random frames (e.g. random Gaussian vectors, random unit vectors), which can be shown to be ${\epsilon}$ -nearly doubly balanced for small ${\epsilon}$ by matrix concentration inequalities (see Section 5.1). So, for the Paulsen problem, the inputs of interest are random frames.

We will prove that for a random frame $U$ with $s(U)=d$ that is ${\epsilon}$ -nearly doubly balanced, there is a doubly balanced frame $V$ with $\operatorname{{\rm dist}^{2}}(U,V)=O(d{\epsilon}^{2})$ with high probability, which is much smaller than the worst case $\Omega(d{\epsilon})$ bound. We will also show how this result can be used to generate a frame in which every pair of vectors has small inner product in the next subsubsection.

The proof has two steps. The first step is to show that if we generate $n=\Omega(d^{4/3})$ random unit vectors, then the resulting frame $U$ is ${\epsilon}$ -nearly doubly balanced for ${\epsilon}\leq O(1/\operatorname{poly}(d))$ and also satisfies the spectral gap condition in Lemma 4.14 with $\lambda=\Omega(1)$ . Therefore, the assumption in Theorem 1.5 is satisfied and the continuous operator scaling algorithm has linear convergence. The second step is to show that if the continuous operator scaling algorithm has linear convergence, then the “total movement” to a doubly balanced frame is $O(d{\epsilon}^{2})$ .

The first step will be proved in Section 5. We will prove the second step here. The following lemma states the result in [45] that we will use.

Lemma 4.15 (Theorem 3.3.5, Lemma 3.3.1, Lemma 3.4.3 in [45]).

The dynamical system in Definition 2.16 will move the input operator ${\cal A}^{(0)}$ to a doubly balanced operator ${\cal A}^{(\infty)}$ . For any time $T\geq 0$ ,

[TABLE]

The second step actually holds in the more general operator setting, not just in the frame setting.

Lemma 4.16.

Given an operator ${\cal A}=(A_{1},\ldots,A_{k})$ where $A_{i}\in\mathbb{R}^{m\times n}$ with $m\leq n$ for $1\leq i\leq k$ , if ${\cal A}$ is ${\epsilon}$ -nearly doubly balanced and ${\cal A}$ satisfies the $\lambda$ -spectral gap condition in Definition 1.4 with $\lambda^{2}\geq C{\epsilon}\ln m$ for a sufficiently large constant $C$ , then

[TABLE]

Proof.

Given the assumptions, Theorem 3.21 implies that

[TABLE]

By Lemma 4.15 and the above inequality, for any $T\geq 0$ ,

[TABLE]

where the last inequality is by Lemma 2.15. ∎

Combining the two steps gives the following theorem.

Theorem 4.17.

Let $U=(u_{1},\ldots,u_{n})$ be a random frame with $n=\Omega(d^{4/3})$ , where each $u_{i}\in\mathbb{R}^{d}$ is an independent random vector with $\left\lVert u_{i}\right\rVert_{2}^{2}=d/n$ . Then, with probability at least $0.99$ , there is a doubly balanced frame $V$ with $\operatorname{{\rm dist}^{2}}(U,V)\leq O(d{\epsilon}^{2})$ if $U$ is ${\epsilon}$ -nearly doubly balanced.

Proof.

By Theorem 5.1, the random frame $U$ satisfies the spectral gap condition in Lemma 4.14 with constant $\lambda$ and ${\epsilon}\ll 1/\ln d$ with probability at least $0.99$ . Note that Theorem 5.1 is stated when each $\left\lVert u_{i}\right\rVert_{2}^{2}=1$ but it is easy to see that the nearly doubly balanced condition and the spectral gap condition are unchanged upon scaling the vectors to $\left\lVert u_{i}\right\rVert_{2}^{2}=d/n$ for $1\leq i\leq n$ . By the reduction in Lemma 4.12 and the spectral gap condition in Lemma 4.14, this implies that the condition $\lambda^{2}\geq C{\epsilon}\ln d$ for operator scaling is satisfied and also $s(U)=d$ . Therefore, by Lemma 4.16, the continuous operator scaling algorithm will move $U$ to a doubly balanced frame $V$ with $\operatorname{{\rm dist}^{2}}(U,V)\leq O(d{\epsilon}^{2})$ . ∎

4.2.5 Constructing Frames with Small Inner Products

The original motivation for the Paulsen problem was to construct doubly balanced frames with some additional structure.

Definition 4.18.

A frame $V=\{v_{1},\ldots,v_{n}\}$ is equiangular if $\langle v_{i},v_{j}\rangle^{2}$ is the same for all $i\neq j$ .

For $n=\Theta(d^{2})$ , finding a doubly balanced frame that is also equiangular will have implications for certain informationally complete quantum measurement operators. It is a major open problem in frame theory for which pairs $(n,d)$ such frames exist [57]. The known examples are sporadic and based on group/number-theoretic constructions. We consider a related but more relaxed problem.

Definition 4.19.

A doubly balanced frame is Grassmannian if its angle

[TABLE]

is minimized over all possible doubly balanced frames.

Doubly balanced frames with small angle are useful in constructing erasure codes [36, 56]. The original motivation of the Paulsen problem was to begin with some ${\epsilon}$ -nearly doubly balanced frame $U$ that has small $\theta(U)$ , and see if it could be “rounded” to a nearby doubly balanced frame $V$ still having small $\theta(V)$ . Bounding $\operatorname{{\rm dist}^{2}}(U,V)$ is one way to achieve this goal.

In this section, we use the results in the spectral analysis to construct a doubly balanced frame with small angle. The idea is to start with a random frame $U$ which is ${\epsilon}$ -nearly doubly balanced for small ${\epsilon}$ and has small $\theta(U)$ with high probability, and then use the results in spectral analysis to show that we can scale $U$ to a doubly balanced frame $V$ with $\theta(V)\approx\theta(U)$ .

Theorem 4.20.

For any $n\geq\Omega(d^{4/3})$ , there exists a doubly balanced frame $V=(v_{1},\ldots,v_{n})$ where each $v_{i}\in\mathbb{R}^{d}$ with $\left\lVert v_{i}\right\rVert=1$ and

[TABLE]

Proof.

First, we generate a random frame $U=(u_{1},\ldots,u_{n})$ where each $u_{i}\in\mathbb{R}^{d}$ is an independent random unit vector with $\left\lVert u_{i}\right\rVert=1$ . By Lemma 5.3 and Theorem 5.1, $U$ is ${\epsilon}$ -nearly doubly balanced for ${\epsilon}\leq O(\sqrt{d\log d/n})$ and satisfies the $\lambda$ -spectral gap condition with $\lambda=\Omega(1)$ with probability at least $0.99$ . Next, we bound $\theta(U)$ using the following fact.

Fact 4.21 ([34]).

Let $x\in S^{d-1}$ be a fixed unit vector. For a random unit vector $u\sim S^{d-1}$ ,

[TABLE]

Choosing a large enough upper bound and applying the union bound, it follows from the above fact and rotational invariance that

[TABLE]

By Theorem 3.21 and the reduction in Lemma 4.12, there is a left scaling matrix $L\in\mathbb{R}^{d\times d}$ and a right diagonal scaling matrix $R\in\mathbb{R}^{n\times n}$ such that if we set $v_{i}=Lu_{i}R_{ii}$ , then the frame $V=(v_{1},\ldots,v_{n})$ is doubly balanced. By Theorem 3.22, the scaling solutions $L,R$ satisfy

[TABLE]

Using the arguments as in Lemma 3.20 (or Lemma B.17), we have

[TABLE]

Therefore, we conclude that

[TABLE]

∎

For example, when $n=\Theta(d^{2})$ the above theorem gives $\theta(V)\leq O(\log^{3}d/d)$, and when $n=\Theta(d^{2}\log^{2}d)$ it gives $\theta(V)\leq O(\log d/d)$.

4.3 Operator Scaling

The operator scaling problem was used to compute the Brascamp-Lieb constant [21] and the non-commutative rank of a symbolic matrix [20]. It is also used in [1] to solve the orbit intersection problem for the left-right group action.

4.3.1 Brascamp-Lieb Constants

A Brascamp-Lieb datum is specified by an $m$ -tuple ${\bf B}=\{B_{j}:\mathbb{R}^{n}\to\mathbb{R}^{n_{j}}\mid 1\leq j\leq m\}$ of linear transformations and an $m$ -tuple of exponents ${\bf p}=\{p_{1},\ldots,p_{m}\}$ . The Brascamp-Lieb constant ${\rm BL}({\bf B},{\bf p})$ of this datum is defined as the smallest $C$ such that for every $m$ -tuple $\{f_{j}:\mathbb{R}^{n_{j}}\to\mathbb{R}_{\geq 0}\mid 1\leq j\leq m\}$ of non-negative integrable functions, we have

[TABLE]

For this inequality to be scale invariant in $\{f_{1},\ldots,f_{m}\}$ , we must have $\sum_{j}p_{j}n_{j}=n$ . This is a common generalization of many useful inequalities; see [8, 21].

The important point we need is that the optimizers of any non-degenerate Brascamp-Lieb datum (i.e. the functions $f_{1},\ldots,f_{m}$ for which the inequality is tight) are density functions of appropriately centered Gaussians [46], and this implies that the Brascamp-Lieb constant ${\rm BL}({\bf B},{\bf p})$ can be written as the following optimization problem:

[TABLE]

which is closely related to the capacity of an operator.

A BL-datum is called geometric if we have:

[TABLE]

It is proved in [4, 5] that the BL-constant is one when the BL-datum is geometric. We will show that the BL-constant is small when the BL-datum is nearly geometric and satisfies a spectral condition, using the reduction in [21] from BL-constant to operator capacity and our capacity lower bound in Theorem 1.8.

Reduction: We describe the reduction in [21] from computing the BL-constant of a datum to computing the capacity of an operator. Let $p_{j}=c_{j}/d$ be rational numbers where $c_{j}$ and $d$ are integers. Given a BL-datum $({\bf B},{\bf p})$, a completely positive map $\Phi_{{\cal A}}:\mathbb{R}^{nd\times nd}\to\mathbb{R}^{n\times n}$ is constructed as follows. For intuition, think of the “intended” input matrix $X$ to $\Phi_{{\cal A}}$ as a block diagonal matrix, with $c_{j}$ blocks of $X_{j}\in\mathbb{R}^{n_{j}\times n_{j}}$ for $1\leq j\leq m$, so that $X$ is a square matrix with dimension $\sum_{j=1}^{m}c_{j}n_{j}=d\sum_{j=1}^{m}p_{j}n_{j}=dn$. For each $B_{j}\in\mathbb{R}^{n_{j}\times n}$ in ${\bf B}$, we create $c_{j}$ matrices $\{A_{j1},\ldots,A_{jc_{j}}\}$ in ${\cal A}$, where each $A_{ji}\in\mathbb{R}^{n\times dn}$ has a copy of $B_{j}/\sqrt{d}$ that acts only on the $(j,i)$-th principal block of $X$ (i.e. the $i$-th copy of $X_{j}$ in $X$) and all other entries of $A_{ji}$ are zero. The operator ${\cal A}$ is defined by the Kraus operators $\cup_{j=1}^{m}\cup_{i=1}^{c_{j}}\{A_{ji}\}$, with the completely positive map

[TABLE]

where $X_{ji}$ is the $(j,i)$-th principal block of $X$ as described above, and the notation $\oplus$ denotes the direct sum of the matrices (i.e. putting each matrix in a diagonal block).
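
The construction can be sketched in a few lines. In the sketch below, each $A_{ji}$ carries $B_{j}^{*}/\sqrt{d}$ in its $(j,i)$-th column block; the transpose is our reading of “a copy of $B_{j}/\sqrt{d}$”, needed for the dimensions $n\times dn$ to match, and should be treated as an assumption. The function names, the parameter name `denom` (standing for the common denominator $d$) and the rank-one example with three unit vectors at $120^{\circ}$ in $\mathbb{R}^{2}$ are likewise illustrative.

```python
from math import cos, pi, sin, sqrt

def transpose(P):
    return [list(col) for col in zip(*P)]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def add(P, Q):
    return [[a + b for a, b in zip(pr, qr)] for pr, qr in zip(P, Q)]

def datum_to_operator(Bs, cs, denom):
    """Kraus operators A_ji (n x dn) for exponents p_j = cs[j] / denom."""
    n = len(Bs[0][0])                                 # each B_j is n_j x n
    width = sum(c * len(B) for B, c in zip(Bs, cs))   # sum_j c_j n_j
    ops, offset = [], 0
    for B, c in zip(Bs, cs):
        n_j = len(B)
        for _ in range(c):
            A = [[0.0] * width for _ in range(n)]
            for row in range(n):
                for col in range(n_j):
                    # entry of B_j^T / sqrt(denom), placed in block (j, i)
                    A[row][offset + col] = B[col][row] / sqrt(denom)
            ops.append(A)
            offset += n_j
    return ops

us = [[cos(2 * pi * j / 3), sin(2 * pi * j / 3)] for j in range(3)]
Bs = [[u] for u in us]                  # B_j = u_j^T, a 1 x 2 matrix
ops = datum_to_operator(Bs, cs=[2, 2, 2], denom=3)   # p_j = 2/3
left = [[0.0] * 2 for _ in range(2)]    # accumulates sum A A^T
right = [[0.0] * 6 for _ in range(6)]   # accumulates sum A^T A
for A in ops:
    left = add(left, matmul(A, transpose(A)))
    right = add(right, matmul(transpose(A), A))
print(left)    # approximately the 2 x 2 identity
print(right)   # approximately (1/3) times the 6 x 6 identity
```

In this example both sums are multiples of the identity, so the resulting operator is doubly balanced.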

Theorem 4.22 ([21]).

It follows from the reduction that

[TABLE]

Using this connection, it is shown in [21] that the Brascamp-Lieb constant ${\rm BL}({\bf B},{\bf p})$ can be computed by an operator scaling algorithm for ${\cal A}$ .

Bounding BL-constants: Using Theorem 4.22, we would like to derive upper bounds on BL-constants using the capacity lower bound in Theorem 1.8, and show that for some random instances the BL-constant is small. To apply Theorem 1.8, we translate the definitions of ${\epsilon}$ -nearly doubly balanced operator and the $\lambda$ -spectral gap conditions to the Brascamp-Lieb setting. Following the reduction from ${\bf B},{\bf p}$ to ${\cal A}$ , we have the following definitions from the corresponding definitions of the operator ${\cal A}$ .

Definition 4.23 (Size of a Datum).

The size of a BL-datum $({\bf B},{\bf p})$ is

[TABLE]

The datum $({\bf B},{\bf p})$ is ${\epsilon}$ -nearly geometric if and only if the corresponding operator ${\cal A}$ is ${\epsilon}$ -nearly doubly balanced.

Definition 4.24 (Nearly Geometric Datum).

A datum $({\bf B},{\bf p})$ is ${\epsilon}$-nearly geometric if

[TABLE]

The datum $({\bf B},{\bf p})$ satisfies the $\lambda$ -spectral gap condition if and only if the corresponding operator ${\cal A}$ satisfies the $\lambda$ -spectral gap condition.

Definition 4.25 (Spectral Gap of Datum).

Let $\bar{n}=\sum_{j=1}^{m}n_{j}$ and ${\bar{B}}^{*}\in\mathbb{R}^{n\times\bar{n}}$ be the matrix

[TABLE]

Let $\bar{B}_{j}\in\mathbb{R}^{\bar{n}\times n}$ be $\bar{B}$ with all but the $j$ -th block zeroed out, i.e. $\bar{B}_{j}^{*}:=[0,\ldots,0,B_{j}^{*},0,\ldots,0]$ . The natural matrix representation $M_{{\bf B},{\bf p}}\in\mathbb{R}^{\bar{n}^{2}\times n^{2}}$ of the datum $({\bf B},{\bf p})$ is defined as

[TABLE]

The datum $({\bf B},{\bf p})$ is said to have a $\lambda$ -spectral gap if

[TABLE]

With these definitions, we can state the Brascamp-Lieb constant upper bound that follows from the capacity lower bound in Theorem 1.8.

Corollary 4.26.

Given a datum $({\bf B},{\bf p})$ with $B_{j}:\mathbb{R}^{n}\to\mathbb{R}^{n_{j}}$ for $1\leq j\leq m$ and $\sum_{j=1}^{m}p_{j}n_{j}=n$, if $({\bf B},{\bf p})$ is ${\epsilon}$-nearly geometric and satisfies the $\lambda$-spectral gap condition with $\lambda^{2}\geq C{\epsilon}\log n$ for some sufficiently large constant $C$, then

[TABLE]

Let’s consider a concrete example to demonstrate the corollary.

Example 4.27.

An interesting special case of the Brascamp-Lieb inequality is the rank one case $B_{j}=u_{j}^{*}$ where $u_{j}\in\mathbb{R}^{d}$, $n_{j}=1$ and $p_{j}=d/m$ for $1\leq j\leq m$, which was studied in [5]. Consider a random rank-one datum where each $u_{j}$ is an independent random unit vector. Following the reduction,

[TABLE]

which is a form that is also studied in approximation algorithms [50]. Note that this is exactly the capacity of a frame $U=(u_{1},\ldots,u_{m})$ through the reduction in Lemma 4.12. By Theorem 5.1, if $m\geq\Omega(d^{4/3})$, then $U$ is ${\epsilon}$-nearly doubly balanced for ${\epsilon}\leq O(\sqrt{d\log d/m})$ and satisfies the $\lambda$-spectral gap condition with $\lambda=\Omega(1)$ with high probability. Therefore, we can apply Theorem 1.8 to conclude that

[TABLE]

and from Corollary 4.26 the BL-constant for this datum is

[TABLE]

This is independent of the number of vectors $m$ and is much smaller than the worst case bound.

As another example, Hastings’ result [35] implies that a random operator where each $A_{i}$ is a random unitary has small Brascamp-Lieb constant with high probability.

4.3.2 Rank Non-Decreasing Operator

In [20, 19, 29], a polynomial time algorithm for computing the non-commutative rank of a symbolic matrix is designed using operator scaling. Given ${\cal A}=(A_{1},\ldots,A_{k})$ where each $A_{i}\in\mathbb{R}^{n\times n}$, let $Z_{{\cal A}}=\sum_{i=1}^{k}x_{i}A_{i}$ be the symbolic matrix defined by ${\cal A}$ over non-commutative variables $x_{1},\ldots,x_{k}$. The non-commutative rank ${\rm nc}$-${\rm rank}(Z)$ of a symbolic matrix $Z$ is defined as the smallest $r$ such that $Z=KM$ where $K$ is of dimension $n\times r$ and $M$ is of dimension $r\times n$ with entries in the “free skew field” of $x$ (see [20, 19] for definitions). The algorithm in [20, 19, 29] is based on the following equivalent characterizations.

Theorem 4.28 ([20, 19, 29]).

Given ${\cal A}=(A_{1},\ldots,A_{k})$ where each $A_{i}\in\mathbb{R}^{n\times n}$ , the following conditions are equivalent.

1. The symbolic matrix $Z_{{\cal A}}$ is singular, i.e. ${\rm nc}$-${\rm rank}(Z_{{\cal A}})<n$.

2. ${\cal A}$ has a shrunk subspace, i.e. there exist subspaces $U,W$ with $\dim(W)<\dim(U)$ such that $A_{i}U\subseteq W$ for all $1\leq i\leq k$.

3. The completely positive linear map $\Phi_{{\cal A}}$ is rank decreasing, i.e. there exists $P\succeq 0$ such that $\operatorname{rank}(\Phi_{{\cal A}}(P))<\operatorname{rank}(P)$.
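
As a small worked illustration of these equivalent conditions (our own example, not taken from [20, 19, 29]), take $n=k=2$ with $A_{1}=e_{1}e_{1}^{*}$ and $A_{2}=e_{1}e_{2}^{*}$, where $e_{1},e_{2}$ are the standard basis vectors of $\mathbb{R}^{2}$. Then

$$Z_{{\cal A}}=x_{1}A_{1}+x_{2}A_{2}=\begin{pmatrix}x_{1}&x_{2}\\0&0\end{pmatrix}=\begin{pmatrix}1\\0\end{pmatrix}\begin{pmatrix}x_{1}&x_{2}\end{pmatrix},\qquad\Phi_{{\cal A}}(I_{2})=A_{1}A_{1}^{*}+A_{2}A_{2}^{*}=\begin{pmatrix}2&0\\0&0\end{pmatrix}.$$

The factorization shows that the non-commutative rank of $Z_{{\cal A}}$ is at most $1<2$. The subspaces $U=\mathbb{R}^{2}$ and $W=\operatorname{span}(e_{1})$ satisfy $A_{i}U\subseteq W$ with $\dim(W)=1<2=\dim(U)$, so ${\cal A}$ has a shrunk subspace. Finally, $\operatorname{rank}(\Phi_{{\cal A}}(I_{2}))=1<2=\operatorname{rank}(I_{2})$, so $\Phi_{{\cal A}}$ is rank decreasing.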

The alternating scaling algorithm for operator scaling is used to check whether $\Phi_{{\cal A}}$ is rank non-decreasing. It is shown in [20, 19, 29] that $\Phi_{{\cal A}}$ is rank non-decreasing if and only if ${\cal A}$ can be scaled to ${\epsilon}$-nearly doubly balanced for ${\epsilon}\leq 1/\operatorname{poly}(n)$, and so a polynomial time algorithm for operator scaling can be used to compute the non-commutative rank of a symbolic matrix over the reals.

The shrunk subspace condition is closely related to the concept of Hall-blocker in matching theory. In the matrix case, it is shown in Lemma 4.5 that a matrix $B$ satisfying the spectral condition is an almost regular bipartite expander graph, so there is no Hall-blocker and it always has a perfect matching as shown in Corollary 4.6. In the operator case, intuitively, the spectral condition is closely related to the notion of quantum expander (Section 2.1), and so there should be no Hall-blocker as well. Theorem 1.5 implies that this is the case.

Corollary 4.29.

Given an operator ${\cal A}$ satisfying the conditions of Theorem 1.5, $\Phi_{{\cal A}}$ is rank non-decreasing and the corresponding symbolic matrix $Z_{{\cal A}}$ is non-singular over the reals.

This is a new sufficient condition for an operator to be rank non-decreasing. We remark that the assumption can be weakened to $\lambda\geq 6{\epsilon}$ to get the same conclusion, but we omit the proof here.

4.3.3 The Operator Paulsen Problem

Given an ${\epsilon}$-nearly doubly balanced operator ${\cal A}=(A_{1},\ldots,A_{k})$ where each $A_{i}\in\mathbb{R}^{m\times n}$, the operator Paulsen problem asks to find a doubly balanced operator ${\mathcal{B}}=(B_{1},\ldots,B_{k})$, where each $B_{i}\in\mathbb{R}^{m\times n}$, with small $\operatorname{{\rm dist}^{2}}({\cal A},{\mathcal{B}}):=\sum_{i=1}^{k}\left\lVert A_{i}-B_{i}\right\rVert_{F}^{2}$. In [45], it was proved that there is always such a ${\mathcal{B}}$ with $\operatorname{{\rm dist}^{2}}({\cal A},{\mathcal{B}})\leq O(mns{\epsilon})$, and this result was used in [1] for the orbit intersection problem. For an operator ${\cal A}$ that satisfies the spectral gap condition with constant $\lambda$, Lemma 4.16 implies a much stronger bound of $\operatorname{{\rm dist}^{2}}({\cal A},{\mathcal{B}})\leq O(s{\epsilon}^{2})$.

5 Spectral Gap of Random Frames

In this section, we prove that a random frame is ${\epsilon}$ -nearly doubly stochastic for ${\epsilon}\ll 1/\ln d$ and satisfies the spectral gap condition for constant $\lambda$ with high probability.

Theorem 5.1.

If we generate $n$ random unit vectors $v_{1},\ldots,v_{n}$ in $\mathbb{R}^{d}$ with $n=\Omega(d^{4/3})$ , then the resulting frame is ${\epsilon}$ -nearly doubly stochastic for ${\epsilon}\ll 1/\ln d$ and satisfies the spectral gap condition in Definition 4.14 with constant $\lambda$ with probability at least $0.99$ .

To generate a random unit vector $v\in\mathbb{R}^{d}$ , we set each of the $d$ coordinates of $v$ to be an independent Gaussian random variable $N(0,\frac{1}{d})$ , and then we scale the vector to have norm one. The size of the frame is $s=\sum_{i=1}^{n}\left\lVert v_{i}\right\rVert_{2}^{2}=n$ . By construction, the frame $V:=(v_{1},\ldots,v_{n})$ satisfies the equal norm condition.
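The sampling procedure can be sketched in a few lines of Python. This is only an illustration: the function names `random_unit_vector` and `random_frame`, the seed, and the sizes `n=50`, `d=8` are assumptions made here, not part of the paper.

```python
import math
import random

def random_unit_vector(d, rng):
    """Draw d independent N(0, 1/d) coordinates, then rescale to Euclidean norm one."""
    g = [rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def random_frame(n, d, seed=0):
    """Return n independent random unit vectors in R^d (seed is an illustrative choice)."""
    rng = random.Random(seed)
    return [random_unit_vector(d, rng) for _ in range(n)]

frame = random_frame(n=50, d=8)
# Size of the frame: s = sum of squared norms, which equals n up to rounding.
size = sum(sum(x * x for x in v) for v in frame)
```

Because every vector is normalized, the equal norm condition holds exactly (up to floating-point error), and only the Parseval-type condition remains random.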

In Section 5.1, we will prove that $V$ is ${\epsilon}$ -nearly doubly stochastic with high probability by using a standard matrix concentration bound. Then, in Section 5.2, we will prove that the squared Gram matrix $G$ in Definition 4.13 satisfies the spectral gap condition in Definition 4.14 with high probability by using the trace method.

5.1 Nearly Doubly Balanced Condition by Matrix Concentration

By construction, each vector $v_{i}$ has $\left\lVert v_{i}\right\rVert_{2}=1$ and $s=\sum_{i=1}^{n}\left\lVert v_{i}\right\rVert_{2}^{2}=n$ . So, for the nearly doubly stochastic condition, it remains to prove that $V=(v_{1},\ldots,v_{n})$ is ${\epsilon}$ -nearly Parseval for ${\epsilon}\ll 1/\log d$ with high probability when $n=\Omega(d^{4/3})$ , i.e.

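$$(1-{\epsilon})\frac{n}{d}I_{d}\preceq\sum_{i=1}^{n}v_{i}v_{i}^{*}\preceq(1+{\epsilon})\frac{n}{d}I_{d}.$$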

We establish this by using the following matrix Bernstein bound.

Theorem 5.2 (Matrix Bernstein [60]).

Let $X_{1},\ldots,X_{n}$ be independent random matrices in $\mathbb{R}^{d\times d}$ . Assume that, for $1\leq i\leq n$ ,

[TABLE]

and

[TABLE]

Then, for all $\ell\geq 0$ ,

[TABLE]

Lemma 5.3.

If we generate $n$ random unit vectors $v_{1},\ldots,v_{n}$ in $\mathbb{R}^{d}$ with $n=\Omega(d\log d/{\epsilon}^{2})$ , then

[TABLE]

with probability at least $1-O(1/\operatorname{poly}(d))$ .

Proof.

To apply the matrix Bernstein bound, we consider the random matrix $X_{i}:=v_{i}v_{i}^{*}-\frac{1}{d}I_{d}$ for $1\leq i\leq n$ . We check the assumptions in Theorem 5.2. First, as the covariance matrix of a Gaussian vector is a scaled identity matrix and we scale it so that $\operatorname{tr}(v_{i}v_{i}^{*})=1$ , we have

[TABLE]

Second, as each $v_{i}v_{i}^{*}$ is of rank one, the operator norm of $X_{i}$ is achieved at $v_{i}$ and

[TABLE]

Finally, as each $X_{i}$ is Hermitian,

[TABLE]

and thus

[TABLE]

Therefore, we can bound the probability that the ${\epsilon}$ -Parseval condition is not satisfied by Theorem 5.2 with $\ell={\epsilon}n/d$ and $L=1-1/d$ , which gives

[TABLE]

Therefore, for ${\epsilon}\leq 1$ , by setting $n\geq\Omega(d\log d/{\epsilon}^{2})$ , this failure probability is at most inverse polynomial in $d$ . ∎
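The concentration statement can be checked numerically on a small instance. The sketch below assumes that the ${\epsilon}$ -nearly Parseval condition reads $(1-{\epsilon})\frac{n}{d}I_{d}\preceq\sum_{i}v_{i}v_{i}^{*}\preceq(1+{\epsilon})\frac{n}{d}I_{d}$ , and it uses the Frobenius norm of $\frac{d}{n}\sum_{i}v_{i}v_{i}^{*}-I_{d}$ as a convenient upper bound on the operator norm, so a small return value certifies the condition. The function name, the seed and the sizes are illustrative choices.

```python
import math
import random

def parseval_deviation(n, d, seed=0):
    """Return (Frobenius norm of (d/n) * sum_i v_i v_i^T - I, trace of sum_i v_i v_i^T)."""
    rng = random.Random(seed)
    S = [[0.0] * d for _ in range(d)]
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g))
        v = [x / norm for x in g]  # random unit vector
        for a in range(d):
            for b in range(d):
                S[a][b] += v[a] * v[b]
    scale = d / n
    frobenius = math.sqrt(sum(
        (scale * S[a][b] - (1.0 if a == b else 0.0)) ** 2
        for a in range(d) for b in range(d)
    ))
    trace = sum(S[a][a] for a in range(d))
    return frobenius, trace

deviation, trace = parseval_deviation(n=3200, d=4)
```

Since the Frobenius norm dominates the operator norm, `deviation <= eps` implies the two-sided bound with that `eps`; the converse need not hold, so this check is conservative.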

For our condition $\lambda^{2}\gg{\epsilon}\log d$ to be satisfied, it is sufficient to have $\lambda=\Omega(1)$ , which we will show in Section 5.2, and ${\epsilon}\ll 1/\log d$ . Lemma 5.3 gives the following bound for the latter condition.

Corollary 5.4.

If we generate $n$ random unit vectors $v_{1},\ldots,v_{n}$ in $\mathbb{R}^{d}$ with $n=\Omega(d\log^{3}d)$ , then

[TABLE]

for ${\epsilon}\ll 1/\log d$ with probability at least $1-O(1/\operatorname{poly}(d))$ .

5.2 Spectral Gap Condition by Trace Method

Our goal is to prove that

[TABLE]

when we generate $n=\Omega(d^{4/3})$ independent random unit vectors $v_{1},\ldots,v_{n}$ .

5.2.1 Trace Method

As in most results from random matrix theory, we use the trace method to bound $\lambda_{2}(G)$ .

Lemma 5.5.

For any natural number $k$ ,

[TABLE]

Proof.

Recall that $G$ is positive semidefinite from Section 4.2.2. Since all the eigenvalues of $G$ are non-negative, for any natural number $k$ , $\lambda_{2}(G)^{k}\leq\operatorname{tr}(G^{k})-\lambda_{1}(G)^{k}$ and thus $\mathbb{E}[\lambda_{2}(G)^{k}]\leq\mathbb{E}[\operatorname{tr}(G^{k})]-\mathbb{E}[\lambda_{1}(G)^{k}]$ . We bound the failure probability by applying Markov’s inequality on the $k$ -th moment of $\lambda_{2}$ so that

[TABLE]

We lower bound the term $\mathbb{E}[\lambda_{1}(G)^{k}]$ by using the test vector $\vec{1}/\sqrt{n}$ so that

[TABLE]

where the second inequality is by Jensen’s inequality on the convex function $f(x)=x^{k}$ for integer $k\geq 1$ . Note that

[TABLE]

where the last equality follows from the independence of $v_{i}$ and $v_{j}$ for $i\neq j$ so that

[TABLE]

Substituting the value of $\langle\vec{1},\mathbb{E}[G]\vec{1}\rangle$ gives $\mathbb{E}[\lambda_{1}(G)^{k}]\geq(1+(n-1)/d)^{k}$ , and thus

[TABLE]

∎

5.2.2 Expanding the Trace

To use the bound in Lemma 5.5, we need to compute $\mathbb{E}(\operatorname{tr}(G^{k}))$ . We expand the trace of $G^{k}$ as

[TABLE]

where the sum runs over all possible length $k$ words with letters in $\{1,\ldots,n\}$ with $i_{k+1}:=i_{1}$ . We interpret each term in the summation as a length $k$ closed walk in the complete graph of $n$ vertices, where $(i_{1},\ldots,i_{k},i_{1})$ are the vertices in the closed walk.
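The closed-walk expansion can be verified by brute force on a tiny example. The sketch below assumes that the squared Gram matrix of Definition 4.13 has entries $G_{ij}=\langle v_{i},v_{j}\rangle^{2}$ (as suggested by the product $\prod_{s}\langle v_{i_{s}},v_{i_{s+1}}\rangle^{2}$ used later); all function names and the sizes `n=4`, `d=3`, `k=4` are illustrative.

```python
import itertools
import math
import random

def squared_gram(frame):
    """Assumed form of the squared Gram matrix: G[i][j] = <v_i, v_j>^2."""
    return [[sum(a * b for a, b in zip(u, v)) ** 2 for v in frame] for u in frame]

def trace_of_power(G, k):
    """tr(G^k) by repeated matrix multiplication."""
    n = len(G)
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        P = [[sum(P[i][l] * G[l][j] for l in range(n)) for j in range(n)]
             for i in range(n)]
    return sum(P[i][i] for i in range(n))

def trace_by_closed_walks(G, k):
    """Sum over all length-k words (i_1, ..., i_k), with i_{k+1} := i_1."""
    n = len(G)
    total = 0.0
    for word in itertools.product(range(n), repeat=k):
        term = 1.0
        for s in range(k):
            term *= G[word[s]][word[(s + 1) % k]]
        total += term
    return total

rng = random.Random(1)
frame = []
for _ in range(4):
    g = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(x * x for x in g))
    frame.append([x / norm for x in g])

G = squared_gram(frame)
by_powers = trace_of_power(G, 4)
by_walks = trace_by_closed_walks(G, 4)
```

The enumeration has $n^{k}$ terms, so it is only meant as a sanity check of the identity, not as a way to compute the trace.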

Let $\{e_{1},\ldots,e_{d}\}$ be an arbitrary orthonormal basis of $\mathbb{R}^{d}$ . To analyze the trace, we write $v_{i_{s}}=\sum_{a=1}^{d}\langle v_{i_{s}},e_{a}\rangle e_{a}$ as a linear combination of the basis vectors, and

[TABLE]

Expanding each term in the product $\prod_{s=1}^{k}\langle v_{i_{s}},v_{i_{s+1}}\rangle^{2}$ this way and further expanding the product, we can write

[TABLE]

We interpret each $a_{s}$ and $b_{s}$ as a color on the edge $(i_{s},i_{s+1})$ for $1\leq s\leq k$ . So, in this interpretation, the trace is summing over all possible closed $k$ -walks on the complete graph of $n$ vertices, and all pairs of edge $d$ -colorings $a,b:[k]\to[d]$ on the edges $(i_{1},i_{2}),\ldots,(i_{k-1},i_{k}),(i_{k},i_{1})$ in the closed $k$ -walk.

To calculate the expected value of the product terms in (5.2), we group the terms based on the vertices involved and use the following basic building block. The proof of the following lemma uses the normalization technique in the proof that $\operatorname{vol}(S^{d-1})=2\pi^{d/2}/\Gamma(d/2)$ in Ball’s survey [4], where $S^{d-1}$ denotes the unit sphere in $\mathbb{R}^{d}$ .

Lemma 5.6.

Let $\vec{q}=(q_{1},\ldots,q_{d})\in\mathbb{Z}_{\geq 0}^{d}$ with $q:=\sum_{i=1}^{d}q_{i}$ . Then

[TABLE]

where $\ell!!=\ell(\ell-2)\cdots(3)(1)$ for an odd number $\ell$ .

Proof.

Let $g\in\mathbb{R}^{d}$ be a random Gaussian vector where each coordinate is an independent Gaussian variable $g_{i}\sim N(0,1)$ . We will compute $\mathbb{E}_{g}\prod_{i=1}^{d}\langle g,e_{i}\rangle^{2q_{i}}$ in two ways to prove the lemma. On one hand,

[TABLE]

where the second equality follows from the formula for the even moments of a standard Gaussian variable (e.g. from Wikipedia). On the other hand, we can compute the same quantity by a change of variables to polar coordinates. Using that the density function of $g$ is $(2\pi)^{-d/2}\exp(-\left\lVert g\right\rVert_{2}^{2}/2)$ ,

[TABLE]

where the factor $r^{d-1}$ appears in the second equality because the sphere of radius $r$ has area $r^{d-1}$ times that of $S^{d-1}$ , and the last equality follows by a change of variable $u=\frac{1}{2}r^{2}$ and $du=rdr$ so that

[TABLE]

where the last equality follows from the definition of the Gamma function that $\Gamma(l):=\int_{0}^{\infty}u^{l-1}e^{-u}du$ . By combining the two equalities for $\mathbb{E}_{g}\prod_{i=1}^{d}\langle g,e_{i}\rangle^{2q_{i}}$ and using the fact that $\operatorname{vol}(S^{d-1})=2\pi^{d/2}/\Gamma(d/2)$ (e.g. from wikipedia), we have

[TABLE]

Using the fact that $\Gamma(l)=(l-1)\cdot\Gamma(l-1)$ and thus $\Gamma(\frac{d}{2}+q)=\Gamma(\frac{d}{2})\cdot(\frac{d}{2}+q-1)\cdot(\frac{d}{2}+q-2)\cdots(\frac{d}{2})$ , this implies that

[TABLE]

and the lemma follows. ∎
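As a sanity check on the moment formula, the following sketch evaluates the closed form that the two computations in the proof combine to, namely $\xi(\vec{q})=\prod_{i}(2q_{i}-1)!!\,/\,\big(d(d+2)\cdots(d+2q-2)\big)$ . This expression is reconstructed here from the proof (it agrees with the value $\xi(2\chi_{1})=3/(d(d+2))$ used in Section 5.2.5), and the function names are illustrative.

```python
from fractions import Fraction

def odd_double_factorial(l):
    """l!! = l (l-2) ... 3 * 1 for odd l, with (-1)!! = 1 for the empty product."""
    result = 1
    while l > 1:
        result *= l
        l -= 2
    return result

def xi(q, d):
    """Moment E prod_i <v, e_i>^(2 q_i) for a uniform unit vector v in R^d.

    q lists the non-zero exponents q_i (its length must be at most d).
    """
    total = sum(q)
    numerator = 1
    for qi in q:
        numerator *= odd_double_factorial(2 * qi - 1)
    denominator = 1
    for j in range(total):
        denominator *= d + 2 * j  # d (d+2) ... (d+2q-2)
    return Fraction(numerator, denominator)

d = 5
second_moment = xi([1], d)      # E <v, e_1>^2
fourth_moment = xi([2], d)      # E <v, e_1>^4
mixed_moment = xi([1, 1], d)    # E <v, e_1>^2 <v, e_2>^2
```

Two consistency checks follow from $\left\lVert v\right\rVert_{2}=1$ : summing the second moments over the $d$ coordinates gives one, and expanding $\big(\sum_{a}\langle v,e_{a}\rangle^{2}\big)^{2}=1$ gives $d\,\xi(2\chi_{1})+d(d-1)\,\xi(\chi_{1,2})=1$ .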

By taking the expectation of (5.1),

[TABLE]

For each closed $k$ -walk $i_{1},\ldots,i_{k},i_{1}$ , we need to compute the expectation of the product term, which is easier for some specific closed $k$ -walks. In the next two subsubsections, we show how to compute the product terms when the closed $k$ -walk forms a tree or a cycle.

5.2.3 Tree Walk

The first simplification is that if there is any self-loop (i.e. $i_{s}=i_{s+1}$ ), then we can just remove the term $\langle v_{i_{s}},v_{i_{s+1}}\rangle^{2}$ from the product as $\left\lVert v_{i_{s}}\right\rVert_{2}=1$ by our construction.

The next simplification is that if the closed $k$ -walk looks like a tree, i.e. the edges $(i_{1},i_{2}),\ldots,(i_{k-1},i_{k}),(i_{k},i_{1})$ form a tree when self-loops are removed and parallel edges are identified to a single edge, then the terms corresponding to each edge in the tree can be computed independently using Lemma 5.6. This is because all non-neighbors in the tree are conditionally independent, and so we can iteratively fix all non-leaf vertices and compute the leaves independently.

Lemma 5.7.

Let $H=(V,E)$ be the graph formed by the edges $(i_{1},i_{2}),\ldots,(i_{k-1},i_{k}),(i_{k},i_{1})$ in a closed $k$ -walk. Suppose $H$ is a tree $T=(V,F)$ when self-loops are removed and parallel edges are identified to a single edge. For each edge $f=(i,j)\in F$ , let $q_{f}$ be the number of parallel edges of $f$ in $H$ . Then,

[TABLE]

where $\xi(q_{f}\chi_{1})$ denotes $\xi((q_{f},0,\ldots,0))$ in Lemma 5.6.

Proof.

We prove this by induction on $|V|$ . When $|V|=2$ , the statement follows from the rotational invariance of the distribution, so that $\mathbb{E}_{u}\mathbb{E}_{v}\langle u,v\rangle^{q}=\mathbb{E}_{u}\langle u,e_{1}\rangle^{q}=\xi(q)$ where $e_{1}$ is the first vector in the orthonormal basis $(e_{1},\ldots,e_{d})$ .

For the inductive step, let $L$ be the set of the leaves of the tree $T$ and $\delta(L)$ be the set of leaf edges in $T$ . By conditional expectation and independence of $v_{i}$ ,

[TABLE]

Since $|V\setminus L|<|V|$ , we can apply the induction hypothesis to obtain that the second term is equal to $\prod_{f\notin\delta(L)}\xi(2q_{f})$ . For the first term, note that each non-leaf vertex is fixed in the conditional expectation, and so by rotational invariance of the distribution and independence of $v_{i}$ ,

[TABLE]

where $\chi_{1}\in\mathbb{R}^{d}$ is the vector with the first entry one and other entries zero. The lemma follows by combining the two terms. ∎

5.2.4 Cycle Walk

We can also compute the expectation of a product term in (5.3) when the closed $k$ -walk is a simple cycle, i.e. the edges $(i_{1},i_{2}),\ldots,(i_{k-1},i_{k}),(i_{k},i_{1})$ form a cycle and the vertices $i_{1},\ldots,i_{k}$ are distinct.

Lemma 5.8.

Suppose the edges $(i_{1},i_{2}),\ldots,(i_{k-1},i_{k}),(i_{k},i_{1})$ form a simple cycle. Then

[TABLE]

Proof.

We use the expansion in (5.2) that

[TABLE]

where $(e_{1},\ldots,e_{d})$ is an orthonormal basis in $\mathbb{R}^{d}$ .

Since $S^{d-1}$ is symmetric under reflection across any hyperplane through the origin, if any term $\langle v,e_{j}\rangle$ appears in a product term on the right hand side with odd degree, then that product term is equal to zero. So, we only focus on those product terms where each $\langle v,e_{j}\rangle$ has even degree. Since the edges $(i_{1},i_{2}),\ldots,(i_{k-1},i_{k}),(i_{k},i_{1})$ form a simple cycle, each vertex $i_{s}$ is involved in exactly four terms $\langle v_{i_{s}},e_{a_{s}}\rangle,\langle v_{i_{s}},e_{b_{s}}\rangle,\langle v_{i_{s}},e_{a_{s-1}}\rangle,\langle v_{i_{s}},e_{b_{s-1}}\rangle$ . We consider two cases of the $d$ -edge-colorings $a_{1},\ldots,a_{k}$ and $b_{1},\ldots,b_{k}$ .

The first case is when $a_{1}\neq b_{1}$ . Then, for $\langle v_{i_{2}},e_{a_{1}}\rangle$ and $\langle v_{i_{2}},e_{b_{1}}\rangle$ to have even degree, we must have $\{a_{2},b_{2}\}=\{a_{1},b_{1}\}$ . The same argument applies to every vertex, and thus we must have $\{a_{i},b_{i}\}=\{a_{j},b_{j}\}$ for $i\neq j$ , i.e. the same two colors appear in every edge in the simple cycle. There are two possibilities for each edge, either $a_{i}=a_{j},b_{i}=b_{j}$ or $a_{i}=b_{j},a_{j}=b_{i}$ . So, for each two colors, there are exactly $2^{k}$ such product terms. For each such product term, there are two colors that appear twice on each vertex, and so each such product term is exactly $\xi(\chi_{1,2})^{k}$ , where $\chi_{1,2}\in\mathbb{R}^{d}$ is the vector with the first two entries one and other entries zero. Therefore, the total contribution of these product terms is

[TABLE]

The second case is when $a_{1}=b_{1}$ . Then, for the terms in $i_{2}$ to have even degree, we must have $a_{2}=b_{2}$ , which could be the same color as $a_{1}=b_{1}$ or a different color. The same argument applies to every vertex, and thus we must have $a_{i}=b_{i}$ for $1\leq i\leq k$ , and so we can think of every edge in the cycle as receiving one color from $d$ colors. For each coloring, let $l$ be the number of vertices with two different colors of degree two (and so $k-l$ is the number of vertices with one color of degree four), then its contribution to the sum is

[TABLE]

To count the number of such colorings, we use the following fact.

Fact 5.9.

The number of proper $d$ -colorings of an $l$ -cycle is $(d-1)^{l}+(-1)^{l}(d-1)$ , where adjacent vertices receive different colors in a proper coloring. Since the line graph of an $l$ -cycle is also an $l$ -cycle, the number of proper $d$ -edge-colorings of an $l$ -cycle is also $(d-1)^{l}+(-1)^{l}(d-1)$ .
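Fact 5.9 is easy to confirm by exhaustive enumeration for small parameters. In the sketch below, the function names and the tested ranges are illustrative; for $l=1$ the cycle is read as a single vertex with a self-loop, which has no proper coloring, and the case $l=0$ is not covered by the brute-force routine.

```python
import itertools

def count_proper_cycle_colorings(l, d):
    """Count colorings of an l-cycle with d colors where cyclically adjacent vertices differ."""
    count = 0
    for colors in itertools.product(range(d), repeat=l):
        if all(colors[i] != colors[(i + 1) % l] for i in range(l)):
            count += 1
    return count

def cycle_coloring_formula(l, d):
    """Closed form from Fact 5.9: (d-1)^l + (-1)^l (d-1)."""
    return (d - 1) ** l + (-1) ** l * (d - 1)

# Compare the two counts on a small grid of (l, d).
agreement = all(
    count_proper_cycle_colorings(l, d) == cycle_coloring_formula(l, d)
    for l in range(1, 7)
    for d in range(1, 5)
)
```

The same count applies to edge-colorings because relabeling the $l$ edges of a cycle as the vertices of its line graph yields an $l$ -cycle again.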

We would like to count the $d$ -edge-colorings of a $k$ -cycle with $l$ vertices with different colors on its two edges and $k-l$ vertices with the same color on its two edges. Notice that once we fix the location of the $l$ vertices with different colors, then the edges between any two such vertices must have the same color, and so we can think of the $k$ -cycle as an $l$ -cycle and each such coloring corresponds to a proper $d$ -edge-coloring of an $l$ -cycle. By enumerating the location of the $l$ vertices and using Fact 5.9, the number of such $d$ -edge-colorings is $\binom{k}{l}\cdot\left((d-1)^{l}+(-1)^{l}(d-1)\right)$ . Therefore, the total contribution of the second case is equal to

[TABLE]

where the last equality is by the binomial theorem. Combining the two cases,

[TABLE]

∎
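The counting step in the second case of the proof, that the number of $d$ -edge-colorings of a $k$ -cycle with exactly $l$ vertices seeing two different colors is $\binom{k}{l}\left((d-1)^{l}+(-1)^{l}(d-1)\right)$ , can also be checked exhaustively. The following sketch does so for small $k$ and $d$ ; the function names and the tested ranges are illustrative.

```python
import itertools
from math import comb

def count_by_color_changes(k, d):
    """counts[l] = number of d-edge-colorings of a k-cycle with exactly l 'change' vertices.

    Edge s joins vertex s and vertex s+1 (cyclically), so vertex s sees edges s-1 and s.
    """
    counts = [0] * (k + 1)
    for colors in itertools.product(range(d), repeat=k):
        changes = sum(colors[s] != colors[s - 1] for s in range(k))
        counts[changes] += 1
    return counts

def predicted_count(k, l, d):
    """Count claimed in the proof: choose the l change vertices, then properly color an l-cycle."""
    return comb(k, l) * ((d - 1) ** l + (-1) ** l * (d - 1))

agreement = all(
    count_by_color_changes(k, d) == [predicted_count(k, l, d) for l in range(k + 1)]
    for k in range(3, 6)
    for d in range(1, 4)
)
```

Summing the predicted counts over $l$ recovers all $d^{k}$ colorings, which is the binomial-theorem step used to simplify the total contribution.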

5.2.5 Fourth Moment Analysis

We can use Lemma 5.7 and Lemma 5.8 to compute $\mathbb{E}\operatorname{tr}(G^{4})$ .

Lemma 5.10.

[TABLE]

Proof.

To compute $\mathbb{E}\operatorname{tr}(G^{4})$ , we only need to consider closed $4$ -walks $(i_{1},i_{2},i_{3},i_{4},i_{1})$ . We do a case analysis on the possible configurations of closed $4$ -walks.

1. There are four self loops, i.e. $i_{1}=i_{2}=i_{3}=i_{4}$ , in which case the contribution is simply one as the vectors are of length one by construction. There are total $n$ possibilities for the location of the self-loops, and so the total contribution in this case is $(L_{4}):=n$ .

2. There are two self loops and a single edge traversed two times. By Lemma 5.7, this graph contributes $\xi(2\chi_{1})=3/d(d+2)$ . There are $\binom{4}{2}$ places to add two self-loops to a single edge and $n(n-1)$ possibilities for the two vertices of the edge, so the total contribution in this case is

[TABLE]

3. The only other case with two distinct vertices is that an edge is traversed four times, and its contribution is $\xi(4\chi_{1})$ by Lemma 5.7. There are $n(n-1)$ for the location of the two vertices, and the total contribution in this case is

[TABLE]

4. There is one self loop and a $3$ -cycle. This graph contributes the same as a $3$ -cycle which is given by Lemma 5.8. There are $4$ places to add the self-loop and $n(n-1)(n-2)$ possibilities for the three vertices of the triangle, so the total contribution in this case is

[TABLE]

5. The only other case with three distinct vertices is two different edges sharing a single common vertex. By Lemma 5.7, this graph contributes $(\xi(2\chi_{1}))^{2}$ . Note that there are two ways to combine, as the two edges could share the starting vertex or the middle vertex. There are $n(n-1)(n-2)$ for the locations of the three vertices, and so the total contribution is

[TABLE]

6. Finally, the only case with four distinct vertices is a $4$ -cycle. There are $n(n-1)(n-2)(n-3)$ possibilities for the locations of the four vertices, and by Lemma 5.8 the total contribution is

[TABLE]

Combining all the cases,

[TABLE]

Taking the factor $n^{4}/d^{4}$ out proves the lemma. ∎
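The multiplicities used in the case analysis can be double-checked by enumerating all $n^{4}$ words for a small $n$ and classifying each closed $4$ -walk by its number of distinct vertices and self-loops. The labels and function names below are illustrative, and the sketch only counts walks; it does not evaluate the expectations.

```python
import itertools
from collections import Counter

def classify_closed_4_walk(word):
    """Assign the closed walk (i1, i2, i3, i4, i1) to one of the six configurations."""
    distinct = len(set(word))
    loops = sum(word[s] == word[(s + 1) % 4] for s in range(4))
    if distinct == 1:
        return "four self-loops"
    if distinct == 2:
        return "two self-loops and an edge twice" if loops == 2 else "edge four times"
    if distinct == 3:
        return "self-loop and 3-cycle" if loops == 1 else "two edges sharing a vertex"
    return "4-cycle"

def count_configurations(n):
    """Count the words in {0, ..., n-1}^4 falling into each configuration."""
    return Counter(
        classify_closed_4_walk(word)
        for word in itertools.product(range(n), repeat=4)
    )

n = 5
counts = count_configurations(n)
```

For $n=5$ the six counts should be $n$ , $\binom{4}{2}n(n-1)$ , $n(n-1)$ , $4n(n-1)(n-2)$ , $2n(n-1)(n-2)$ and $n(n-1)(n-2)(n-3)$ , and they add up to $n^{4}$ , so no configuration is missed or double-counted.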

5.2.6 Proof of Theorem 5.1

We wrap up the fourth moment analysis to prove Theorem 5.1. Using Lemma 5.10 in Lemma 5.5, we have

[TABLE]

where we used $(1+(d-1)/n)^{4}\geq 1+(d-1)/n$ .

For any constant $\lambda$ , by generating $n\gg d^{4/3}$ random unit vectors, the probability that $\lambda_{2}(G)>(1-\lambda)^{2}n/d$ is at most $1/1000$ where the dominating term is $d^{4}/n^{3}$ .

Also, by Corollary 5.4, by generating $n=d\log^{3}d$ random unit vectors, the resulting frame is ${\epsilon}$ -nearly doubly stochastic with failure probability at most inverse polynomial in $d$ .

Therefore, by generating $n\gg d^{4/3}$ random unit vectors, with probability at least $0.99$ , the resulting frame is ${\epsilon}$ -nearly doubly stochastic for ${\epsilon}\ll 1/\log d$ and $\lambda_{2}(G)\leq(1-\lambda)^{2}\cdot n/d$ for any constant $0\leq\lambda<1$ . This proves Theorem 5.1.

Remark 5.11.

We believe that the trace method can be improved to prove the same conclusion with only $O(d\operatorname{polylog}d)$ random unit vectors.

Acknowledgement

We thank John Watrous for providing a proof of Lemma 3.6, and Nick Harvey for providing useful comments that improved the presentation of the paper.

Appendix A Operator Scaling

The following is a proof that the continuous operator scaling algorithm is equivalent to the gradient flow that always moves in the direction of minimizing $\Delta$ .

Lemma A.1.

Given an operator ${\cal A}=(A_{1},\ldots,A_{k})$ where $A_{i}\in\mathbb{R}^{m\times n}$ for $1\leq i\leq k$ , the direction defined by

[TABLE]

minimizes the function

[TABLE]

Proof.

As in Definition 2.14, we write

[TABLE]

Then

[TABLE]

Consider the directional derivative of $\Delta({\cal A})$ in the direction of ${\mathcal{H}}=(H_{1},\ldots,H_{k})$ where each $H_{i}\in\mathbb{R}^{m\times n}$ . For ease of notation, we write $E=E({\cal A})$ , $F=F({\cal A})$ and $s=s({\cal A})$ in the following, with the understanding that these are dependent on ${\cal A}$ and we are moving ${\cal A}$ in the direction ${\mathcal{H}}$ .

[TABLE]

where the third inequality uses the fact that $\operatorname{tr}(E)=0$ and $\operatorname{tr}(F)=0$ as stated in Definition 2.14. It follows that the direction $H_{i}:=EA_{i}+A_{i}F$ minimizes $\Delta({\cal A})$ . ∎

The following is an alternative proof of Lemma 3.6 provided by John Watrous.

Lemma A.2 (Watrous, personal communication).

If ${\cal A}$ is an ${\epsilon}$ -nearly doubly balanced operator, then the largest singular value of its matrix representation $M_{{\cal A}}$ in Definition 1.4 is

[TABLE]

Proof.

The proof is a generalization of the proof of Theorem 4.27 in [61]. As stated in Definition 2.6,

[TABLE]

where $\Phi(Y)$ is as defined in (2.1).

First, we bound the maximum for Hermitian matrices $Y$ . Let $Y=\sum_{k=1}^{n}\lambda_{k}y_{k}y_{k}^{*}$ be an eigenvalue decomposition of $Y$ . Let

[TABLE]

Then, by Cauchy-Schwarz inequality and Hölder’s inequality for Schatten norms for matrices,

[TABLE]

Since $\Phi$ is a positive map, $\rho_{k}=\Phi(y_{k}y_{k}^{*})\succeq 0$ by Fact 2.9(2). It follows that the trace norm of $\rho_{k}$ is simply the trace of $\rho_{k}$ , and so

[TABLE]

where the third equality is by Fact 2.9(3) and the last inequality follows from the assumption that ${\cal A}$ is ${\epsilon}$ -nearly doubly balanced. Therefore,

[TABLE]

where the second inequality is from the assumption that ${\cal A}$ is ${\epsilon}$ -nearly doubly balanced.

For the non-Hermitian case, we use a standard reduction and write $Y=H+iK$ where $H=(Y+Y^{*})/2$ and $K=(Y-Y^{*})/2i$ are Hermitian matrices. Note that $\left\lVert Y\right\rVert_{F}^{2}=\left\lVert H\right\rVert_{F}^{2}+\left\lVert K\right\rVert_{F}^{2}$ . As $\Phi$ is necessarily Hermitian-preserving, we also have $\left\lVert\Phi(Y)\right\rVert_{F}^{2}=\left\lVert\Phi(H)+i\Phi(K)\right\rVert_{F}^{2}=\left\lVert\Phi(H)\right\rVert_{F}^{2}+\left\lVert\Phi(K)\right\rVert_{F}^{2}$ . Therefore, as $H$ and $K$ are Hermitian,

[TABLE]

∎
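The reduction to the Hermitian case relies on the decomposition $Y=H+iK$ and the identity $\left\lVert Y\right\rVert_{F}^{2}=\left\lVert H\right\rVert_{F}^{2}+\left\lVert K\right\rVert_{F}^{2}$ . The sketch below checks both on a small complex matrix using Python's built-in complex numbers; the helper names and the example matrix are illustrative.

```python
def conjugate_transpose(Y):
    """Return Y^* for a square matrix given as a list of rows."""
    size = len(Y)
    return [[Y[c][r].conjugate() for c in range(size)] for r in range(size)]

def hermitian_parts(Y):
    """Return Hermitian H = (Y + Y^*)/2 and K = (Y - Y^*)/(2i), so that Y = H + iK."""
    Ystar = conjugate_transpose(Y)
    size = len(Y)
    H = [[(Y[r][c] + Ystar[r][c]) / 2 for c in range(size)] for r in range(size)]
    K = [[(Y[r][c] - Ystar[r][c]) / (2 * 1j) for c in range(size)] for r in range(size)]
    return H, K

def frobenius_squared(M):
    """Squared Frobenius norm: sum of squared absolute values of the entries."""
    return sum(abs(x) ** 2 for row in M for x in row)

Y = [[1 + 2j, 3 - 1j], [0.5j, -2 + 0j]]
H, K = hermitian_parts(Y)
norm_gap = frobenius_squared(Y) - frobenius_squared(H) - frobenius_squared(K)
```

The norm identity holds because the cross term $\operatorname{tr}(HK)-\operatorname{tr}(KH)$ vanishes for Hermitian $H$ and $K$ .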

Appendix B Matrix Scaling

The aim of this section is to provide a self-contained proof of the linear convergence result in the simpler setting of matrix scaling. It can be read as an exposition of the main ideas in Section 3.

In the matrix scaling problem, we are given a non-negative matrix $B\in\mathbb{R}^{m\times n}$ , and the goal is to find a left diagonal scaling matrix $L\in\mathbb{R}^{m\times m}$ and a right diagonal scaling matrix $R\in\mathbb{R}^{n\times n}$ such that $LBR$ is doubly balanced, or report that such scaling matrices do not exist.

B.1 Definitions

In the following, we state the important definitions for the matrix scaling problem. Given a matrix $B\in\mathbb{R}^{m\times n}$ , we define

[TABLE]

as the size, the $i$ -th row sum, and the $j$ -th column sum of the matrix $B$ .

A matrix $B$ is ${\epsilon}$ -nearly doubly balanced if

[TABLE]

for $1\leq i\leq m$ and $1\leq j\leq n$ , and $B$ is doubly balanced when ${\epsilon}=0$ .

The $\ell_{2}$ -error of $B$ is defined as

[TABLE]
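The quantities above are straightforward to compute. The sketch below assumes the following forms, which are inferred from how the quantities are used later in this appendix rather than copied from the displayed definitions: $s=\sum_{i,j}B_{ij}$ , $r_{i}=\sum_{j}B_{ij}$ , $c_{j}=\sum_{i}B_{ij}$ , the condition $|s-mr_{i}|\leq{\epsilon}s$ and $|s-nc_{j}|\leq{\epsilon}s$ for ${\epsilon}$ -nearly doubly balanced, and $\Delta=\frac{1}{m}\sum_{i}(s-mr_{i})^{2}+\frac{1}{n}\sum_{j}(s-nc_{j})^{2}$ . All function names are illustrative.

```python
def size(B):
    """s(B): sum of all entries."""
    return sum(sum(row) for row in B)

def row_sums(B):
    return [sum(row) for row in B]

def column_sums(B):
    return [sum(B[i][j] for i in range(len(B))) for j in range(len(B[0]))]

def l2_error(B):
    """Assumed form: Delta = (1/m) sum_i (s - m r_i)^2 + (1/n) sum_j (s - n c_j)^2."""
    m, n = len(B), len(B[0])
    s = size(B)
    row_part = sum((s - m * r) ** 2 for r in row_sums(B)) / m
    column_part = sum((s - n * c) ** 2 for c in column_sums(B)) / n
    return row_part + column_part

def smallest_epsilon(B):
    """Smallest eps with |s - m r_i| <= eps * s and |s - n c_j| <= eps * s for all i, j."""
    m, n = len(B), len(B[0])
    s = size(B)
    worst_row = max(abs(s - m * r) for r in row_sums(B))
    worst_column = max(abs(s - n * c) for c in column_sums(B))
    return max(worst_row, worst_column) / s

B = [[1, 2], [3, 4]]
delta = l2_error(B)
eps = smallest_epsilon(B)
```

For this $2\times 2$ example, $s=10$ , the row sums are $(3,7)$ and the column sums are $(4,6)$ , so $\Delta=\frac{1}{2}(16+16)+\frac{1}{2}(4+4)=20$ and the matrix is ${\epsilon}$ -nearly doubly balanced exactly for ${\epsilon}\geq 0.4$ .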

The spectral condition is the same as defined in Lemma 4.3.

Definition B.1 (Spectral Gap Condition for Matrix).

A matrix $B\in\mathbb{R}^{m\times n}$ satisfies the $\lambda$ -spectral gap condition if

[TABLE]

B.2 Continuous Matrix Scaling

The matrix scaling problem is a special case of the operator scaling problem. Following the reduction in Section 4.1, given a non-negative matrix $B\in\mathbb{R}^{m\times n}$ , we consider the matrix $A\in\mathbb{R}^{m\times n}$ where the $(i,j)$ -th entry of $A$ is

[TABLE]

The continuous matrix scaling algorithm works on $A$ and is defined by the following differential equation:

[TABLE]

Many quantities change over time in the dynamical system. We use the superscript (t) to denote the quantity of interest at time $t$ . Given a non-negative matrix $B\in\mathbb{R}^{m\times n}$ as the input of the matrix scaling problem, the matrix $A$ in (B.4) is the input of the continuous operator scaling algorithm at time $t=0$ , i.e. $A^{(0)}:=A$ and $B^{(0)}:=B$ . Then $A^{(t)}$ changes over time following (B.5) and $B^{(t)}$ is defined as the matrix with $B^{(t)}_{ij}=(a^{(t)}_{ij})^{2}$ . The dynamical system stops when $B^{(t)}$ is doubly balanced. It is proved in [45] that $\Delta^{(\infty)}=0$ .
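The dynamical system can be simulated with a forward Euler discretization. The sketch below rests on two assumptions that are not taken verbatim from the displayed equations: that (B.4) sets $a_{ij}=\sqrt{B_{ij}}$ , and that (B.5) has the form $\frac{d}{dt}a_{ij}=(2s-mr_{i}-nc_{j})\,a_{ij}$ , which is what the direction $H_{i}=EA_{i}+A_{i}F$ of Lemma A.1 gives entrywise if $E$ and $F$ are the diagonal matrices with entries $s-mr_{i}$ and $s-nc_{j}$ . The step size, the number of steps, the $3\times 3$ input and the normalization $s^{(0)}=1$ are illustrative choices.

```python
import math

def squared_entries(A):
    """B with B_ij = a_ij^2."""
    return [[a * a for a in row] for row in A]

def l2_error(B):
    """Assumed form of Delta: (1/m) sum_i (s - m r_i)^2 + (1/n) sum_j (s - n c_j)^2."""
    m, n = len(B), len(B[0])
    s = sum(sum(row) for row in B)
    rows = [sum(row) for row in B]
    cols = [sum(B[i][j] for i in range(m)) for j in range(n)]
    return (sum((s - m * r) ** 2 for r in rows) / m
            + sum((s - n * c) ** 2 for c in cols) / n)

def euler_step(A, h):
    """One forward Euler step of the assumed flow da_ij/dt = (2s - m r_i - n c_j) a_ij."""
    B = squared_entries(A)
    m, n = len(B), len(B[0])
    s = sum(sum(row) for row in B)
    rows = [sum(row) for row in B]
    cols = [sum(B[i][j] for i in range(m)) for j in range(n)]
    return [[A[i][j] * (1.0 + h * (2 * s - m * rows[i] - n * cols[j]))
             for j in range(n)] for i in range(m)]

B0 = [[1.0, 1.2, 0.9], [0.8, 1.0, 1.1], [1.1, 0.9, 1.0]]
total = sum(sum(row) for row in B0)
A = [[math.sqrt(b / total) for b in row] for row in B0]  # normalize so that s = 1

delta_initial = l2_error(squared_entries(A))
for _ in range(200):
    A = euler_step(A, h=0.05)
delta_final = l2_error(squared_entries(A))
s_final = sum(sum(row) for row in squared_entries(A))
```

On this well-conditioned input the error drops by many orders of magnitude while the size stays close to its initial value. A discretization with a fixed step size is only a heuristic stand-in for the continuous flow and carries no guarantee from the analysis here.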

We state some known results about the continuous matrix scaling algorithm for the analysis. First, the matrix $A$ at any time is a scaling of the original matrix in the following form.

Lemma B.2 (Lemma 4.2.10 in [45]).

At time $T\geq 0$ , define $L^{(T)}\in\mathbb{R}^{m\times m}$ and $R^{(T)}\in\mathbb{R}^{n\times n}$ as

[TABLE]

Then $A^{(T)}=L^{(T)}A^{(0)}R^{(T)}$ .

In particular, if $\Delta^{(t)}=0$ , then $(L^{(t)})^{2}\cdot B\cdot(R^{(t)})^{2}$ is doubly balanced, and $(L^{(t)})^{2}$ and $(R^{(t)})^{2}$ form a solution to the matrix scaling problem. This is how the continuous operator scaling algorithm finds a scaling solution.

From now on, the matrix of interest is $B^{(t)}$ and it evolves over time as $A^{(t)}$ changes in the dynamical system. For ease of notation, we will omit the matrix $B^{(t)}$ and sometimes also the superscript (t) on other quantities when they are clear from the context.

Lemma B.3 (Lemma 3.6.1 in [45]).

For an ${\epsilon}$ -nearly doubly balanced matrix $B$ ,

[TABLE]

Lemma B.4 (Lemma 4.2.8 in [45]).

For any time $t\geq 0$ ,

[TABLE]

Lemma B.5 (Lemma 4.2.9 in [45]).

For any time $t\geq 0$ ,

[TABLE]

Lemma B.6 (Proposition 4.3.1 in [45]).

Suppose there exists $\mu>0$ such that for all $0\leq t\leq T$ ,

[TABLE]

Then

[TABLE]

B.3 Overview

The proof overview is stated in Section 1.5.2 in the matrix scaling setting, so we do not repeat it here. It is easy to see from Lemma B.5 that

[TABLE]

The structure is the same as in Section 3 for the general operator setting. Our goal is to prove the following theorem.

Theorem B.7 (Linear Convergence).

Given a non-negative matrix $B\in\mathbb{R}^{m\times n}$ with $m\leq n$ , if $B$ is ${\epsilon}$ -nearly doubly balanced and $B$ satisfies the $\lambda$ -spectral gap condition in Definition B.1 with $\lambda^{2}\geq C{\epsilon}\ln m$ for a sufficiently large constant $C$ , then in the gradient flow,

[TABLE]

In particular, the gradient flow converges to a $\eta$ -nearly doubly balanced scaling in time $t=O\left(\frac{1}{\lambda}\log(\frac{m}{\eta})\right)$ , and such a scaling always exists under our assumptions.

B.4 Lower Bounding the Quadratic Terms

First, we prove a structural result bounding the maximum error of the rows and columns, which will also be useful in bounding the condition number of the scaling solution later. Then, we will use this structural result to lower bound the quadratic terms of $-\Delta^{\prime}$ .

Proposition B.8.

If $B^{(0)}$ is ${\epsilon}$ -nearly doubly balanced, then for any $t\geq 0$ ,

[TABLE]

for $1\leq i\leq m$ and $1\leq j\leq n$ .

Proof.

We present a slightly informal proof, which can be made formal by using the envelope theorem stated in Theorem 3.3 as done in Proposition 3.2.

Let

[TABLE]

be the maximum violation of a row and a column at time $t$ . Note that $g(0)\leq{\epsilon}s^{(0)}$ as $B^{(0)}$ is ${\epsilon}$ -nearly doubly balanced. We would like to show that for almost every time $\tau\geq 0$ ,

[TABLE]

This would imply the proposition as

[TABLE]

where the second last equality is by Lemma B.4.

To bound ${\frac{d}{dt}}g(t)$ , we consider different cases of how the maximum of $g(t)$ is achieved. Suppose the maximum of $g(t)$ is achieved by column $j$ and $s^{(t)}-nc_{j}^{(t)}$ is negative such that $g(t)=-s^{(t)}+nc_{j}^{(t)}$ . The change of the $j$ -th column sum is

[TABLE]

where the last equality is by the definition of the dynamical system in (B.5), and the inequality is by our assumption that the maximum of $g(t)$ is achieved by column $j$ so that $s^{(t)}-nc_{j}^{(t)}=-g(t)$ and $s^{(t)}-mr_{i}^{(t)}\leq g(t)$ for all $1\leq i\leq m$ . It follows that

[TABLE]

where the first equality is by Lemma B.4.

Similarly, if the maximum of $g(t)$ is achieved by column $j$ and $s^{(t)}-nc_{j}^{(t)}$ is positive, we can show that

[TABLE]

By symmetry of rows and columns, we can prove the same bounds for the change of the violation of the $i$ -th row sum. Therefore, in all four cases, the change of the maximum violation is at most $2\Delta^{(t)}$ . Note that $g$ can be written as the maximum of $m+n$ functions, one for each row and one for each column. We can then use the envelope theorem in Theorem 3.3 as done in Proposition 3.2 to prove formally that $g(t)=g(0)+\int_{0}^{t}\frac{d}{d\tau}g(\tau)d\tau$ to complete the proof.

(It is possible to prove the proposition for the matrix case without using the envelope theorem as $g$ is only the maximum of a finite number of functions, but in the operator case $g(t)$ is the maximum quadratic form of infinitely many unit vectors and we don’t know of a proof without using the envelope theorem.) ∎

We have the following corollary about the row sums and the column sums by rewriting the conclusions of Proposition B.8.

Proposition B.9.

If $B^{(0)}$ is ${\epsilon}$ -nearly doubly balanced, then for any $t\geq 0$ , for $1\leq i\leq m$ and $1\leq j\leq n$ ,

[TABLE]

We can use Proposition B.9 to lower bound the quadratic terms in (B.6).

Lemma B.10.

If $B^{(0)}$ is ${\epsilon}$ -nearly doubly balanced, then for any $t\geq 0$ ,

[TABLE]

Proof.

Using Proposition B.9, the first term in (B.6) is

[TABLE]

Similarly, the second term in (B.6) is

[TABLE]

The lemma follows from $\Delta_{r}+\Delta_{c}=\Delta$ in (B.3). ∎

B.5 Upper Bounding the Cross Term

We will first bound the largest singular value of the matrix $B$ for any ${\epsilon}$ -nearly doubly balanced matrix $B$ . Then, we will use a spectral argument to upper bound the absolute value of the cross term in (B.6).

Lemma B.11.

If $B\in\mathbb{R}^{m\times n}$ is ${\epsilon}$ -nearly doubly balanced, then

[TABLE]

Proof.

We use the fact that the square of the largest singular value of a non-negative matrix is at most the maximum column sum times the maximum row sum (see e.g. page 223 of [38]). So,

[TABLE]

where the second inequality follows from the assumption that $B$ is ${\epsilon}$ -nearly doubly balanced. ∎
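The fact used in the proof, $\sigma_{1}(B)^{2}\leq(\max_{i}r_{i})(\max_{j}c_{j})$ for a non-negative matrix, can be illustrated numerically. The sketch below estimates $\sigma_{1}(B)^{2}$ by power iteration on $B^{*}B$ , an illustrative choice of method (any singular value routine would do), and compares it with the product of the maximum row sum and the maximum column sum. It assumes $B$ is non-zero; the function names and the iteration count are hypothetical.

```python
import math

def sigma1_squared_estimate(B, iterations=200):
    """Power iteration on B^T B; returns the Rayleigh quotient |B x|^2 with |x| = 1.

    The returned value never exceeds sigma_1(B)^2 and converges to it for generic inputs.
    """
    m, n = len(B), len(B[0])
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        y = [sum(B[i][j] * x[j] for j in range(n)) for i in range(m)]  # y = B x
        z = [sum(B[i][j] * y[i] for i in range(m)) for j in range(n)]  # z = B^T y
        norm = math.sqrt(sum(v * v for v in z))
        x = [v / norm for v in z]
    y = [sum(B[i][j] * x[j] for j in range(n)) for i in range(m)]
    return sum(v * v for v in y)

def row_column_bound(B):
    """Maximum row sum times maximum column sum."""
    max_row = max(sum(row) for row in B)
    max_column = max(sum(B[i][j] for i in range(len(B))) for j in range(len(B[0])))
    return max_row * max_column

B = [[1.0, 2.0], [3.0, 4.0]]
estimate = sigma1_squared_estimate(B)
bound = row_column_bound(B)
```

For this matrix, $B^{*}B$ has trace $30$ and determinant $4$ , so $\sigma_{1}(B)^{2}=(30+\sqrt{884})/2\approx 29.87$ , below the bound $7\cdot 6=42$ .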

Lemma B.11 implies that $\vec{1}_{n}$ is an “approximate” first singular vector of $B$ . By the spectral gap condition in Definition B.1, it will follow that any vector perpendicular to $\vec{1}_{n}$ has a “small” quadratic form, and this can be used to bound the cross term in (B.6). The following lemma summarizes the spectral argument, which is the same as Lemma 3.7. Since Lemma 3.7 has no operators involved, we refer to the proof in Section 3.2 and just restate the statement here for ease of reference.

Lemma B.12.

Let $M\in\mathbb{R}^{m\times n}$ . Let $p\in\mathbb{R}^{m}$ and $q\in\mathbb{R}^{n}$ be unit vectors. Suppose the following assumptions hold:

[TABLE]

Then, for any unit vectors $x\perp p$ and $y\perp q$ , it holds that $|x^{*}My|\leq 1+\delta_{1}-\delta_{2}.$

We can use Lemma B.12 to bound the cross term in (B.6).

Lemma B.13.

If $B$ satisfies the spectral condition in Definition B.1 with the additional assumption that $\sigma_{1}(B)\leq(1+\delta)s/\sqrt{mn}$ for $\delta\leq 1$ , then

[TABLE]

Proof.

We apply Lemma B.12 with $M\in\mathbb{R}^{m\times n}$ , $p,x\in\mathbb{R}^{m}$ and $q,y\in\mathbb{R}^{n}$ where

[TABLE]

Clearly, $p$ , $q$ , $x$ , $y$ are unit vectors, and $x\perp p$ and $y\perp q$ . We check the assumptions of Lemma B.12. By the additional assumption,

[TABLE]

and so we can set $\delta_{1}:=2\delta+\delta^{2}$ . Similarly, by the spectral gap condition in Definition B.1,

[TABLE]

and so we can set $\delta_{2}:=2\lambda-\lambda^{2}$ . Also, we check that

[TABLE]

Therefore, we can conclude from Lemma B.12 that

[TABLE]

which implies that

[TABLE]

where the last inequality follows from $\sqrt{\Delta_{r}\Delta_{c}}\leq(\Delta_{r}+\Delta_{c})/2=\Delta/2$ and $\delta\leq 1$ and $\lambda\leq 1$ . ∎

B.6 Lower Bounding the Convergence Rate

Putting the bounds in Lemma B.10 and Lemma B.13 into (B.6), we obtain the following lower bound on the convergence rate of $\Delta$ at any time $t$ .

Proposition B.14.

If $B^{(0)}$ is ${\epsilon}$ -nearly doubly balanced and $B^{(t)}$ satisfies the spectral conditions that

[TABLE]

for $\delta^{(t)}\leq 1$ , then

[TABLE]

Note that Proposition B.14 implies that the dynamical system has linear convergence at time $t=0$ . To see this, note that $\delta^{(0)}\leq{\epsilon}$ by Lemma B.11, and $\lambda^{(0)}=\lambda$ from Definition B.1, and therefore

[TABLE]

Under our assumption that $\lambda\gg{\epsilon}$ , the dynamical system has linear convergence at time $t=0$ with rate at least $\lambda s^{(0)}$ .

To prove that the dynamical system has linear convergence with rate $\lambda s^{(0)}$ for all time $t\geq 0$ , we will prove that the quantities in Proposition B.14 do not change much when we move from $A^{(0)}$ to $A^{(t)}$ , i.e. $s^{(t)}\approx s^{(0)}$ , $\delta^{(t)}\approx\delta^{(0)}$ , and $\lambda^{(t)}\approx\lambda$ .

To bound the change of the singular values of $B^{(t)}$ , we will bound the condition number of the scaling solutions in the dynamical system in the next subsection, and then use these bounds to argue about the change of the singular values and establish Theorem B.7.

B.7 Condition Number

Recall from Lemma B.2 that $A^{(T)}=L^{(T)}A^{(0)}R^{(T)}$ where

[TABLE]

To bound the condition number of $L^{(T)}$ and $R^{(T)}$ , we bound the integrals in the exponent. To bound the integral, we divide the time into two phases. In the first phase, we use Proposition B.8 to argue that $|s^{(t)}-mr_{i}^{(t)}|\approx|s^{(0)}-mr_{i}^{(0)}|$ . In the second phase, we use that $\Delta^{(t)}$ is converging linearly to argue that $|s^{(t)}-mr_{i}^{(t)}|\leq\sqrt{m\Delta^{(t)}}$ is converging linearly. In the following lemma, we should think of $g$ as the spectral gap parameter $\lambda$ in Definition 1.4. The proof of the following lemma is almost identical to that in Lemma 3.16.

Lemma B.15.

Suppose there exists $g>0$ such that for all $0\leq t\leq T$ , it holds that

[TABLE]

If $B^{(0)}$ is ${\epsilon}$ -nearly doubly balanced for ${\epsilon}\leq g$ , then

[TABLE]

Proof.

To bound the condition number, we just need to bound $L^{(T)}_{ii}$ for each $1\leq i\leq m$ as $L^{(T)}$ is a diagonal matrix. Using the form of $L^{(T)}$ described in Lemma B.2, we bound the absolute value of the integral

[TABLE]

We split the integral into two terms. For the first term, we use Proposition B.8 to bound

[TABLE]

where the second inequality is by the fact that $s^{(t)}$ is non-increasing from Lemma B.4. Applying Lemma B.6 with our assumption that $\mu=gs^{(0)}$ , it follows that

[TABLE]

where the second inequality is by Lemma B.3, and the last inequality is by our assumption that $g\geq{\epsilon}$ .

For the second term,

[TABLE]

where the second inequality is from the inequality that $|s^{(t)}-mr_{i}^{(t)}|\leq\sqrt{m\Delta^{(t)}}$ from (B.3), and the third inequality follows from the assumption that $\Delta$ is converging linearly with $\mu=gs^{(0)}$ ; see Lemma B.6.

We choose

[TABLE]

This implies that

[TABLE]

and so the second term is at most $3{\epsilon}/g$ . The first term is at most $5\tau{\epsilon}s^{(0)}\leq 5{\epsilon}\ln m/g$ . Therefore, we conclude that

[TABLE]

∎
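The role of the integrals in the exponent can be illustrated with a short numerical sketch. It assumes, consistent with the integrand $s^{(t)}-mr_{i}^{(t)}$ used in the proof above, that $L^{(T)}_{ii}=\exp\big(\int_{0}^{T}(s^{(t)}-mr_{i}^{(t)})\,dt\big)$ and $R^{(T)}_{jj}=\exp\big(\int_{0}^{T}(s^{(t)}-nc_{j}^{(t)})\,dt\big)$ ; the exact statement of Lemma B.2 is not reproduced in this section, so this form, the helper names, and the test matrix are assumptions.

```python
import numpy as np

def flow_with_scalings(A0, dt=0.01, steps=100):
    """Run the assumed flow and accumulate the diagonal scalings L and R."""
    m, n = A0.shape
    A = A0.copy()
    log_L = np.zeros(m)  # Riemann sum for the integral of s - m r_i
    log_R = np.zeros(n)  # Riemann sum for the integral of s - n c_j
    for _ in range(steps):
        B = A ** 2
        s = B.sum()
        row_err = s - m * B.sum(axis=1)
        col_err = s - n * B.sum(axis=0)
        log_L += dt * row_err
        log_R += dt * col_err
        # Multiplicative step, so A stays of the form L A0 R exactly.
        A = A * np.exp(dt * (row_err[:, None] + col_err[None, :]))
    return A, np.exp(log_L), np.exp(log_R)

def kappa_diag(d):
    """Condition number of a positive diagonal matrix given by its diagonal."""
    return d.max() / d.min()

rng = np.random.default_rng(1)
m, n = 4, 6
B0 = (np.ones((m, n)) / n) * (1 + 0.05 * rng.uniform(-1, 1, size=(m, n)))
A0 = np.sqrt(B0)
A_T, L, R = flow_with_scalings(A0)
print("kappa(L) =", kappa_diag(L), "kappa(R) =", kappa_diag(R))
```

Because each step multiplies row $i$ and column $j$ by a positive factor, the identity $A^{(T)}=L^{(T)}A^{(0)}R^{(T)}$ holds exactly in this discretization, and the condition numbers stay close to $1$ when the input is nearly doubly balanced.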

We cannot use the same argument to bound $\kappa(R^{(T)})$ , as it would only give a bound that depends on $n$ (where we assumed $m\leq n$ ). Instead, we use the bound on $\kappa(L^{(T)})$ to derive a similar bound on $\kappa(R^{(T)})$ . The proof of the following lemma is simpler than that of Lemma 3.18 in the operator case.

Lemma B.16.

Suppose there exists $g>0$ such that for all $0\leq t\leq T$ , it holds that

[TABLE]

If $B^{(0)}$ is ${\epsilon}$ -nearly doubly balanced for ${\epsilon}\leq g\leq 1$ , then $\max_{i}\{L^{(T)}_{ii}\}\leq e^{\ell}$ and $\min_{i}\{L^{(T)}_{ii}\}\geq e^{-\ell}$ implies that

[TABLE]

Proof.

By Lemma B.15,

[TABLE]

To upper bound $\left(R^{(T)}_{j,j}\right)^{2}$ , we consider the column sum by summing the above inequality over $i$ to get

[TABLE]

This implies that

[TABLE]

where the second inequality is by Proposition B.9 and that $B^{(0)}$ is ${\epsilon}$ -nearly doubly balanced.

Similarly, we can lower bound

[TABLE]

where the last inequality uses the assumption that $\Delta^{(t)}$ is converging linearly to apply Lemma B.6 with $\mu=gs^{(0)}$ to obtain

[TABLE]

where we used Lemma B.3 and the assumption that ${\epsilon}\leq g$ . ∎

B.8 Invariance of Linear Convergence

We will first use Lemma B.15 and Lemma B.16 to bound the change of the singular values of $B^{(t)}$ . Then, we will combine the previous results to prove Theorem B.7 that $\Delta^{(t)}$ is converging linearly for all $t\geq 0$ .

Lemma B.17.

For any $t\geq 0$ , suppose the diagonal matrices $L^{(t)}\in\mathbb{R}^{m\times m}$ and $R^{(t)}\in\mathbb{R}^{n\times n}$ satisfy $\left\lVert L^{(t)}-I_{m}\right\rVert_{\rm op}\leq\zeta$ and $\left\lVert R^{(t)}-I_{n}\right\rVert_{\rm op}\leq\zeta$ for some $\zeta\leq 1$ , then

[TABLE]

Proof.

We use Lemma 3.19 to bound the singular value change by the operator norm of the matrix change:

[TABLE]

We write $L^{(t)}=I+\widetilde{L}$ and $R^{(t)}=I+\widetilde{R}$ and $B=B^{(0)}$ , so that $\left\lVert\widetilde{L}\right\rVert_{\rm op}\leq\zeta$ and $\left\lVert\widetilde{R}\right\rVert_{\rm op}\leq\zeta$ by our assumptions. Then,

[TABLE]

where we used the triangle inequality and bounded the sum of the eight operator norms, and used the fact that $\left\lVert XBY\right\rVert_{\rm op}\leq\left\lVert X\right\rVert_{\rm op}\left\lVert Y\right\rVert_{\rm op}\left\lVert B\right\rVert_{\rm op}$ for each term, and used the assumption that $\left\lVert\widetilde{L}\right\rVert_{\rm op},\left\lVert\widetilde{R}\right\rVert_{\rm op}\leq\zeta\leq 1$ so that each term is at most $O(\zeta)\left\lVert B\right\rVert_{\rm op}$ . ∎
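The two ingredients of this proof can be checked numerically on a random instance. The sketch below assumes that $B^{(t)}=(L^{(t)})^{2}B^{(0)}(R^{(t)})^{2}$ , which follows from $A^{(t)}=L^{(t)}A^{(0)}R^{(t)}$ with diagonal scalings when $B_{ij}=A_{ij}^{2}$ (this convention is an assumption here, as it is not restated in this section). It verifies the singular value perturbation inequality $|\sigma_{i}(X)-\sigma_{i}(Y)|\leq\left\lVert X-Y\right\rVert_{\rm op}$ and the elementary bound $\left\lVert B^{(t)}-B\right\rVert_{\rm op}\leq((1+\zeta)^{4}-1)\left\lVert B\right\rVert_{\rm op}\leq 15\zeta\left\lVert B\right\rVert_{\rm op}$ for $\zeta\leq 1$ . The explicit constant $15$ is one valid choice for the $O(\zeta)$ term, not the constant used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, zeta = 4, 6, 0.1

B = rng.uniform(0.0, 1.0, size=(m, n))
# Diagonals of L and R, chosen so that |L_ii - 1| <= zeta and |R_jj - 1| <= zeta.
L = 1 + zeta * rng.uniform(-1, 1, size=m)
R = 1 + zeta * rng.uniform(-1, 1, size=n)

# Assumed relation: squaring the entries of L A R gives L^2 B R^2.
B_t = (L ** 2)[:, None] * B * (R ** 2)[None, :]

sv_0 = np.linalg.svd(B, compute_uv=False)
sv_t = np.linalg.svd(B_t, compute_uv=False)
max_sv_change = np.abs(sv_t - sv_0).max()

op_change = np.linalg.norm(B_t - B, ord=2)  # operator norm of the change
op_B = np.linalg.norm(B, ord=2)             # operator norm of B
expansion_bound = ((1 + zeta) ** 4 - 1) * op_B

print(max_sv_change, op_change, expansion_bound)
```

The factor $(1+\zeta)^{4}-1$ collects the eight terms left after expanding $(I+\widetilde{L})^{2}B(I+\widetilde{R})^{2}-B$ and bounding each one by submultiplicativity of the operator norm.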

We are ready to put together the results to prove the following theorem which implies Theorem B.7. The proof is almost the same as that of Theorem 3.21.

Theorem B.18.

If $B^{(0)}$ is ${\epsilon}$ -nearly doubly balanced and $B^{(0)}$ satisfies the $\lambda$ -spectral gap condition in Definition B.1 with $\lambda^{2}\geq C{\epsilon}\ln m$ for a sufficiently large constant $C$ , then for all $t\geq 0$ it holds that

[TABLE]

Proof.

Recall from Proposition B.14 the definitions of $\delta^{(t)}$ and $\lambda^{(t)}$ , and $\delta^{(0)}\leq{\epsilon}$ by Lemma B.11 and $\lambda^{(0)}=\lambda$ from Definition B.1. Let $T$ be the supremum such that $s^{(t)}\geq(1-{\epsilon})s^{(0)}$ and $\lambda^{(t)}-3\delta^{(t)}\geq\frac{1}{2}(\lambda^{(0)}-3\delta^{(0)})$ . Our goal is to prove that $\Delta^{(t)}$ is converging linearly for $0\leq t\leq T$ and $T$ is unbounded.

First, we show that $\Delta^{(t)}$ is converging linearly for $0\leq t\leq T$ . By Proposition B.14,

[TABLE]

where in the second inequality we used that $s^{(t)}\geq(1-{\epsilon})s^{(0)}$ and $\lambda^{(t)}-3\delta^{(t)}\geq\frac{1}{2}(\lambda^{(0)}-3\delta^{(0)})$ for $0\leq t\leq T$ . Note that our assumption implies that $\lambda^{(0)}=\lambda\geq C{\epsilon}$ for a sufficiently large constant $C$ as $\lambda\leq 1$ . Since $\delta^{(0)}\leq{\epsilon}$ from Lemma B.11, it follows that for any $0\leq t\leq T$ ,

[TABLE]

Next, we argue that the size condition and the spectral gap condition will still be maintained beyond time $T$ . For the size change, by Lemma B.6 with $\mu=\lambda s^{(0)}$ ,

[TABLE]

where the second inequality is by Lemma B.3 and the last inequality is by $\lambda\geq C{\epsilon}$ for a sufficiently large constant $C$ .

For the change of the second largest singular value, by definition,

[TABLE]

On the other hand, we can upper bound $\sigma_{2}(B^{(T)})-\sigma_{2}(B^{(0)})$ using condition numbers. Using Lemma B.15 with $g=\lambda$ , $\max_{i}\{L^{(T)}_{ii}\}\leq\exp\left(O({\epsilon}\ln m/\lambda)\right)$ and $\min_{i}\{L^{(T)}_{ii}\}\geq\exp\left(-O({\epsilon}\ln m/\lambda)\right)$ . Note that our assumption implies that

[TABLE]

where the implication is by the inequality $e^{x}-1\leq O(x)$ for $x$ close to zero. Then, by Lemma B.16, we also have $\left\lVert R^{(T)}-I\right\rVert_{\rm op}\leq O\left(\lambda/C\right)$ . Putting these bounds into $\zeta$ of Lemma B.17, we obtain

[TABLE]

Combining the upper bound and lower bound and using $\delta_{1}^{(0)}\leq{\epsilon}$ from Lemma B.11, it follows that

[TABLE]

where the last inequality is by the assumption that $\lambda\geq C{\epsilon}$ .

For the change of the largest singular value, by Proposition B.9,

[TABLE]

where the first and last inequalities use that $s^{(T)}\geq(1-{\epsilon})s^{(0)}$ . The same holds for $\operatorname{diag}(\{c_{j}^{(T)}\}_{j=1}^{n})$ and these imply that ${\cal A}^{(T)}$ is $3{\epsilon}$ -nearly doubly balanced. By Lemma B.11, this implies that $\delta^{(T)}\leq 3{\epsilon}$ . Therefore,

[TABLE]

where the second last inequality uses that $C$ is a sufficiently large constant.

Since our dynamical system is continuous, both conditions are still satisfied at time $T+\eta$ for some $\eta>0$ , which contradicts the choice of $T$ as the supremum of the times up to which both conditions are satisfied. Therefore, $T$ is unbounded and the linear convergence of $\Delta$ is maintained throughout the execution of the dynamical system. ∎
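The step in the proof above that turns the entrywise bounds $e^{-\ell}\leq L^{(T)}_{ii}\leq e^{\ell}$ into an operator norm bound on $L^{(T)}-I$ can be spelled out as follows. This is a routine calculation added for the reader's convenience, written for a generic $0\leq\ell\leq 1$ ; the explicit constant $2$ is one valid choice for the $O(\cdot)$ and is not taken from the paper.

```latex
\left\lVert L^{(T)} - I \right\rVert_{\rm op}
  = \max_{i} \left| L^{(T)}_{ii} - 1 \right|
  \leq \max\left\{ e^{\ell} - 1,\; 1 - e^{-\ell} \right\}
  = e^{\ell} - 1
  \leq (e - 1)\,\ell
  \leq 2\ell
  \qquad \text{for } 0 \leq \ell \leq 1 .
```

The first equality holds because $L^{(T)}$ is diagonal, the middle equality uses $e^{\ell}+e^{-\ell}\geq 2$ , and the bound $e^{\ell}-1\leq(e-1)\ell$ follows from convexity of $e^{x}$ on $[0,1]$ . With $\ell=O({\epsilon}\ln m/\lambda)$ and $\lambda^{2}\geq C{\epsilon}\ln m$ , this gives $\ell\leq O(\lambda/C)$ , which is the form used in the proof.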

Bibliography

[1] Z. Allen-Zhu, A. Garg, Y. Li, R. Oliveira, A. Wigderson. Operator scaling via geodesically convex optimization, invariant theory and polynomial identity testing. In Proceedings of the 50th Annual ACM Symposium on Theory of Computing (STOC), 172–181, 2018.
[2] Z. Allen-Zhu, Y. Li, R. Oliveira, A. Wigderson. Much faster algorithms for matrix scaling. In Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2017.
[3] M. F. Atiyah. Convexity and commuting Hamiltonians. Bulletin of the London Mathematical Society 14(1), 1982.
[4] K. Ball. Volumes of sections of cubes and related problems. Geometric Aspects of Functional Analysis, 251–260, 1989.
[5] F. Barthe. On a reverse form of the Brascamp-Lieb inequality. Inventiones mathematicae 134(2), 335–361, 1998.
[6] A. Barvinok, A. Samorodnitsky. Computing the partition function for perfect matchings in a hypergraph. Combinatorics, Probability, and Computing 20(6), 2011.
[7] A. Ben-Aroya, O. Schwartz, A. Ta-Shma. Quantum expanders: motivation and construction. Theory of Computing 6, 47–79, 2010.
[8] J. Bennett, A. Carbery, M. Christ, T. Tao. The Brascamp-Lieb inequalities: finiteness, structure, and extremals. Geometric and Functional Analysis (GAFA) 17, 1343, 2008.


