Partial minimization of strict convex functions and tensor scaling

Shmuel Friedland

arXiv:1905.11384·math.OC·June 6, 2019

TL;DR

This paper introduces a partial minimization algorithm for strict convex functions, demonstrating geometric convergence and connecting it to classical methods like conjugate gradient and Sinkhorn scaling for matrices and tensors.

Contribution

It presents a novel partial minimization algorithm with proven convergence for convex functions and links tensor and matrix scaling to partial minimization of log-convex functions.

Findings

01

Algorithm converges geometrically to the unique minimum.

02

Connection established between Sinkhorn scaling and partial minimization.

03

Generalization of conjugate gradient method for quadratic polynomials.

Abstract

Assume that f is a strictly convex function with a unique minimum in R^n. We divide the vector of n variables into d groups of vector subvariables, with d at least two. We assume that we can find the partial minimum of f with respect to each vector subvariable while the other variables are fixed. We then describe an algorithm that partially minimizes each time on a specifically chosen vector subvariable. This algorithm converges geometrically to the unique minimum. The rate of convergence depends on the uniform bounds on the eigenvalues of the Hessian of f in the compact sublevel set of points whose f-values are at most f(x_0), where x_0 is the starting point of the algorithm. In the case where f is a polynomial of degree two, with positive definite quadratic term, and d=n our method can be considered as a generalization of the classical conjugate gradient method. The main result of this paper is the…

Equations (103)

$$\lim_{\|\mathbf{x}\|\to\infty}f(\mathbf{x})=\infty.$$

$$\min\{f(\mathbf{x}),\ \mathbf{x}_{j}\in\mathbb{R}^{m_{j}},\ \mathbf{x}=(\mathbf{x}^{j},\mathbf{x}_{j})\}=f(\mathbf{x}^{j},\mathbf{x}_{j}(\mathbf{x}^{j})).$$

$$D(\mathbf{u})BD(\mathbf{v})\mathbf{1}_{m}=\mathbf{r},\quad\mathbf{1}_{l}^{\top}D(\mathbf{u})BD(\mathbf{v})=\mathbf{c}^{\top},$$

$$f(\mathbf{u},\mathbf{v})=\sum_{i=j=1}^{l,m}b_{i,j}e^{u_{i}+v_{j}}.$$

$$|t_{k}-t^{\star}|\leq\frac{\|\nabla f(\mathbf{x}_{1})\|^{2}}{\alpha^{2}(t_{1})}\prod_{q=1}^{k-1}\Big(1-\frac{1}{(d-1)\kappa(t_{q})}\Big),$$

$$\|\mathbf{x}_{k}-\mathbf{x}^{\star}\|^{2}\leq\frac{\|\nabla f(\mathbf{x}_{1})\|^{2}}{\alpha(t_{1})\alpha(t_{k})}\prod_{q=1}^{k-1}\Big(1-\frac{1}{(d-1)\kappa(t_{q})}\Big).$$

$$g_{\mathbf{y}}(t)\geq g_{\mathbf{y}}(1)+g_{\mathbf{y}}^{\prime}(1)(t-1)\geq f(\mathbf{x}^{\star})+\nu(t-1)\quad\text{for }t\geq 1$$

$$V(t)=\{\mathbf{y}\in\mathbb{R}^{n},\ f(\mathbf{y})\leq t\},\quad t=f(\mathbf{x})$$

$$f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})+\frac{\alpha(t_{0})}{2}\|\mathbf{y}-\mathbf{x}\|^{2}\leq f(\mathbf{y})\leq f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})+\frac{\beta(t_{0})}{2}\|\mathbf{y}-\mathbf{x}\|^{2}.$$

$$f(\mathbf{x}^{\star})+\frac{\alpha(t_{0})}{2}\|\mathbf{x}-\mathbf{x}^{\star}\|^{2}\leq f(\mathbf{x})\leq f(\mathbf{x}^{\star})+\frac{\beta(t_{0})}{2}\|\mathbf{x}-\mathbf{x}^{\star}\|^{2}.$$

$$\mathbf{x}^{+}=\mathbf{x}-\frac{1}{\beta(t_{0})}\nabla f(\mathbf{x}),\quad\mathbf{x}^{++}=\mathbf{x}-\frac{1}{\alpha(t_{0})}\nabla f(\mathbf{x}).$$

$$\mathbf{x}^{a}=\mathbf{x}-\frac{2}{\beta(t_{0})}\nabla f(\mathbf{x}).$$

$$f(\mathbf{x}^{+})\leq f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{x}^{+}-\mathbf{x})+\frac{\beta(t_{0})}{2}\|\mathbf{x}^{+}-\mathbf{x}\|^{2}=f(\mathbf{x})-\frac{\|\nabla f(\mathbf{x})\|^{2}}{2\beta(t_{0})}.$$

$$R(\mathbf{x})^{2}=\frac{2}{\alpha(t_{0})}(f(\mathbf{x})-f(\mathbf{x}^{\star}))\leq\frac{\|\nabla f(\mathbf{x})\|^{2}}{\alpha^{2}(t_{0})},$$

$$\mathbf{x}^{\star}\in\mathrm{B}\Big(\mathbf{x}^{++},\frac{\|\nabla f(\mathbf{x})\|^{2}}{\alpha^{2}(t_{0})}-\frac{2}{\alpha(t_{0})}(f(\mathbf{x})-f(\mathbf{x}^{\star}))\Big)\subseteq\mathrm{B}\Big(\mathbf{x}^{++},\frac{\|\nabla f(\mathbf{x})\|^{2}}{\alpha^{2}(t_{0})}\Big(1-\frac{1}{\kappa(t_{0})}\Big)-\frac{2}{\alpha(t_{0})}(f(\mathbf{x}^{+})-f(\mathbf{x}^{\star}))\Big),$$

$$\frac{\|\nabla f(\mathbf{x})\|}{\beta(t_{0})}\leq\|\mathbf{x}-\mathbf{x}^{\star}\|.$$

$$\|\mathbf{x}^{\star}-\mathbf{x}^{++}\|^{2}=\Big\|\mathbf{x}^{\star}-\mathbf{x}+\frac{1}{\alpha(t_{0})}\nabla f(\mathbf{x})\Big\|^{2}=\|\mathbf{x}^{\star}-\mathbf{x}\|^{2}+\frac{2}{\alpha(t_{0})}\nabla f(\mathbf{x})^{\top}(\mathbf{x}^{\star}-\mathbf{x})+\frac{\|\nabla f(\mathbf{x})\|^{2}}{\alpha^{2}(t_{0})}.$$

$$f(\mathbf{x}^{\star})\geq f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{x}^{\star}-\mathbf{x})+\frac{\alpha(t_{0})}{2}\|\mathbf{x}^{\star}-\mathbf{x}\|^{2}.$$

$$\|\mathbf{x}^{\star}-\mathbf{x}^{++}\|^{2}\leq\frac{\|\nabla f(\mathbf{x})\|^{2}}{\alpha^{2}(t_{0})}-\frac{2}{\alpha(t_{0})}(f(\mathbf{x})-f(\mathbf{x}^{\star})).$$

$$\frac{\|\nabla f(\mathbf{x})\|^{2}}{2\beta(t_{0})}\leq f(\mathbf{x})-f(\mathbf{x}^{+})\leq f(\mathbf{x})-f(\mathbf{x}^{\star})\leq\frac{\beta(t_{0})}{2}\|\mathbf{x}-\mathbf{x}^{\star}\|^{2}.$$

$$f(\mathbf{x}_{k})-f(\mathbf{x}^{\star})\leq(f(\mathbf{x}_{0})-f(\mathbf{x}^{\star}))\Big(1-\frac{1}{d\kappa(t_{0})}\Big)\prod_{i=1}^{k-1}\Big(1-\frac{1}{(d-1)\kappa(t_{i})}\Big)\leq\frac{\|\nabla f(\mathbf{x}_{0})\|^{2}}{2\alpha(t_{0})}\Big(1-\frac{1}{d\kappa(t_{0})}\Big)\prod_{i=1}^{k-1}\Big(1-\frac{1}{(d-1)\kappa(t_{i})}\Big)\leq\frac{\|\nabla f(\mathbf{x}_{0})\|^{2}}{2\alpha(t_{0})}\Big(1-\frac{1}{d\kappa(t_{0})}\Big)\Big(1-\frac{1}{(d-1)\kappa(t_{0})}\Big)^{k-1},$$

$$\|\mathbf{x}_{k}-\mathbf{x}^{\star}\|^{2}\leq\frac{\|\nabla f(\mathbf{x}_{0})\|^{2}}{\alpha(t_{k})\alpha(t_{0})}\Big(1-\frac{1}{d\kappa(t_{0})}\Big)\prod_{i=1}^{k-1}\Big(1-\frac{1}{(d-1)\kappa(t_{i})}\Big).$$

$$\|\nabla f(\mathbf{x})\|^{2}=\sum_{l=1}^{d}\|\nabla_{l}f(\mathbf{x})\|^{2}\ \Rightarrow\ \max\{\|\nabla_{l}f(\mathbf{x})\|,\ l\in[d]\}\geq\frac{\|\nabla f(\mathbf{x})\|}{\sqrt{d}}.$$

$$\lim_{k\to\infty}\mathbf{x}_{k}=\mathbf{x}^{\star},\quad\lim_{k\to\infty}\alpha(t_{k})=\alpha(t^{\star}),\quad\lim_{k\to\infty}\beta(t_{k})=\beta(t^{\star}),\quad\lim_{k\to\infty}\kappa(t_{k})=\kappa(t^{\star}).$$

$$f(\mathbf{x}_{0})-f(\mathbf{x}_{1})=g(\mathbf{x}_{j_{0},0})-g(\mathbf{x}_{j_{0}}^{\star})\geq\frac{\|\nabla g(\mathbf{x}_{j_{0},0})\|^{2}}{2\beta(t_{0})}=\frac{\|\nabla_{j_{0}}f(\mathbf{x}_{0})\|^{2}}{2\beta(t_{0})}\geq\frac{\|\nabla f(\mathbf{x}_{0})\|^{2}}{2\beta(t_{0})d}.$$

$$\frac{f(\mathbf{x}_{1})-f(\mathbf{x}^{\star})}{f(\mathbf{x}_{0})-f(\mathbf{x}^{\star})}=1-\frac{f(\mathbf{x}_{0})-f(\mathbf{x}_{1})}{f(\mathbf{x}_{0})-f(\mathbf{x}^{\star})}\leq 1-\Big(\frac{\|\nabla f(\mathbf{x}_{0})\|^{2}}{2\beta(t_{0})d}\Big)\Big/\Big(\frac{\|\nabla f(\mathbf{x}_{0})\|^{2}}{2\alpha(t_{0})}\Big)=1-\frac{1}{d\kappa(t_{0})}.$$

$$d\geq 2,\quad m_{j}\geq 2\ \text{ for }j\in[d].$$

$$s_{k,i_{k}}:=\sum_{i_{j}\in[m_{j}],\,j\in[d]\setminus\{k\}}a_{i_{1},\ldots,i_{d}},\quad i_{k}\in[m_{k}],\ k\in[d]$$

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Advanced Optimization Algorithms Research · Tensor decomposition and applications

Full text

Partial minimization of strict convex functions and tensor scaling

Shmuel Friedland

Department of Mathematics and Computer Science, University of Illinois at Chicago, Chicago, Illinois, 60607-7045, USA

[email protected]

(Date: June 4, 2019)

Abstract.

Assume that $f\in\mathrm{C}^{2}(\mathbb{R}^{n})$ is a strictly convex function with a unique minimum. We divide the vector of $n$ variables into $d\geq 2$ groups of vector subvariables. We assume that we can find the partial minimum of $f$ with respect to each vector subvariable while the other variables are fixed. We then describe an algorithm that partially minimizes each time on a specifically chosen vector subvariable. This algorithm converges geometrically to the unique minimum. The rate of convergence depends on the uniform bounds on the eigenvalues of the Hessian of $f$ in the compact sublevel set $f(\mathbf{x})\leq f(\mathbf{x}_{0})$, where $\mathbf{x}_{0}$ is the starting point of the algorithm. In the case where $f(\mathbf{x})=\mathbf{x}^{\top}A\mathbf{x}+\mathbf{b}^{\top}\mathbf{x}$ and $d=n$ our method can be considered as a generalization of the classical conjugate gradient method. The main result of this paper is the observation that the celebrated Sinkhorn diagonal scaling algorithm for matrices, and the corresponding diagonal scaling of tensors, can be viewed as partial minimization of certain log-convex functions.

2010 Mathematics Subject Classification. 15A39, 15A69, 52A41, 65F35, 65K05.

Keywords: Partial minimization of strict convex functions, positive diagonal scaling of nonnegative tensors, prescribed slice sums, discrete Schrödinger bridge problem, Sinkhorn algorithm.

1. Introduction

Let $f\in\mathrm{C}^{2}(\mathbb{R}^{n})$ be a strictly convex function, that is, the Hessian $H(f)(\mathbf{x})$ is positive definite for each $\mathbf{x}\in\mathbb{R}^{n}$. We assume that $f$ has a minimum at $\mathbf{x}^{\star}\in\mathbb{R}^{n}$, which is necessarily unique. It is well known that a necessary and sufficient condition for the existence of $\mathbf{x}^{\star}$ is:

$$\lim_{\|\mathbf{x}\|\to\infty}f(\mathbf{x})=\infty. \tag{1.1}$$

See Lemma 2.1. We now recall the notion of partial minimization of $f$. For $m\in\mathbb{N}$ denote $[m]=\{1,\ldots,m\}\subset\mathbb{N}$. Divide the vector $\mathbf{x}=(x_{1},\ldots,x_{n})^{\top}$ into $d\geq 2$ groups: $\mathbf{x}^{\top}=(\mathbf{x}_{1}^{\top},\ldots,\mathbf{x}_{d}^{\top})$, where $\mathbf{x}_{i}\in\mathbb{R}^{m_{i}}$ for $i\in[d]$ and $\sum_{i=1}^{d}m_{i}=n$. (Thus $d\in[n]\setminus\{1\}$.) View $\mathbf{x}$ as $(\mathbf{x}^{j},\mathbf{x}_{j})$, where $\mathbf{x}^{j}\in\mathbb{R}^{n-m_{j}}$ is obtained from $\mathbf{x}$ by deleting the vector coordinate $\mathbf{x}_{j}$. Denote by $\nabla_{j}f(\mathbf{x})\in\mathbb{R}^{m_{j}}$ the vector of derivatives of $f(\mathbf{x})$ with respect to the coordinates in $\mathbf{x}_{j}$. Minimize $f(\mathbf{x})$ with respect to the variable $\mathbf{x}_{j}$ while keeping all other variables fixed:

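$$\min\{f(\mathbf{x}),\ \mathbf{x}_{j}\in\mathbb{R}^{m_{j}},\ \mathbf{x}=(\mathbf{x}^{j},\mathbf{x}_{j})\}=f(\mathbf{x}^{j},\mathbf{x}_{j}(\mathbf{x}^{j})).$$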

Our main assumption is that we can find $\mathbf{x}_{j}(\mathbf{x}^{j})$ either precisely, or with a prescribed accuracy. This assumption holds if $f$ is a polynomial of degree $2$, $f(\mathbf{x})=\mathbf{x}^{\top}A\mathbf{x}+\mathbf{b}^{\top}\mathbf{x}+c$, where $A$ is a symmetric positive definite matrix and $d=n$. This is the classical case of the conjugate gradient [11]. The main point of this paper is to show that this assumption holds if we consider the classical scaling algorithm of Sinkhorn [19], or the more general tensor scaling problem [2, 16, 7, 8]. Matrix scaling problems arise in several areas of applied and pure mathematics. There are many available algorithms to achieve the scaling. See [1] for a historical survey and for new suggested algorithms. The main purpose of this paper is to show that matrix and tensor scaling can be efficiently implemented using our simple algorithm, which ensures geometric convergence. While for matrices our algorithm reduces to alternating scaling, for tensors the algorithm chooses the order of scaling.

We now state briefly our algorithm:

Algorithm

Choose $\mathbf{x}_{0}\in\mathbb{R}^{n}$ .
for $k:=0,1,2,\ldots$
$j\in\operatorname{arg\,max}\{\|\nabla_{l}f(\mathbf{x}_{k})\|,l\in[d]\}$
$\mathbf{x}_{k+1}=(\mathbf{x}_{k}^{j},\mathbf{x}_{j}(\mathbf{x}_{k}^{j}))$
end
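The loop above can be sketched in Python for the quadratic case $f(\mathbf{x})=\mathbf{x}^{\top}A\mathbf{x}+\mathbf{b}^{\top}\mathbf{x}$, where each partial minimum is the solution of a small linear system. This is only an illustrative sketch: the function name `greedy_partial_minimization`, the block layout, the stopping tolerance, and the random test problem are assumptions made here, not part of the paper.

```python
import numpy as np

def greedy_partial_minimization(A, b, blocks, x0, iters=2000, tol=1e-12):
    """Greedy block-wise exact minimization of f(x) = x^T A x + b^T x.

    A is symmetric positive definite; `blocks` is a list of index arrays
    partitioning range(n). All names are illustrative assumptions.
    """
    x = np.asarray(x0, dtype=float).copy()
    n = len(x)
    for _ in range(iters):
        grad = 2.0 * A @ x + b                      # gradient of f
        norms = [np.linalg.norm(grad[idx]) for idx in blocks]
        j = int(np.argmax(norms))                   # block with largest partial gradient
        if norms[j] < tol:                          # all partial gradients vanish
            break
        idx = blocks[j]
        rest = np.setdiff1d(np.arange(n), idx)
        # Partial minimum over block j: solve grad_j = 0 with the other blocks fixed,
        # i.e. 2 A_jj x_j = -(b_j + 2 A_j,rest x_rest).
        rhs = -(b[idx] + 2.0 * A[np.ix_(idx, rest)] @ x[rest])
        x[idx] = np.linalg.solve(2.0 * A[np.ix_(idx, idx)], rhs)
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6.0 * np.eye(6)                       # symmetric positive definite
b = rng.standard_normal(6)
blocks = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]   # d = 3
x_hat = greedy_partial_minimization(A, b, blocks, np.zeros(6))
x_star = np.linalg.solve(2.0 * A, -b)               # unique minimizer: 2 A x + b = 0
```

The block chosen at step $k$ has zero partial gradient afterwards, so it is never selected twice in a row unless the iterate is already the minimizer.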

We show that this algorithm converges geometrically to $\mathbf{x}^{\star}$ with at least a factor $(1-\frac{\alpha}{(d-1)\beta})$, in agreement with the rate $1-\frac{1}{(d-1)\kappa}$ of Theorem 2.4, where $\alpha$ and $\beta$ are the minimum and the maximum of the lowest and highest eigenvalues of $H(f)$ respectively in the compact convex sublevel region $\{\mathbf{x},f(\mathbf{x})\leq f(\mathbf{x}_{0})\}$.

Note that if $d=2$ , i.e., $\mathbf{x}=(\mathbf{x}_{1},\mathbf{x}_{2})$ , then after one iteration the above minimization algorithm is an alternating minimization, as in the Sinkhorn algorithm. Instead of using the standard coordinates $\mathbf{x}=(x_{1},\ldots,x_{n})^{\top}$ we can use the coordinates $\hat{\mathbf{x}}=P\mathbf{x}$ , where the $n$ rows of $P$ : $\mathbf{p}_{1}^{\top},\ldots,\mathbf{p}_{n}^{\top}$ are linearly independent. In the conjugate gradient algorithm we need to choose the vectors $\mathbf{p}_{1},\ldots,\mathbf{p}_{n}$ to be orthogonal with respect to $A$ : $\mathbf{p}_{i}^{\top}A\mathbf{p}_{j}=0$ for $i\neq j$ [11].

We now explain briefly why the Sinkhorn scaling algorithm for matrices can be stated as a partial minimization of a strictly convex function. For simplicity of exposition we restrict ourselves mainly to positive rectangular matrices $B=[b_{i,j}]\in\mathbb{R}^{l\times m}$. For $\mathbf{u}=(u_{1},\ldots,u_{l})^{\top}\in\mathbb{R}^{l}$ we denote by $D(\mathbf{u})\in\mathbb{R}^{l\times l}$ the diagonal matrix with the diagonal entries $e^{u_{1}},\ldots,e^{u_{l}}$. Let $\mathbf{1}_{n}=(1,\ldots,1)^{\top}\in\mathbb{R}^{n}$ and assume $\mathbf{r}=(r_{1},\ldots,r_{l})^{\top},\mathbf{c}=(c_{1},\ldots,c_{m})^{\top}$ are given positive vectors satisfying $\mathbf{1}_{l}^{\top}\mathbf{r}=\mathbf{1}_{m}^{\top}\mathbf{c}$. The scaling problem is to find $\mathbf{u},\mathbf{v}$ such that the matrix $D(\mathbf{u})BD(\mathbf{v})$ has row and column sums $\mathbf{r}$ and $\mathbf{c}$ respectively:

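$$D(\mathbf{u})BD(\mathbf{v})\mathbf{1}_{m}=\mathbf{r},\quad\mathbf{1}_{l}^{\top}D(\mathbf{u})BD(\mathbf{v})=\mathbf{c}^{\top},$$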

for some $\mathbf{u}\in\mathbb{R}^{l},\mathbf{v}\in\mathbb{R}^{m}$. Clearly, this problem is equivalent to the scaling problem when we replace $\mathbf{r},\mathbf{c}$ with $b\mathbf{r},b\mathbf{c}$ for some $b>0$. For a given nonzero vector $\mathbf{w}\in\mathbb{R}^{n}$ denote $\mathrm{L}(\mathbf{w})=\{\mathbf{x}\in\mathbb{R}^{n},\mathbf{w}^{\top}\mathbf{x}=0\}$. The dimension of $\mathrm{L}(\mathbf{w})$ is $n-1$ and we identify $\mathrm{L}(\mathbf{w})$ with $\mathbb{R}^{n-1}$. Let

$$f(\mathbf{u},\mathbf{v})=\sum_{i=j=1}^{l,m}b_{i,j}e^{u_{i}+v_{j}}.$$

Clearly, $f(\mathbf{x}),\mathbf{x}=(\mathbf{u},\mathbf{v})$ is a convex function on $\mathbb{R}^{l+m}$. We consider the restriction of $f$ to $\mathrm{L}(\mathbf{r})\times\mathrm{L}(\mathbf{c})$. Since $B>0$ it follows that $f(\mathbf{x})$ is strictly convex on $\mathrm{L}(\mathbf{r})\times\mathrm{L}(\mathbf{c})$ and the condition (1.1) holds, see Section 4. Let $\mathbf{x}^{\star}=(\mathbf{u}^{\star},\mathbf{v}^{\star})\in\mathrm{L}(\mathbf{r})\times\mathrm{L}(\mathbf{c})$ be the minimum point of $f|\mathrm{L}(\mathbf{r})\times\mathrm{L}(\mathbf{c})$. Use Lagrange multipliers to deduce that $D(\mathbf{u}^{\star})BD(\mathbf{v}^{\star})$ has row and column sums $b\mathbf{r},b\mathbf{c}$ for some $b>0$. Fix $\mathbf{v}\in\mathrm{L}(\mathbf{c})$ and consider the partial minimum $\min\{f(\mathbf{u},\mathbf{v}),\mathbf{u}\in\mathrm{L}(\mathbf{r})\}$. Use Lagrange multipliers to deduce that this minimum is achieved at a unique $\mathbf{u}(\mathbf{v})$ such that the row sums of $D(\mathbf{u}(\mathbf{v}))BD(\mathbf{v})$ are of the form $b\mathbf{r}$. We now give a simple formula for $\mathbf{u}(\mathbf{v})$. Observe first that the equality $D(\mathbf{u})BD(\mathbf{v})\mathbf{1}_{m}=\mathbf{r}$ is uniquely solvable by $\tilde{u}_{i}=\log r_{i}-\log(BD(\mathbf{v})\mathbf{1}_{m})_{i}$ for $i\in[l]$. Let $\tilde{\mathbf{u}}(\mathbf{v})=(\tilde{u}_{1},\ldots,\tilde{u}_{l})^{\top}$. Note that $\tilde{\mathbf{u}}(\mathbf{v})$ is the scaling part of the Sinkhorn algorithm. Then $\mathbf{u}(\mathbf{v})=\tilde{\mathbf{u}}(\mathbf{v})-a\mathbf{1}_{l}$, where $a=\mathbf{r}^{\top}\tilde{\mathbf{u}}(\mathbf{v})/(\mathbf{r}^{\top}\mathbf{1}_{l})$, so that $\mathbf{r}^{\top}\mathbf{u}(\mathbf{v})=0$. Similarly, for a fixed $\mathbf{u}\in\mathrm{L}(\mathbf{r})$ the minimum of $f(\mathbf{u},\mathbf{v})$ for $\mathbf{v}\in\mathrm{L}(\mathbf{c})$ is achieved at a unique $\mathbf{v}(\mathbf{u})$, which can be obtained as follows. First, use Sinkhorn scaling to find $\tilde{\mathbf{v}}(\mathbf{u})$ such that $D(\mathbf{u})BD(\tilde{\mathbf{v}}(\mathbf{u}))$ has the column sums $\mathbf{c}$. Second, let $\mathbf{v}(\mathbf{u})=\tilde{\mathbf{v}}(\mathbf{u})-(\mathbf{c}^{\top}\tilde{\mathbf{v}}(\mathbf{u})/\mathbf{c}^{\top}\mathbf{1}_{m})\mathbf{1}_{m}$. Since $d=2$ the partial minimization algorithm is completely equivalent to the Sinkhorn scaling algorithm. The geometric rate of convergence depends on the estimates of the eigenvalues of the Hessian on the sublevel set $f(\mathbf{x})\leq f(\mathbf{x}_{0})$ in $\mathrm{L}(\mathbf{r})\times\mathrm{L}(\mathbf{c})$.
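The two partial minimizations described above can be written out as a short NumPy sketch. It alternates the Sinkhorn row step with the shift back into $\mathrm{L}(\mathbf{r})$, then the column step with the shift back into $\mathrm{L}(\mathbf{c})$. The function name `sinkhorn_partial_min`, the fixed iteration count, and the random positive test matrix are assumptions made for illustration only.

```python
import numpy as np

def sinkhorn_partial_min(B, r, c, iters=500):
    """Alternate the partial minima of f(u, v) = sum_ij b_ij e^{u_i + v_j}
    over u in L(r) and v in L(c). Names are illustrative assumptions."""
    l, m = B.shape
    u, v = np.zeros(l), np.zeros(m)
    for _ in range(iters):
        # Sinkhorn row step: u_tilde makes the row sums of D(u) B D(v) equal r.
        u_t = np.log(r) - np.log(B @ np.exp(v))
        # Shift by a * 1_l so that r^T u = 0, i.e. u lies in L(r).
        u = u_t - (r @ u_t) / r.sum()
        # Same two steps for the columns.
        v_t = np.log(c) - np.log(B.T @ np.exp(u))
        v = v_t - (c @ v_t) / c.sum()
    return u, v

rng = np.random.default_rng(1)
B = rng.uniform(0.5, 2.0, size=(3, 4))              # positive matrix
r = np.array([0.2, 0.3, 0.5])
c = np.array([0.1, 0.2, 0.3, 0.4])
u, v = sinkhorn_partial_min(B, r, c)
S = np.exp(u)[:, None] * B * np.exp(v)[None, :]     # D(u) B D(v)
row_ratio = S.sum(axis=1) / r                       # should be the constant b
col_ratio = S.sum(axis=0) / c
```

The shifts only rescale $D(\mathbf{u})BD(\mathbf{v})$ by a positive scalar, which is why the limit has row and column sums $b\mathbf{r}$ and $b\mathbf{c}$ rather than exactly $\mathbf{r}$ and $\mathbf{c}$.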

In the case where $B$ has some zero entries, the scaling problem is solvable if and only if there exists a nonnegative matrix $C=[c_{i,j}]\in\mathbb{R}^{l\times m}$ with the same zero pattern as $B$ ($b_{i,j}=0\iff c_{i,j}=0$) and with the row and column sums $\mathbf{r},\mathbf{c}$ [14]. The existence of such $C$ is a linear programming problem that can be solved in polynomial time [12, 13, 8]. If $B$ can be scaled, it is possible to convert the scaling problem to partial minimization of $f(\mathbf{x})$ on a corresponding subspace $L\subset\mathrm{L}(\mathbf{r})\times\mathrm{L}(\mathbf{c})$.

We now summarize the contents of the paper. In Section 2 we show that our algorithm converges geometrically to $\mathbf{x}^{\star}$, the unique minimum point of $f$. Denote by $V(t)=\{\mathbf{x}\in\mathbb{R}^{n},f(\mathbf{x})\leq t\}$ the compact convex sublevel set corresponding to $t\geq t^{\star}=f(\mathbf{x}^{\star})$. Let $0<\alpha(t)\leq\beta(t)$ be the minimum and the maximum of the smallest and the biggest eigenvalues of the Hessian $H(f)$ in $V(t)$. Let $\kappa(t)=\frac{\beta(t)}{\alpha(t)}$. Set $t_{k}=f(\mathbf{x}_{k})$, where $\mathbf{x}_{k}$ are given by our algorithm. Then $t_{k}$ is a strictly decreasing sequence which converges to $t^{\star}$, unless the algorithm reaches $\mathbf{x}^{\star}$ in a finite number of steps. Theorem 2.4 shows that the rate of convergence of $\mathbf{x}_{k}$ to $\mathbf{x}^{\star}$ and $t_{k}$ to $t^{\star}$ is at least of order $(1-\frac{1}{(d-1)\kappa(t_{0})})^{k-1}$. More precisely,

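$$|t_{k}-t^{\star}|\leq\frac{\|\nabla f(\mathbf{x}_{1})\|^{2}}{\alpha^{2}(t_{1})}\prod_{q=1}^{k-1}\Big(1-\frac{1}{(d-1)\kappa(t_{q})}\Big),$$

$$\|\mathbf{x}_{k}-\mathbf{x}^{\star}\|^{2}\leq\frac{\|\nabla f(\mathbf{x}_{1})\|^{2}}{\alpha(t_{1})\alpha(t_{k})}\prod_{q=1}^{k-1}\Big(1-\frac{1}{(d-1)\kappa(t_{q})}\Big).$$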

In Section 3 we recall our results on tensor scaling [8]. Assume that $\mathcal{B}=[b_{i_{1},\ldots,i_{d}}]\in\mathbb{R}^{m_{1}\times\ldots\times m_{d}}$ is a given nonnegative $d$-mode tensor. Let $\mathbf{x}=(\mathbf{x}_{1},\ldots,\mathbf{x}_{d})$, where $\mathbf{x}_{j}=(x_{j,1},\ldots,x_{j,m_{j}})^{\top}\in\mathbb{R}^{m_{j}}$. A scaling of $\mathcal{B}$ is the tensor $\mathcal{B}(\mathbf{x})=[e^{x_{1,i_{1}}+\ldots+x_{d,i_{d}}}b_{i_{1},\ldots,i_{d}}]$. Let $\mathbf{s}_{j}$ be positive probability vectors in $\mathbb{R}^{m_{j}}$ for $j\in[d]$. Then the scaling problem is to find $\mathbf{x}$ such that the $j$-th slice sum of $\mathcal{B}(\mathbf{x})$, obtained by summing on the indices $i_{1},\ldots,i_{j-1},i_{j+1},\ldots,i_{d}$, is $\mathbf{s}_{j}$ for each $j\in[d]$. If $\mathcal{B}$ is positive then such a scaling exists. If $\mathcal{B}$ has zero entries then such a scaling exists if and only if there exists a nonnegative tensor $\mathcal{C}$ with the same zero pattern as $\mathcal{B}$ and with the slice sums $\mathbf{s}_{1},\ldots,\mathbf{s}_{d}$ [3, 7, 8]. We show that if a scaling of $\mathcal{B}$ exists then it can be achieved by finding the minimum of the strictly convex function $f$ on a subspace $\mathrm{L}\subset\mathrm{L}(\mathbf{s}_{1})\times\ldots\times\mathrm{L}(\mathbf{s}_{d})$.

In Section 4 we discuss the application of our algorithm to tensor scaling. In the case where $\mathcal{B}$ is positive, or more generally, where the strictly convex function $f$ is defined on the whole of $\mathrm{L}(\mathbf{s}_{1})\times\ldots\times\mathrm{L}(\mathbf{s}_{d})$, our algorithm applies directly. For matrices, $d=2$, it is exactly the Sinkhorn scaling algorithm, which was explained above. In the case of tensors, $d\geq 3$, the algorithm chooses each time the slice to scale. In the case where $f$ is strictly convex on a subspace $\mathrm{L}\subset\mathrm{L}(\mathbf{s}_{1})\times\ldots\times\mathrm{L}(\mathbf{s}_{d})$, we describe a simple modification of our algorithm and justify its geometric convergence.

In Section 5 we show that our algorithm applies also to a generalized discrete Schrödinger’s bridge problem. (The discrete Schrödinger’s bridge problem is a scaling of a given column stochastic matrix to another column stochastic matrix $B$ so that $B\mathbf{a}=\mathbf{b}$, where $\mathbf{a},\mathbf{b}$ are two given positive probability vectors [10, 9].)

2. The convergence of the algorithm

Lemma 2.1.

Let $f\in\mathrm{C}^{2}(\mathbb{R}^{n})$ be strictly convex. Then the following conditions are equivalent:

(1) The function $f$ has a unique minimum $\mathbf{x}^{\star}\in\mathbb{R}^{n}$.

(2) The condition (1.1) holds.

Proof.

(1) $\Rightarrow$ (2). Let $\mathrm{S}^{n-1}$ be the $n-1$ dimensional sphere $\|\mathbf{y}-\mathbf{x}^{\star}\|=1$ . Fix $\mathbf{y}\in\mathrm{S}^{n-1}$ . Consider the strict convex function in one variable: $g_{\mathbf{y}}(t)=f(\mathbf{x}^{\star}+t(\mathbf{y}-\mathbf{x}^{\star}))$ . Then $g_{\mathbf{y}}^{\prime}(0)=0$ and $g_{\mathbf{y}}^{\prime}(1)=\nabla f(\mathbf{y})^{\top}(\mathbf{y}-\mathbf{x}^{\star})>0$ . Let $\nu=\min\{g_{\mathbf{y}}^{\prime}(1),\mathbf{y}\in\mathrm{S}^{n-1}\}$ . Clearly, $\nu>0$ . As $g_{\mathbf{y}}^{\prime}(t)$ increases for $t>0$ it follows that $g_{\mathbf{y}}^{\prime}(t)\geq g^{\prime}_{\mathbf{y}}(1)$ for $t\geq 1$ . In particular,

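$$g_{\mathbf{y}}(t)\geq g_{\mathbf{y}}(1)+g_{\mathbf{y}}^{\prime}(1)(t-1)\geq f(\mathbf{x}^{\star})+\nu(t-1)\quad\text{for }t\geq 1.$$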

Hence $f(\mathbf{x})\geq f(\mathbf{x}^{\star})+\nu(\|\mathbf{x}-\mathbf{x}^{\star}\|-1)$ if $\|\mathbf{x}-\mathbf{x}^{\star}\|\geq 1$. This inequality yields (1.1).

(2) $\Rightarrow$ (1). Fix $\mathbf{x}_{0}\in\mathbb{R}^{n}$. Then there exists $r>0$ such that $\min\{f(\mathbf{x}),\|\mathbf{x}-\mathbf{x}_{0}\|=r\}>f(\mathbf{x}_{0})$. Let $\min\{f(\mathbf{x}),\|\mathbf{x}-\mathbf{x}_{0}\|\leq r\}=f(\mathbf{x}^{\star})$. Clearly, $\|\mathbf{x}^{\star}-\mathbf{x}_{0}\|<r$. Therefore $\nabla f(\mathbf{x}^{\star})=\mathbf{0}$. As $f(\mathbf{x})$ is convex we deduce that $f(\mathbf{x})\geq f(\mathbf{x}^{\star})$ for each $\mathbf{x}\in\mathbb{R}^{n}$. As $f(\mathbf{x})$ is strictly convex, $\mathbf{x}^{\star}$ is the unique point of minimum of $f$. ∎

Note that the function $f(x)=e^{x},x\in\mathbb{R}$ is strictly convex on $\mathbb{R}$ but $f(x)$ does not have a minimum on $\mathbb{R}$ .

In what follows we assume that $f\in\mathrm{C}^{2}(\mathbb{R}^{n})$ is strictly convex and $\mathbf{x}^{\star}$ is the unique minimum point of $f$. Then for each $\mathbf{x}\in\mathbb{R}^{n}\setminus\{\mathbf{x}^{\star}\}$ the sublevel set

$$V(t)=\{\mathbf{y}\in\mathbb{R}^{n},\ f(\mathbf{y})\leq t\},\quad t=f(\mathbf{x})$$

is a compact strictly convex set, with a $\mathrm{C}^{2}$ boundary $\partial V(t)$ , with an interior containing $\mathbf{x}^{\star}$ . Let $t^{\star}=f(\mathbf{x}^{\star})$ . Then $V(t^{\star})=\{\mathbf{x}^{\star}\}$ . Thus $\mathbb{R}^{n}\setminus\{\mathbf{x}^{\star}\}$ is parametrized by $\partial V(t),t>t^{\star}$ .

Fix $t_{0}=f(\mathbf{x}_{0})>t^{\star}$ . Then $f$ is uniformly strictly convex in $V(t_{0})$ : The eigenvalues of $H(f)(\mathbf{x}),\mathbf{x}\in V(t_{0})$ are in a fixed interval $[\alpha(t_{0}),\beta(t_{0})]$ for some $0<\alpha(t_{0})\leq\beta(t_{0})$ . Thus for each $\mathbf{x},\mathbf{y}\in V(t_{0})$ we have the inequalities:

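$$f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})+\frac{\alpha(t_{0})}{2}\|\mathbf{y}-\mathbf{x}\|^{2}\leq f(\mathbf{y})\leq \tag{2.1}$$

$$f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})+\frac{\beta(t_{0})}{2}\|\mathbf{y}-\mathbf{x}\|^{2}. \tag{2.2}$$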

In particular, for $\mathbf{x}\in V(t_{0})$ we have

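$$f(\mathbf{x}^{\star})+\frac{\alpha(t_{0})}{2}\|\mathbf{x}-\mathbf{x}^{\star}\|^{2}\leq f(\mathbf{x})\leq f(\mathbf{x}^{\star})+\frac{\beta(t_{0})}{2}\|\mathbf{x}-\mathbf{x}^{\star}\|^{2}. \tag{2.3}$$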

Denote by $\mathrm{B}(\mathbf{x},R^{2})$ the closed ball $\{\mathbf{y}\in\mathbb{R}^{n},\|\mathbf{x}-\mathbf{y}\|^{2}\leq R^{2}\}$ . Let $\kappa(t_{0})=\frac{\beta(t_{0})}{\alpha(t_{0})}$ and define

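$$\mathbf{x}^{+}=\mathbf{x}-\frac{1}{\beta(t_{0})}\nabla f(\mathbf{x}),\quad\mathbf{x}^{++}=\mathbf{x}-\frac{1}{\alpha(t_{0})}\nabla f(\mathbf{x}).$$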

In what follows we need the following lemma:

Lemma 2.2.

Assume that $\mathbf{x}\in V(t_{0})$ . Let

$$\mathbf{x}^{a}=\mathbf{x}-\frac{2}{\beta(t_{0})}\nabla f(\mathbf{x}).$$

Then

(1) $f(\mathbf{x}^{a})\leq f(\mathbf{x})$.

(2) $[\mathbf{x},\mathbf{x}^{a}]\subset V(t_{0})$.

(3) $f(\mathbf{x})-f(\mathbf{x}^{\star})\geq f(\mathbf{x})-f(\mathbf{x}^{+})\geq\frac{\|\nabla f(\mathbf{x})\|^{2}}{2\beta(t_{0})}$.

Proof.

(1) If $\nabla f(\mathbf{x})=\mathbf{0}$, i.e., $\mathbf{x}=\mathbf{x}^{\star}$, then (1) trivially holds. Suppose that $\nabla f(\mathbf{x})\neq\mathbf{0}$ and assume to the contrary that $f(\mathbf{x}^{a})>f(\mathbf{x})$. Let $h(t)=f(\mathbf{x}-t\nabla f(\mathbf{x}))$. Then $h^{\prime}(0)=-\|\nabla f(\mathbf{x})\|^{2}$. Recall that $h(t)$ is a strictly convex function. Hence there exists $t_{1}\in(0,\frac{2}{\beta(t_{0})})$ such that $h^{\prime}(t_{1})=0$ and $h^{\prime}(t)>0$ for $t>t_{1}$. Thus there exists $t_{2}\in(t_{1},\frac{2}{\beta(t_{0})})$ such that $f(\mathbf{y})=f(\mathbf{x})$ for $\mathbf{y}=\mathbf{x}-t_{2}\nabla f(\mathbf{x})$. Note that $\mathbf{y}\in V(t_{0})$. This contradicts the inequality (2.2).

(2) As $f(\mathbf{x}^{a})\leq f(\mathbf{x})\leq t_{0}$ the convexity of $f$ yields that the interval $[\mathbf{x},\mathbf{x}^{a}]$ is in $V(t_{0})$ .

(3) Clearly $\mathbf{x}^{+}=\frac{1}{2}(\mathbf{x}+\mathbf{x}^{a})\in[\mathbf{x},\mathbf{x}^{a}]$ . Hence

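$$f(\mathbf{x}^{+})\leq f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{x}^{+}-\mathbf{x})+\frac{\beta(t_{0})}{2}\|\mathbf{x}^{+}-\mathbf{x}\|^{2}=f(\mathbf{x})-\frac{\|\nabla f(\mathbf{x})\|^{2}}{2\beta(t_{0})}. \tag{2.6}$$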

Therefore (3) holds. ∎
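Parts (1) and (3) of Lemma 2.2 are easy to check numerically on a quadratic, where the Hessian is the constant matrix $2A$ and $\beta$ is its largest eigenvalue on every sublevel set. The snippet below is a sanity check under that assumption; the random matrix and all variable names are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)                     # symmetric positive definite
b = rng.standard_normal(5)
f = lambda x: x @ A @ x + b @ x             # f(x) = x^T A x + b^T x
beta = np.linalg.eigvalsh(2.0 * A)[-1]      # largest eigenvalue of the Hessian 2A

x = rng.standard_normal(5)
g = 2.0 * A @ x + b                         # gradient at x
x_plus = x - g / beta                       # x^+  (step 1/beta)
x_a = x - 2.0 * g / beta                    # x^a  (step 2/beta)
decrease = f(x) - f(x_plus)                 # left side of part (3)
lower = g @ g / (2.0 * beta)                # ||grad f(x)||^2 / (2 beta)
```

Here `decrease >= lower` is the inequality of part (3), and `f(x_a) <= f(x)` is part (1).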

We now state the following simple lemma, which is essentially in [6]:

Lemma 2.3.

Assume that $f\in\mathrm{C}^{2}(\mathbb{R}^{n})$ is strictly convex and $\mathbf{x}^{\star}$ is the unique minimum point of $f$ . Fix $\mathbf{x}\in V(t_{0})$ and assume that $\mathbf{x}^{\star}\in\mathrm{B}(\mathbf{x},R_{0}^{2})$ . Then we can choose $R_{0}=R(\mathbf{x})$ and the following conditions hold:

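$$R(\mathbf{x})^{2}=\frac{2}{\alpha(t_{0})}(f(\mathbf{x})-f(\mathbf{x}^{\star}))\leq\frac{\|\nabla f(\mathbf{x})\|^{2}}{\alpha^{2}(t_{0})}, \tag{2.7}$$

$$\mathbf{x}^{\star}\in\mathrm{B}\Big(\mathbf{x}^{++},\frac{\|\nabla f(\mathbf{x})\|^{2}}{\alpha^{2}(t_{0})}-\frac{2}{\alpha(t_{0})}(f(\mathbf{x})-f(\mathbf{x}^{\star}))\Big)\subseteq\mathrm{B}\Big(\mathbf{x}^{++},\frac{\|\nabla f(\mathbf{x})\|^{2}}{\alpha^{2}(t_{0})}\Big(1-\frac{1}{\kappa(t_{0})}\Big)-\frac{2}{\alpha(t_{0})}(f(\mathbf{x}^{+})-f(\mathbf{x}^{\star}))\Big), \tag{2.8}$$

$$\frac{\|\nabla f(\mathbf{x})\|}{\beta(t_{0})}\leq\|\mathbf{x}-\mathbf{x}^{\star}\|. \tag{2.9}$$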

Proof.

As $\mathbf{x}\in V(t_{0})$ the left hand side of (2.3) yields that $\mathbf{x}^{\star}\in\mathrm{B}(\mathbf{x},R(\mathbf{x})^{2})$, where $R(\mathbf{x})^{2}$ is given by (2.7). Clearly

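$$\|\mathbf{x}^{\star}-\mathbf{x}^{++}\|^{2}=\Big\|\mathbf{x}^{\star}-\mathbf{x}+\frac{1}{\alpha(t_{0})}\nabla f(\mathbf{x})\Big\|^{2}=\|\mathbf{x}^{\star}-\mathbf{x}\|^{2}+\frac{2}{\alpha(t_{0})}\nabla f(\mathbf{x})^{\top}(\mathbf{x}^{\star}-\mathbf{x})+\frac{\|\nabla f(\mathbf{x})\|^{2}}{\alpha^{2}(t_{0})}.$$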

As $\mathbf{x}^{\star},\mathbf{x}\in V(t_{0})$ (2.1) yields:

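$$f(\mathbf{x}^{\star})\geq f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{x}^{\star}-\mathbf{x})+\frac{\alpha(t_{0})}{2}\|\mathbf{x}^{\star}-\mathbf{x}\|^{2}.$$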

Thus

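$$\|\mathbf{x}^{\star}-\mathbf{x}^{++}\|^{2}\leq\frac{\|\nabla f(\mathbf{x})\|^{2}}{\alpha^{2}(t_{0})}-\frac{2}{\alpha(t_{0})}(f(\mathbf{x})-f(\mathbf{x}^{\star})).$$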

This proves the first part of (2.8). Hence the inequality in (2.7) holds. Use part (3) of Lemma 2.2 to replace $f(\mathbf{x})$ in the first part of (2.8) by a smaller quantity $f(\mathbf{x}^{+})+\frac{\|\nabla f(\mathbf{x})\|^{2}}{2\beta(t_{0})}$ to obtain the second part of (2.8).

Combine (2.6) with (2.3) to deduce

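$$\frac{\|\nabla f(\mathbf{x})\|^{2}}{2\beta(t_{0})}\leq f(\mathbf{x})-f(\mathbf{x}^{+})\leq f(\mathbf{x})-f(\mathbf{x}^{\star})\leq\frac{\beta(t_{0})}{2}\|\mathbf{x}-\mathbf{x}^{\star}\|^{2}.$$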

This shows the inequality (2.9). ∎

We now show that in our algorithm the sequences $\mathbf{x}_{k},f(\mathbf{x}_{k}),k\in\mathbb{N}$ converge geometrically to $\mathbf{x}^{\star},f(\mathbf{x}^{\star})$ respectively:

Theorem 2.4.

Assume that $f\in\mathrm{C}^{2}(\mathbb{R}^{n})$ is a strict convex function which has a unique minimum point $\mathbf{x}^{\star}$ . Let $\mathbf{x}_{0}\in\mathbb{R}^{n}$ and $\mathbf{x}_{k},k\in\mathbb{N}$ be given by our algorithm. Set $t_{k}=f(\mathbf{x}_{k})$ for $k\in\mathbb{Z}_{+}$ . Assume that the eigenvalues of $H(f)(\mathbf{x}),\mathbf{x}\in V(t_{k})$ are in the minimal interval $[\alpha(t_{k}),\beta(t_{k})]$ , where $0<\alpha(t_{k})\leq\beta(t_{k})$ . Denote $\kappa(t_{k})=\frac{\beta(t_{k})}{\alpha(t_{k})}$ .

(1) If $\mathbf{x}_{k-1}\neq\mathbf{x}^{\star}$ for some $k\in\mathbb{N}$ then $t_{k-1}>t_{k}$.

(2) The sequences $\{t_{k}\},\{\beta(t_{k})\},\{-\alpha(t_{k})\},\{\kappa(t_{k})\},k\in\mathbb{Z}_{+}$ are nonincreasing sequences which converge to $t^{\star},\beta(t^{\star}),-\alpha(t^{\star}),\kappa(t^{\star})$ respectively.

(3) For each $k\in\mathbb{N}$ the following inequalities hold:

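$$f(\mathbf{x}_{k})-f(\mathbf{x}^{\star})\leq(f(\mathbf{x}_{0})-f(\mathbf{x}^{\star}))\Big(1-\frac{1}{d\kappa(t_{0})}\Big)\prod_{i=1}^{k-1}\Big(1-\frac{1}{(d-1)\kappa(t_{i})}\Big)\leq \tag{2.11}$$

$$\frac{\|\nabla f(\mathbf{x}_{0})\|^{2}}{2\alpha(t_{0})}\Big(1-\frac{1}{d\kappa(t_{0})}\Big)\prod_{i=1}^{k-1}\Big(1-\frac{1}{(d-1)\kappa(t_{i})}\Big)\leq \tag{2.12}$$

$$\frac{\|\nabla f(\mathbf{x}_{0})\|^{2}}{2\alpha(t_{0})}\Big(1-\frac{1}{d\kappa(t_{0})}\Big)\Big(1-\frac{1}{(d-1)\kappa(t_{0})}\Big)^{k-1},$$

$$\|\mathbf{x}_{k}-\mathbf{x}^{\star}\|^{2}\leq\frac{\|\nabla f(\mathbf{x}_{0})\|^{2}}{\alpha(t_{k})\alpha(t_{0})}\Big(1-\frac{1}{d\kappa(t_{0})}\Big)\prod_{i=1}^{k-1}\Big(1-\frac{1}{(d-1)\kappa(t_{i})}\Big). \tag{2.13}$$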

Proof.

Note

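$$\|\nabla f(\mathbf{x})\|^{2}=\sum_{l=1}^{d}\|\nabla_{l}f(\mathbf{x})\|^{2}\ \Rightarrow\ \max\{\|\nabla_{l}f(\mathbf{x})\|,\ l\in[d]\}\geq\frac{\|\nabla f(\mathbf{x})\|}{\sqrt{d}}.$$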

(1) Clearly if $\mathbf{x}_{k-1}=\mathbf{x}^{\star}$ then $\mathbf{x}_{p}=\mathbf{x}^{\star}$ for $p\geq k$ . Assume that $\mathbf{x}_{k-1}\neq\mathbf{x}^{\star}$ . Then $\|\nabla f(\mathbf{x}_{k-1})\|>0$ . Let $j_{k-1}\in\operatorname{arg\,max}\{\|\nabla_{l}f(\mathbf{x}_{k-1})\|,l\in[d]\}$ . Then $\|\nabla_{j_{k-1}}f(\mathbf{x}_{k-1})\|>0$ . Hence $t_{k-1}>t_{k}$ .

(2) As $\{t_{k}\},k\in\mathbb{Z}_{+}$ is a nonincreasing sequence we deduce that $V(t_{k})\subseteq V(t_{k-1})$ for $k\in\mathbb{N}$. Hence the sequence $\{\alpha(t_{k})\},k\in\mathbb{N}$ is nondecreasing, and the sequences $\{\beta(t_{k})\},k\in\mathbb{N}$ and $\{\kappa(t_{k})\},k\in\mathbb{N}$ are nonincreasing. The equality $\lim_{k\to\infty}t_{k}=t^{\star}$ follows from (2.11). The inequality (2.13) yields

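$$\lim_{k\to\infty}\mathbf{x}_{k}=\mathbf{x}^{\star},\quad\lim_{k\to\infty}\alpha(t_{k})=\alpha(t^{\star}),\quad\lim_{k\to\infty}\beta(t_{k})=\beta(t^{\star}),\quad\lim_{k\to\infty}\kappa(t_{k})=\kappa(t^{\star}).$$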

(3) First we show the inequality (2.11) for $k=1$. Assume that $j_{0}\in\operatorname{arg\,max}\{\|\nabla_{l}f(\mathbf{x}_{0})\|,l\in[d]\}$. Hence $\|\nabla_{j_{0}}f(\mathbf{x}_{0})\|\geq\frac{\|\nabla f(\mathbf{x}_{0})\|}{\sqrt{d}}$. Let $g(\mathbf{x}_{j_{0}})=f(\mathbf{x}_{0}^{j_{0}},\mathbf{x}_{j_{0}})$, where $\mathbf{x}_{0}=(\mathbf{x}_{0}^{j_{0}},\mathbf{x}_{j_{0},0})$. Thus $g$ is a strictly convex function, whose Hessian is a principal submatrix of the Hessian of $f$. Hence the eigenvalues of the Hessian of $g$ are also in the interval $[\alpha(t_{0}),\beta(t_{0})]$. Recall that $\operatorname{arg\,min}g=\mathbf{x}_{j_{0}}^{\star}=\mathbf{x}_{j_{0}}(\mathbf{x}_{0}^{j_{0}})$. Then $\mathbf{x}_{1}=(\mathbf{x}_{0}^{j_{0}},\mathbf{x}_{j_{0}}^{\star})$. We now estimate $g(\mathbf{x}_{j_{0},0})-g(\mathbf{x}_{j_{0}}^{\star})$ from below. Part (3) of Lemma 2.2, applied to $g$, yields:

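$$f(\mathbf{x}_{0})-f(\mathbf{x}_{1})=g(\mathbf{x}_{j_{0},0})-g(\mathbf{x}_{j_{0}}^{\star})\geq\frac{\|\nabla g(\mathbf{x}_{j_{0},0})\|^{2}}{2\beta(t_{0})}=\frac{\|\nabla_{j_{0}}f(\mathbf{x}_{0})\|^{2}}{2\beta(t_{0})}\geq\frac{\|\nabla f(\mathbf{x}_{0})\|^{2}}{2\beta(t_{0})d}.$$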

The inequality (2.7) yields $f(\mathbf{x}_{0})-f(\mathbf{x}^{\star})\leq\frac{\|\nabla f(\mathbf{x}_{0})\|^{2}}{2\alpha(t_{0})}$. Assuming that $f(\mathbf{x}_{0})>f(\mathbf{x}^{\star})$ we obtain

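$$\frac{f(\mathbf{x}_{1})-f(\mathbf{x}^{\star})}{f(\mathbf{x}_{0})-f(\mathbf{x}^{\star})}=1-\frac{f(\mathbf{x}_{0})-f(\mathbf{x}_{1})}{f(\mathbf{x}_{0})-f(\mathbf{x}^{\star})}\leq 1-\Big(\frac{\|\nabla f(\mathbf{x}_{0})\|^{2}}{2\beta(t_{0})d}\Big)\Big/\Big(\frac{\|\nabla f(\mathbf{x}_{0})\|^{2}}{2\alpha(t_{0})}\Big)=1-\frac{1}{d\kappa(t_{0})}.$$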

This proves the first inequality in (2.11) for $k=1$ .

Assume now that $k=2$ . The definition of $\mathbf{x}_{1}$ yields that $\nabla_{j_{0}}f(\mathbf{x}_{1})=0$ . Hence $\max\{\|\nabla_{l}f(\mathbf{x}_{1})\|,l\in[d]\}\geq\frac{\|\nabla f(\mathbf{x}_{1})\|}{\sqrt{d-1}}$ . Use the same arguments as above to show that $f(\mathbf{x}_{2})-f(\mathbf{x}^{\star})\leq(f(\mathbf{x}_{1})-f(\mathbf{x}^{\star}))(1-\frac{1}{(d-1)\kappa(t_{1})})$ . Hence (2.11) holds for $k=2$ . Similarly, the inequality (2.11) holds for each $k\geq 2$ .

Use the inequality (2.7) to deduce the inequality in (2.12). As $\kappa(t_{k})\leq\kappa(t_{0})$ for each $k\in\mathbb{N}$ we deduce the inequality below (2.12). According to Lemma 2.3 $\mathbf{x}^{\star}\in\mathrm{B}(\mathbf{x}_{k},R^{2}(\mathbf{x}_{k}))$ . Use (2.12) to deduce (2.13). ∎
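As a numerical illustration of the bound in part (3), the sketch below runs the greedy partial minimization on a random quadratic and compares the optimality gap $f(\mathbf{x}_{k})-f(\mathbf{x}^{\star})$ with $(f(\mathbf{x}_{0})-f(\mathbf{x}^{\star}))(1-\frac{1}{d\kappa})(1-\frac{1}{(d-1)\kappa})^{k-1}$. For a quadratic the Hessian is constant, so $\kappa(t_{k})=\kappa$ for all $k$. The test problem, block sizes, and variable names are assumptions made for this check.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 3
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                 # symmetric positive definite
b = rng.standard_normal(n)
blocks = np.split(np.arange(n), d)          # d blocks of equal size

f = lambda x: x @ A @ x + b @ x
x_star = np.linalg.solve(2.0 * A, -b)       # minimizer: 2 A x + b = 0
eigs = np.linalg.eigvalsh(2.0 * A)          # Hessian of f is the constant matrix 2A
kappa = eigs[-1] / eigs[0]

x = rng.standard_normal(n)                  # starting point x_0
gap0 = f(x) - f(x_star)
gaps, bounds = [], []
for k in range(1, 31):
    grad = 2.0 * A @ x + b
    j = int(np.argmax([np.linalg.norm(grad[idx]) for idx in blocks]))
    idx = blocks[j]
    rest = np.setdiff1d(np.arange(n), idx)
    # Exact partial minimum over the chosen block.
    x[idx] = np.linalg.solve(2.0 * A[np.ix_(idx, idx)],
                             -(b[idx] + 2.0 * A[np.ix_(idx, rest)] @ x[rest]))
    gaps.append(f(x) - f(x_star))
    bounds.append(gap0 * (1 - 1 / (d * kappa))
                  * (1 - 1 / ((d - 1) * kappa)) ** (k - 1))
```

In such experiments the observed gaps typically fall well below the bound, which is a worst-case estimate.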

Observe that our algorithm is an alternating algorithm for $d=2$ after the first step.

3. The tensor scaling problem

In this section we first recall briefly the results in [8] that we need. For positive integers $d,m_{1},\ldots,m_{d}$ denote by $\mathbb{R}^{m_{1}\times\ldots\times m_{d}}$ the linear space of $d$-mode tensors $\mathcal{A}=[a_{i_{1},i_{2},\ldots,i_{d}}],i_{j}\in[m_{j}],j\in[d]$. Note that a $1$-mode tensor is a vector, and a $2$-mode tensor is a matrix. Assume that $d\geq 2$. For a fixed $i_{k}\in[m_{k}]$ the $(d-1)$-mode tensor $[a_{i_{1},\ldots,i_{d}}],i_{j}\in[m_{j}],j\in[d]\setminus\{k\}$ is called the $(k,i_{k})$ slice of $\mathcal{A}$. For $d=2$ the $(1,i)$ slice and the $(2,j)$ slice are the $i$-th row and the $j$-th column of a given matrix. In the rest of the paper we assume:

$$d\geq 2,\quad m_{j}\geq 2\ \text{ for }j\in[d].$$

Let

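$$s_{k,i_{k}}:=\sum_{i_{j}\in[m_{j}],\,j\in[d]\setminus\{k\}}a_{i_{1},\ldots,i_{d}},\quad i_{k}\in[m_{k}],\ k\in[d] \tag{3.2}$$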

be the $(k,i_{k})$ -slice sum. Denote

[TABLE]

the $k$ -slice sum. Note that $k$ -slice sums satisfy the compatibility conditions

[TABLE]

Two $d$-mode tensors $\mathcal{A}=[a_{i_{1},i_{2},\ldots,i_{d}}],\mathcal{B}=[b_{i_{1},i_{2},\ldots,i_{d}}]\in\mathbb{R}^{m_{1}\times\ldots\times m_{d}}$ are called positive diagonally equivalent if there exist $\mathbf{x}_{k}=(x_{k,1},\ldots,x_{k,m_{k}})^{\top}\in\mathbb{R}^{m_{k}},k\in[d]$ such that $a_{i_{1},\ldots,i_{d}}=e^{x_{1,i_{1}}+\ldots+x_{d,i_{d}}}b_{i_{1},\ldots,i_{d}}$ for all $i_{j}\in[m_{j}]$ and $j\in[d]$. Denote by $\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}$ the cone of (entrywise) nonnegative $d$-mode tensors.
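The definitions of slice sums and positive diagonal equivalence translate directly into array operations. The following NumPy sketch builds $\mathcal{A}$ from $\mathcal{B}$ and vectors $\mathbf{x}_{1},\ldots,\mathbf{x}_{d}$ and computes all $(k,i_{k})$-slice sums for a $3$-mode example. The helper names `scale_tensor` and `slice_sums` and the random data are hypothetical, introduced here only for illustration.

```python
import numpy as np

def scale_tensor(B, xs):
    """Tensor with entries e^{x_{1,i_1} + ... + x_{d,i_d}} * b_{i_1,...,i_d}."""
    A = B.astype(float).copy()
    for k, x in enumerate(xs):
        shape = [1] * B.ndim
        shape[k] = -1
        A = A * np.exp(x).reshape(shape)    # broadcast e^{x_k} along mode k
    return A

def slice_sums(A):
    """Entry k is the vector of (k, i_k)-slice sums (sum over all other modes)."""
    return [A.sum(axis=tuple(j for j in range(A.ndim) if j != k))
            for k in range(A.ndim)]

rng = np.random.default_rng(3)
B = rng.uniform(0.1, 1.0, size=(2, 3, 4))   # positive 3-mode tensor
xs = [rng.standard_normal(m) for m in B.shape]
A = scale_tensor(B, xs)                      # positive diagonally equivalent to B
s = slice_sums(A)                            # s[k] has length m_k
```

Each vector `s[k]` sums to the total of all entries of `A`, so the $d$ slice-sum vectors of any tensor always share the same total.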

We assume that $\mathcal{B}=[b_{i_{1},i_{2},\ldots,i_{d}}]\in\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}$ is a given nonnegative tensor with no zero slice $(k,i_{k})$. Let $\mathbf{s}_{k}\in\mathbb{R}_{+}^{m_{k}},k\in[d]$ be $d$ given positive vectors satisfying the conditions (3.4). Denote by $\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}(\mathcal{B},\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ the set of all nonnegative $\mathcal{A}=[a_{i_{1},i_{2},\ldots,i_{d}}]\in\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}$ having the same zero pattern as $\mathcal{B}$, i.e. $a_{i_{1},\ldots,i_{d}}=0\iff b_{i_{1},\ldots,i_{d}}=0$ for all indices $i_{1},\ldots,i_{d}$, and satisfying the condition (3.2). We now recall the necessary and sufficient conditions on $\mathcal{B}$ so that $\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}(\mathcal{B},\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ contains a tensor $\mathcal{A}$ which is positive diagonally equivalent to $\mathcal{B}$. For matrices, i.e. $d=2$, this problem was solved by Menon [14] and Brualdi [4]. See also [15]. For the special case of positive diagonal equivalence to doubly stochastic matrices see [5] and [20]. The result of Menon was extended to tensors independently by Bapat-Raghavan [3] and Franklin-Lorenz [7]. (See [2] and [16] for the special case where all the entries of $\mathcal{B}$ are positive.) In [8] we gave necessary and sufficient conditions for the solution of this problem:

Theorem 3.1.

Let $\mathcal{B}=[b_{i_{1},i_{2},\ldots,i_{d}}]\in\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}$, $(d\geq 2)$, be a given nonnegative tensor with no $(k,i_{k})$-zero slice. Let $\mathbf{s}_{k}\in\mathbb{R}_{+}^{m_{k}},k=1,\ldots,d$ be given positive vectors satisfying (3.4). Then there exists a nonnegative tensor $\mathcal{A}\in\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}$, which is positive diagonally equivalent to $\mathcal{B}$ and has each $(k,i_{k})$-slice sum equal to $s_{k,i_{k}}$, if and only if the following condition holds: The system of the inequalities and equalities for $\mathbf{x}_{k}=(x_{k,1},\ldots,x_{k,m_{k}})^{\top}\in\mathbb{R}^{m_{k}},k=1,\ldots,d$,

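$$x_{1,i_{1}}+x_{2,i_{2}}+\ldots+x_{d,i_{d}}\leq 0\ \text{ if }\ b_{i_{1},i_{2},\ldots,i_{d}}>0,\tag{3.5}$$
$$\mathbf{s}_{k}^{\top}\mathbf{x}_{k}=0\ \text{ for }\ k=1,\ldots,d,\tag{3.6}$$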

imply one of the following equivalent conditions:

(1) $x_{1,i_{1}}+x_{2,i_{2}}+\ldots+x_{d,i_{d}}=0$ if $b_{i_{1},i_{2},\ldots,i_{d}}>0$.

(2) $\sum_{b_{i_{1},i_{2},\ldots,i_{d}}>0}(x_{1,i_{1}}+x_{2,i_{2}}+\ldots+x_{d,i_{d}})=0$.

In particular, there exists at most one tensor $\mathcal{A}\in\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}$ with $(k,i_{k})$ -slice sum $s_{k,i_{k}}$ for all $k,i_{k}$ , which is positive diagonally equivalent to $\mathcal{B}$ .

The above yields the following corollary.

Corollary 3.2.

Let $\mathcal{B}\in\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}$, $(d\geq 2)$, be a given nonnegative tensor with no $(k,i_{k})$-zero slice. Let $\mathbf{s}_{k}\in\mathbb{R}_{+}^{m_{k}},k=1,\ldots,d$ be given positive vectors. Then there exists a nonnegative tensor $\mathcal{C}\in\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}$, which is positive diagonally equivalent to $\mathcal{B}$ and has each $(k,i_{k})$-slice sum equal to $s_{k,i_{k}}$, if and only if there exists a nonnegative tensor $\mathcal{A}=[a_{i_{1},i_{2},\ldots,i_{d}}]\in\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}$, having the same zero pattern as $\mathcal{B}$, which satisfies (3.2).

For matrices, i.e. $d=2$, the above corollary is due to Menon [14]. For $d=3$ this result is due to [3, Thm 3] and for $d\geq 3$ to [7]. Brualdi in [4] gave a nice and simple characterization for the set of nonnegative matrices, with prescribed zero pattern and with given positive row and column sums, to be nonempty. It is an open problem to find an analog of Brualdi’s results for $d$-mode tensors, where $d\geq 3$.

Note that the conditions of Theorem 3.1 are stated as a linear programming problem. Hence the existence of a positive diagonally equivalent tensor $\mathcal{A}$ can be determined in polynomial time [12, 13]. If such $\mathcal{A}$ exists, it is shown in [8] that $\mathcal{A}$ can be found by computing the unique minimal point of a certain strictly convex function $f$. Note that $\mathcal{B}>0$ is always scalable, as the tensor $\mathcal{A}=b\mathbf{s}_{1}\otimes\cdots\otimes\mathbf{s}_{d}$ has the same zero pattern as $\mathcal{B}$ and satisfies (3.2) for $b=(\mathbf{1}_{m_{1}}^{\top}\mathbf{s}_{1})^{-(d-1)}$. Indeed, by (3.4) each $(k,i_{k})$-slice sum of $\mathbf{s}_{1}\otimes\cdots\otimes\mathbf{s}_{d}$ equals $s_{k,i_{k}}(\mathbf{1}_{m_{1}}^{\top}\mathbf{s}_{1})^{d-1}$.
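The rank-one tensor in the last remark is easy to verify numerically. The sketch below builds $b\,\mathbf{s}_{1}\otimes\cdots\otimes\mathbf{s}_{d}$ with the normalizing constant taken as $b=(\mathbf{1}_{m_{1}}^{\top}\mathbf{s}_{1})^{-(d-1)}$, the value for which the slice sums come out equal to $\mathbf{s}_{k}$, and recomputes its slice sums. The function names and the sample vectors are hypothetical, and the vectors are chosen to satisfy (3.4).

```python
from itertools import product
from math import prod


def rank_one_solution(s):
    """Tensor b * s_1 (x) ... (x) s_d with b = (sum of s_1) ** -(d - 1)."""
    d = len(s)
    b = sum(s[0]) ** (-(d - 1))
    shape = [len(s_k) for s_k in s]
    return {idx: b * prod(s_k[i] for s_k, i in zip(s, idx))
            for idx in product(*map(range, shape))}


def slice_sums(A, shape):
    """s_k[i] = sum of the (k, i) slice of A."""
    sums = [[0.0] * m for m in shape]
    for idx, value in A.items():
        for k, i in enumerate(idx):
            sums[k][i] += value
    return sums


s = [[1.0, 2.0], [0.5, 2.5], [1.0, 1.0, 1.0]]  # every vector sums to 3
A = rank_one_solution(s)
recovered = slice_sums(A, [len(s_k) for s_k in s])
print(recovered)
```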

Identify $\mathbb{R}^{m_{1}}\times\mathbb{R}^{m_{2}}\times\ldots\times\mathbb{R}^{m_{d}}$ with $\mathbb{R}^{n+d}$ , where $n+d=\sum_{k=1}^{d}m_{k}$ . We view $\mathbf{x}\in\mathbb{R}^{n+d}$ as a vector $(\mathbf{x}_{1}^{\top},\ldots,\mathbf{x}_{d}^{\top})^{\top}=(\mathbf{x}_{1},\ldots,\mathbf{x}_{d})$ , where $\mathbf{x}_{k}\in\mathbb{R}^{m_{k}},k\in[d]$ . Let $\|\mathbf{x}\|:=\sqrt{\mathbf{x}^{\top}\mathbf{x}}$ . Define

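$$\hat{f}(\mathbf{x})=\sum_{i_{j}\in[m_{j}],\,j\in[d]}b_{i_{1},\ldots,i_{d}}\,e^{x_{1,i_{1}}+\ldots+x_{d,i_{d}}}.$$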

Clearly, $\hat{f}$ is a convex function on $\mathbb{R}^{n+d}$ . Denote by $\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})\subset\mathbb{R}^{n+d}$ the subspace of vectors $(\mathbf{x}_{1},\ldots,\mathbf{x}_{d})$ satisfying the equalities (3.6). Thus $\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})=\mathrm{L}(\mathbf{s}_{1})\times\cdots\times\mathrm{L}(\mathbf{s}_{d})\equiv\mathbb{R}^{n}$ . In [8] we showed the following lemma:

Lemma 3.3.

Let $\mathcal{B}=[b_{i_{1},i_{2},\ldots,i_{d}}]\in\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}$, $(d\geq 2)$, be a given nonnegative tensor with no $(k,i_{k})$-zero slice. Let $\mathbf{s}_{k}\in\mathbb{R}_{+}^{m_{k}},k=1,\ldots,d$ be given positive vectors satisfying (3.4). Then there exists a nonnegative tensor $\mathcal{A}\in\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}$, which is positive diagonally equivalent to $\mathcal{B}$ and has each $(k,i_{k})$-slice sum equal to $s_{k,i_{k}}$, if and only if the restriction of $\hat{f}$ to the subspace $\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$, denoted as $\tilde{f}$, has a critical point.
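The link between critical points and slice sums can be made explicit by a short computation. The derivation below is a sketch under two assumptions that are not restated in this section: that $\hat{f}(\mathbf{x})$ is the exponential sum $\sum b_{i_{1},\ldots,i_{d}}e^{x_{1,i_{1}}+\ldots+x_{d,i_{d}}}$ of [8], and that $\mathrm{L}(\mathbf{s}_{k})=\{\mathbf{y}\in\mathbb{R}^{m_{k}}:\mathbf{s}_{k}^{\top}\mathbf{y}=0\}$. The symbols $\sigma_{k}$ and $\lambda_{k}$ are introduced here only for illustration.

```latex
\begin{align*}
\frac{\partial\hat{f}}{\partial x_{k,i_{k}}}(\mathbf{x})
  &= \sum_{i_{j}\in[m_{j}],\,j\in[d]\setminus\{k\}}
     b_{i_{1},\ldots,i_{d}}\,e^{x_{1,i_{1}}+\cdots+x_{d,i_{d}}}
   =: \sigma_{k,i_{k}}(\mathbf{x})
   && \text{(the $(k,i_{k})$-slice sum of $\mathcal{B}(\mathbf{x})$)},\\
\boldsymbol{\eta}\in\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})
  \text{ is critical for } \tilde{f}
  &\iff \sigma_{k}(\boldsymbol{\eta})\perp\mathrm{L}(\mathbf{s}_{k})
   \iff \sigma_{k}(\boldsymbol{\eta})=\lambda_{k}\mathbf{s}_{k},
   \qquad k\in[d],\\
\lambda_{k}\,\mathbf{1}_{m_{k}}^{\top}\mathbf{s}_{k}
  &= \mathbf{1}_{m_{k}}^{\top}\sigma_{k}(\boldsymbol{\eta})
   = \sum_{i_{1},\ldots,i_{d}} b_{i_{1},\ldots,i_{d}}\,
     e^{\eta_{1,i_{1}}+\cdots+\eta_{d,i_{d}}}
   && \text{(independent of $k$)},\\
\lambda_{1}=\cdots=\lambda_{d}
  &=: b>0 \quad\text{by (3.4)},
  \qquad
  \mathcal{A}=b^{-1}\mathcal{B}(\boldsymbol{\eta})
  \text{ has $k$-slice sums } \mathbf{s}_{k}.
\end{align*}
```

The factor $b^{-1}$ can be absorbed into the scaling by subtracting $\log b$ from every coordinate of $\boldsymbol{\eta}_{1}$, so $\mathcal{A}$ is positive diagonally equivalent to $\mathcal{B}$.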

Denote by $\mathbf{V}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ the subspace of all vectors $(\mathbf{x}_{1},\ldots,\mathbf{x}_{d})$ satisfying condition (1) of Theorem 3.1. Clearly, for each $\mathbf{x}\in\mathbb{R}^{n+d}$ the function $\hat{f}$ has a constant value $\hat{f}(\mathbf{x})$ on the affine set $\mathbf{x}+\mathbf{V}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$. Let $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})=\mathbf{V}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})\cap\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$. Hence, if $\boldsymbol{\eta}\in\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ is a critical point of $\tilde{f}$ then any point in $\boldsymbol{\eta}+\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ is also a critical point of $\tilde{f}$. Denote by $\mathbf{V}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}\subset\mathbb{R}^{n+d}$ the orthogonal complement of $\mathbf{V}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ in $\mathbb{R}^{n+d}$, and by $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ the orthogonal complement of $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ in $\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$. In [8] we showed:

Lemma 3.4.

Let $\mathcal{B}=[b_{i_{1},i_{2},\ldots,i_{d}}]\in\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}$ , $(d\geq 2)$ , be a given nonnegative tensor with no $(k,i_{k})$ -zero slice. Let $\mathbf{s}_{k}\in\mathbb{R}_{+}^{m_{k}},k=1,\ldots,d$ be given positive vectors satisfying (3.4). Let $\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d}),\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d}),\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ be defined as above. Then the restriction of $\tilde{f}$ to $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ , denoted as $f$ , is strictly convex. That is, $H(f)$ has positive eigenvalues at each point of $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ .
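The reason for passing to $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ is that $\hat{f}$ is flat along $\mathbf{V}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$, so strict convexity can hold only after these directions are removed. The sketch below checks this invariance on a diagonal $2\times 2$ matrix, assuming that $\hat{f}$ is the exponential sum $\sum b_{i_{1},\ldots,i_{d}}e^{x_{1,i_{1}}+\ldots+x_{d,i_{d}}}$ from [8]. The function name `f_hat` and the sample vectors are hypothetical.

```python
import math


def f_hat(B, xs):
    """Exponential sum over the positive entries of B (assumed form of f-hat)."""
    return sum(b * math.exp(sum(x[i] for x, i in zip(xs, idx)))
               for idx, b in B.items())


def add(xs, zs):
    """Blockwise sum of two vectors (x_1, ..., x_d) and (z_1, ..., z_d)."""
    return [[a + b for a, b in zip(x, z)] for x, z in zip(xs, zs)]


B = {(0, 0): 2.0, (1, 1): 5.0}        # support: the two diagonal entries
x = [[0.3, -1.2], [0.7, 0.4]]
z_in_V = [[1.5, -0.8], [-1.5, 0.8]]   # z_1[i] + z_2[i] = 0 on the support
z_not_in_V = [[1.0, 0.0], [0.0, 0.0]]

same = math.isclose(f_hat(B, add(x, z_in_V)), f_hat(B, x), rel_tol=1e-12)
changed = not math.isclose(f_hat(B, add(x, z_not_in_V)), f_hat(B, x), rel_tol=1e-12)
print(same, changed)
```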

Theorem 3.5.

Let $\mathcal{B}=[b_{i_{1},i_{2},\ldots,i_{d}}]\in\mathbb{R}_{+}^{m_{1}\times\ldots\times m_{d}}$ , $(d\geq 2)$ , be a given nonnegative tensor with no $(k,i_{k})$ -zero slice. Let $\mathbf{s}_{k}\in\mathbb{R}_{+}^{m_{k}},k=1,\ldots,d$ be given positive vectors satisfying (3.4). Then the following conditions are equivalent.

(1) $\tilde{f}$ has a global minimum.

(2) $\tilde{f}$ has a critical point.

(3) $\lim f(\mathbf{x}_{l})=\infty$ for any sequence $\mathbf{x}_{l}\in\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ such that $\lim\|\mathbf{x}_{l}\|=\infty$.

(4) The only $\mathbf{x}=(\mathbf{x}_{1},\ldots,\mathbf{x}_{d})\in\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ that satisfies (3.5) is $\mathbf{x}=\mathbf{0}_{n}$.

4. The scaling algorithm for tensors

In this section we assume that a given $\mathcal{B}=[b_{i_{1},\ldots,i_{d}}]\in\mathbb{R}_{+}^{m_{1}\times\cdots\times m_{d}}$ satisfies one of the equivalent conditions of Theorem 3.5. Hence $f$ has a unique minimum point $\mathbf{x}^{\star}\in\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$. Let $\mathcal{B}(\mathbf{x})=[b_{i_{1},\ldots,i_{d}}e^{x_{1,i_{1}}+\cdots+x_{d,i_{d}}}]$. We now describe our algorithm for finding $\mathbf{x}^{\star}$.

We first consider the case where $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})=\{\mathbf{0}\}$. That is, the system of linear equations given by (3.6) and by condition (1) of Theorem 3.1 has only the trivial solution $\mathbf{x}_{1}=\cdots=\mathbf{x}_{d}=\mathbf{0}$.

This condition is satisfied if all the entries of $\mathcal{B}$ are positive. Indeed, assume that $\mathcal{B}>0$. Sum up the equations in condition (1) over $i_{2},\ldots,i_{d}$ to deduce that $\mathbf{x}_{1}=t_{1}\mathbf{1}_{m_{1}}$. Similarly, we deduce that $\mathbf{x}_{j}=t_{j}\mathbf{1}_{m_{j}}$ for all $j\in[d]$. Furthermore the $M=\prod_{j=1}^{d}m_{j}$ equations of (1) are equivalent to one equation: $t_{1}+\cdots+t_{d}=0$. The conditions (3.6) yield that $t_{1}=\cdots=t_{d}=0$.

In this case $\tilde{f}=f$ is a function defined on $\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$. We identify $\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ with $\mathbb{R}^{n}=\mathbb{R}^{m_{1}-1}\times\cdots\times\mathbb{R}^{m_{d}-1}$. Then our algorithm applies directly, as in the case $d=2$ described in Section 1: Fix $\mathbf{x}^{j}$ and find the unique $\tilde{\mathbf{x}}_{j}=\tilde{\mathbf{x}}_{j}(\mathbf{x}^{j})$ which satisfies the condition

[TABLE]

Let $\mathbf{x}_{j}(\mathbf{x}^{j})=\tilde{\mathbf{x}}_{j}-\frac{\mathbf{s}_{j}^{\top}\tilde{\mathbf{x}}_{j}}{\mathbf{s}_{j}^{\top}\mathbf{1}_{m_{j}}}\mathbf{1}_{m_{j}}$ . Clearly, $\mathbf{x}_{j}(\mathbf{x}^{j})\in\mathrm{L}(\mathbf{s}_{j})$ . Hence $\mathbf{x}_{j}(\mathbf{x}^{j})$ is the critical point of the strict convex function $g_{\mathbf{x}^{j}}(\mathbf{x}_{j})=f(\mathbf{x}^{j},\mathbf{x}_{j})$ on $\mathrm{L}(\mathbf{s}_{j})\equiv\mathbb{R}^{m_{j}-1}$ .

We can now apply Theorem 2.4. Our algorithm will converge to the unique minimal point $\mathbf{x}^{\star}\in\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$. The tensor $\mathcal{B}(\mathbf{x}^{\star})$ will have its $d$ slice sums of the form $b\mathbf{s}_{1},\ldots,b\mathbf{s}_{d}$ for some $b>0$.
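The procedure for the case $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})=\{\mathbf{0}\}$ can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the paper's implementation: it takes $f$ to be the exponential sum of [8], reads the partial-minimization condition as "the $j$-slice sums of $\mathcal{B}(\mathbf{x}^{j},\tilde{\mathbf{x}}_{j})$ equal $\mathbf{s}_{j}$", and measures $\|\nabla_{l}f\|$ by the component of the $l$-slice sum vector orthogonal to $\mathbf{s}_{l}$. All function names, the tolerance and the sample data are hypothetical.

```python
import math
from itertools import product


def slice_sums(B, xs, shape):
    """Slice sums of B(x) = [b * exp(x_1[i_1] + ... + x_d[i_d])]."""
    sums = [[0.0] * m for m in shape]
    for idx, b in B.items():
        v = b * math.exp(sum(x[i] for x, i in zip(xs, idx)))
        for k, i in enumerate(idx):
            sums[k][i] += v
    return sums


def projected_grad_norm(sigma, s):
    """Norm of the part of sigma orthogonal to s, i.e. the gradient on L(s)."""
    c = sum(a * b for a, b in zip(sigma, s)) / sum(b * b for b in s)
    return math.sqrt(sum((a - c * b) ** 2 for a, b in zip(sigma, s)))


def partial_min(B, xs, s, shape, j):
    """Match the j-slice sums to s_j, then shift the block back into L(s_j)."""
    sigma = slice_sums(B, xs, shape)[j]
    x_new = [xs[j][i] + math.log(s[j][i] / sigma[i]) for i in range(shape[j])]
    shift = sum(a * b for a, b in zip(s[j], x_new)) / sum(s[j])
    return [v - shift for v in x_new]


def scale(B, s, shape, tol=1e-10, max_iter=10000):
    xs = [[0.0] * m for m in shape]  # the origin lies in every L(s_k)
    for _ in range(max_iter):
        sigma = slice_sums(B, xs, shape)
        norms = [projected_grad_norm(sigma[k], s[k]) for k in range(len(shape))]
        j = max(range(len(shape)), key=norms.__getitem__)  # greedy block choice
        if norms[j] < tol:
            break
        xs[j] = partial_min(B, xs, s, shape, j)
    return xs


shape = (2, 2, 2)
B = {idx: 1.0 + idx[0] + 2 * idx[1] + 3 * idx[2]
     for idx in product(*map(range, shape))}      # positive tensor
s = [[1.0, 2.0], [1.5, 1.5], [0.5, 2.5]]           # each vector sums to 3
xs = scale(B, s, shape)
sigma = slice_sums(B, xs, shape)
b = sum(sigma[0]) / sum(s[0])                      # common factor in b * s_k
print(b, sigma)
```

Dividing $\mathcal{B}(\mathbf{x}^{\star})$ by the common factor $b$ then gives a tensor whose slice sums are exactly $\mathbf{s}_{1},\ldots,\mathbf{s}_{d}$.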

We now discuss the case where $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ is a nontrivial subspace of $\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ . In that case we claim that our algorithm applies with a suitable modification. First observe that

[TABLE]

Let

[TABLE]

be the orthogonal projection on $\mathbf{V}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ and $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ respectively. Then

[TABLE]

If $\mathbf{x}\in\mathbb{R}^{n+d}$ then $\mathbf{y}=P\mathbf{x}\in\mathbf{V}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ and $\mathbf{z}=(I-P)\mathbf{x}\in\mathbf{V}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$. If $\mathbf{x}\in\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ then $\mathbf{y}=P_{0}\mathbf{x}\in\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ and $\mathbf{z}=(I-P_{0})\mathbf{x}\in\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$.

Observe that $\hat{f}(\mathbf{x}+\mathbf{z})=\hat{f}(\mathbf{x})$ for $\mathbf{x}\in\mathbb{R}^{n+d}$ and $\mathbf{z}\in\mathbf{V}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$. Hence

[TABLE]

Similarly $\tilde{f}(\mathbf{x}+\mathbf{z})=\tilde{f}(\mathbf{x})$ for $\mathbf{x}\in\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ and $\mathbf{z}\in\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ . Furthermore

[TABLE]

(The simplest way to show these identities is by considering an orthonormal basis in $\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ consisting of vectors in orthonormal bases of $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ and $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$. Then change to a basis of $\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})=\mathrm{L}(\mathbf{s}_{1})\times\cdots\times\mathrm{L}(\mathbf{s}_{d})$ which is a union of orthonormal bases of $\mathrm{L}(\mathbf{s}_{j})$ for $j\in[d]$.)

Observe that for $\mathbf{x}\in\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ the gradient $\nabla\tilde{f}(\mathbf{x})$ is a subvector of $\nabla\hat{f}(\mathbf{x})$ , when we choose the corresponding coordinates in $\mathbb{R}^{m_{1}}\times\cdots\mathbb{R}^{m_{d}}$ . Similarly, for $\mathbf{x}\in\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ the gradient $\nabla f(\mathbf{x})$ is a subvector of $\nabla\tilde{f}(\mathbf{x})$ , if we choose the coordinates using the orthonormal bases in $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ and $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ respectively. Moreover, the coordinates of $\nabla\tilde{f}(\mathbf{x})$ corresponding to the chosen orthonormal basis in $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ are zero. Hence for $\mathbf{x}\in\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ the gradient $\nabla f(\mathbf{x})$ is obtained by deleting the zero coordinates of $\nabla\tilde{f}(\mathbf{x})$ corresponding to the chosen orthonormal basis of $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$ . In particular we have the equality

[TABLE]

Lemma 4.1.

Assume that a given $\mathcal{B}=[b_{i_{1},\ldots,i_{d}}]\in\mathbb{R}_{+}^{m_{1}\times\cdots\times m_{d}}$ satisfies the assumptions of Theorem 3.5 and one of its equivalent conditions. Suppose furthermore that $\dim\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})>0$. For $j\in[d]$ let $\mathbf{W}_{j}$ be the following subspace of $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$: $\{\mathbf{w}=P_{0}(\mathbf{0},\mathbf{x}_{j}),\mathbf{x}_{j}\in\mathrm{L}(\mathbf{s}_{j})\}$. Then

(1) The dimension of $\mathbf{W}_{j}$ is $m_{j}-1$ for $j\in[d]$.

(2) $\mathbf{W}_{1}+\cdots+\mathbf{W}_{d}=\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$. Furthermore, $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ is not a direct sum of $\mathbf{W}_{1},\ldots,\mathbf{W}_{d}$.

(3) Let $\mathbf{x}\in\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ and $j\in[d]$. Choose an orthonormal basis in $\mathbf{W}_{j}$ and denote by $\nabla_{\mathbf{W}_{j}}f(\mathbf{x})$ the gradient of $f$ with respect to the chosen orthonormal basis of the subspace $\mathbf{W}_{j}$. Then

[TABLE]

Proof.

(1) In view of the assumption (3.1) it follows that $\dim\mathrm{L}(\mathbf{s}_{j})=m_{j}-1\geq 1$. Assume to the contrary that $\dim\mathbf{W}_{j}<m_{j}-1$. Then there exists $\mathbf{x}_{j}\in\mathrm{L}(\mathbf{s}_{j})\setminus\{\mathbf{0}\}$ such that $P_{0}(\mathbf{0},\mathbf{x}_{j})=\mathbf{0}$. Use the first equality of (4.1) to deduce that $\tilde{f}((\mathbf{0},t\mathbf{x}_{j}))=f(P_{0}(\mathbf{0},t\mathbf{x}_{j}))=f(\mathbf{0})$ for each $t\in\mathbb{R}$. As $\mathcal{B}$ is a nonnegative tensor with no $(k,i_{k})$-zero slice it follows that

[TABLE]

As $\mathbf{x}_{j}\neq\mathbf{0}$ the above function of $t$ can’t be a constant function. Hence $\dim\mathbf{W}_{j}=m_{j}-1$ .

(2) Let $\mathbf{x}=(\mathbf{x}_{1},\ldots,\mathbf{x}_{d})\in\mathrm{L}(\mathbf{s}_{1})\times\cdots\times\mathrm{L}(\mathbf{s}_{d})$ . Then $\mathbf{x}=\sum_{j=1}^{d}(\mathbf{0},\mathbf{x}_{j})$ . Hence $P_{0}\mathbf{x}=\sum_{j=1}^{d}P_{0}(\mathbf{0},\mathbf{x}_{j})$ . Therefore $\mathbf{W}_{1}+\cdots+\mathbf{W}_{d}=\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ . Clearly

[TABLE]

Hence $\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ is not a direct sum of $\mathbf{W}_{1},\ldots,\mathbf{W}_{d}$ .

(3) If $\nabla_{j}\tilde{f}(\mathbf{x})=\mathbf{0}$ the inequality (4.3) trivially holds. Assume that $\nabla_{j}\tilde{f}(\mathbf{x})\neq\mathbf{0}$ . Let $\mathbf{w}_{j}=\frac{1}{\|\nabla_{j}\tilde{f}(\mathbf{x})\|}\nabla_{j}\tilde{f}(\mathbf{x})$ . Then $\|\nabla_{j}\tilde{f}(\mathbf{x})\|=\nabla\tilde{f}(\mathbf{x})^{\top}(\mathbf{0},\mathbf{w}_{j})$ . Let

[TABLE]

As $\nabla\tilde{f}(\mathbf{x})^{\top}\mathbf{v}_{j}=0$ we deduce that

[TABLE]

Use (4.2) and the above inequalities to deduce (4.3). ∎

We now give the modified algorithm:

Modified algorithm

Choose $\mathbf{x}_{0}\in\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})^{\perp}$ .
for $k:=0,1,2,\ldots$
$j\in\operatorname{arg\,max}\{\|\nabla_{\mathbf{W}_{l}}f(\mathbf{x}_{k})\|,l\in[d]\}$
$\mathbf{x}_{k+1}=P_{0}(\mathbf{x}_{k}^{j},\mathbf{x}_{j}(\mathbf{x}_{k}^{j}))$
end

We explain and justify the modified algorithm. View $\mathbf{x}_{0}$ as a point in $\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$. Then $\mathbf{x}_{j}(\mathbf{x}^{j}_{0})$ is the critical point of the strict convex function $g_{\mathbf{x}^{j}_{0}}(\mathbf{x}_{j})=\tilde{f}(\mathbf{x}^{j}_{0},\mathbf{x}_{j}),\mathbf{x}_{j}\in\mathrm{L}(\mathbf{s}_{j})$, as in the beginning of this section. Let $\mathbf{x}_{1}^{\prime}=\mathbf{x}_{0}+(\mathbf{0},\mathbf{x}_{j}(\mathbf{x}_{0}^{j})-\mathbf{x}_{j,0})$. Clearly $\mathbf{x}_{1}^{\prime}\in\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d})$. Note that $\nabla_{j}\tilde{f}(\mathbf{x}_{1}^{\prime})=\mathbf{0}$. Hence $\tilde{f}(\mathbf{x}_{1}^{\prime}+(\mathbf{0},\mathbf{x}_{j}))\geq\tilde{f}(\mathbf{x}_{1}^{\prime})$ for each $\mathbf{x}_{j}\in\mathrm{L}(\mathbf{s}_{j})$. Let $\mathbf{x}_{1}=P_{0}\mathbf{x}_{1}^{\prime}=\mathbf{x}_{0}+P_{0}(\mathbf{0},\mathbf{x}_{j}(\mathbf{x}_{0}^{j})-\mathbf{x}_{j,0})$. The first equality of (4.1) yields:

[TABLE]

Hence $\nabla_{\mathbf{W}_{j}}f(\mathbf{x}_{1})=\mathbf{0}$. Therefore $\mathbf{x}_{1}$ is the minimum of $f$ on the affine space $\mathbf{x}_{0}+\mathbf{W}_{j}$. Inequality (4.3) yields that $\|\nabla_{\mathbf{W}_{j}}f(\mathbf{x}_{0})\|\geq\frac{\|\nabla f(\mathbf{x}_{0})\|}{\sqrt{d}}$, as in the case of the original algorithm. As $\nabla_{\mathbf{W}_{j}}f(\mathbf{x}_{1})=\mathbf{0}$ we deduce from (4.3) that $\|\nabla_{\mathbf{W}_{j}}f(\mathbf{x}_{k})\|\geq\frac{\|\nabla f(\mathbf{x}_{k})\|}{\sqrt{d-1}}$ for $k=1$, where $j$ now denotes the index chosen by the algorithm at step $k$. The same inequality holds for all $k\geq 1$. Hence Theorem 2.4 applies in this case too.

5. A generalization of discrete Schrödinger’s bridge problem

The classical Schrödinger bridge problem, studied by Schrödinger in [17, 18], seeks the most likely probability law for a diffusion process, in path space, that matches marginals at two end points in time. The discrete version of Schrödinger’s bridge problem for Markov chains can be stated as follows [10, 9]:

Problem 5.1.

Let $A\in\mathbb{R}_{+}^{n\times n}$ be a column stochastic matrix. Assume that $\mathbf{a},\mathbf{b}$ are two positive probability vectors. Does there exist a scaling of $A$, denoted as $B$, such that $B$ is column stochastic and $B\mathbf{a}=\mathbf{b}$?

We give a necessary and sufficient condition for a solution to the generalized Schrödinger’s bridge problem:

Theorem 5.2.

Let $A\in\mathbb{R}^{m\times n}_{+}$ be a given matrix. Let $\mathbf{b}\in\mathbb{R}^{m},\mathbf{a},\mathbf{c}\in\mathbb{R}^{n}$ be given positive vectors that satisfy $\mathbf{c}^{\top}\mathbf{a}=\mathbf{1}_{m}^{\top}\mathbf{b}$. Then there exists a scaling of $A$, denoted as $B$, such that

[TABLE]

if and only if the following condition holds: There exists $C\in\mathbb{R}^{m\times n}_{+}$ with the same zero pattern as $A$ that satisfies (5.1). If this condition holds then $B$ is unique and can be found by the modified algorithm.

Proof.

Assume that $\mathbf{a}=(a_{1},\ldots,a_{n})^{\top},\mathbf{c}=(c_{1},\ldots,c_{n})^{\top}$. Denote by $D(\mathbf{a})\in\mathbb{R}^{n\times n}$ the diagonal matrix whose diagonal entries are the coordinates of $\mathbf{a}$. Let $\tilde{A}=AD(\mathbf{a})$ and consider the scaling $D_{1}\tilde{A}D_{2}$ of $\tilde{A}$ with the row sum $\mathbf{b}$ and column sum $\mathbf{c}\circ\mathbf{a}=(c_{1}a_{1},\ldots,c_{n}a_{n})^{\top}$. Note that the condition $\mathbf{1}_{n}^{\top}(\mathbf{c}\circ\mathbf{a})=\mathbf{1}_{m}^{\top}\mathbf{b}$ is the condition $\mathbf{c}^{\top}\mathbf{a}=\mathbf{1}_{m}^{\top}\mathbf{b}$. Next observe that this scaling of $\tilde{A}$ is equivalent to the scaling $B=D_{1}AD_{2}$ of $A$ which satisfies (5.1). The result of [14] yields that $B$ exists if and only if there exists $C\in\mathbb{R}^{m\times n}_{+}$ with the same zero pattern as $A$ that satisfies (5.1). Use the modified algorithm to find the scaling of $\tilde{A}$. ∎
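The reduction in the proof can be tried out numerically. The sketch below assumes that (5.1) reads $B\mathbf{a}=\mathbf{b}$ and $B^{\top}\mathbf{1}_{m}=\mathbf{c}$, which is what the reduction to row sums $\mathbf{b}$ and column sums $\mathbf{c}\circ\mathbf{a}$ of $\tilde{A}$ suggests. It uses the plain alternating (Sinkhorn) iteration in place of the modified algorithm, and it assumes that a scaling exists. The function name `bridge_scaling`, the iteration count and the sample data are hypothetical.

```python
def bridge_scaling(A, a, b, c, iters=500):
    """Scale A_tilde = A D(a) to row sums b and column sums c∘a; return B = D1 A D2."""
    m, n = len(A), len(a)
    At = [[A[i][j] * a[j] for j in range(n)] for i in range(m)]
    r = [1.0] * m  # diagonal of D1
    q = [1.0] * n  # diagonal of D2
    for _ in range(iters):
        # Row step: make the row sums of D1 A_tilde D2 equal to b.
        r = [b[i] / sum(At[i][j] * q[j] for j in range(n)) for i in range(m)]
        # Column step: make the column sums equal to c_j * a_j.
        q = [c[j] * a[j] / sum(r[i] * At[i][j] for i in range(m)) for j in range(n)]
    return [[r[i] * A[i][j] * q[j] for j in range(n)] for i in range(m)]


A = [[0.5, 0.2], [0.5, 0.8]]  # column stochastic, as in Problem 5.1
a = [0.3, 0.7]
b = [0.6, 0.4]
c = [1.0, 1.0]                # c = 1 asks for a column stochastic B
B = bridge_scaling(A, a, b, c)
Ba = [sum(B[i][j] * a[j] for j in range(2)) for i in range(2)]
col_sums = [sum(B[i][j] for i in range(2)) for j in range(2)]
print(Ba, col_sums)
```

Since $D(\mathbf{a})$ and $D_{2}$ commute, $D_{1}\tilde{A}D_{2}=BD(\mathbf{a})$, so the row sums of the scaled $\tilde{A}$ are $B\mathbf{a}$ and its column sums are $\mathbf{a}\circ(B^{\top}\mathbf{1}_{m})$.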

Bibliography

[1] Z. Allen-Zhu, Y. Li, R. Oliveira and A. Wigderson, Much faster algorithms for matrix scaling, arXiv:1704.02315.
[2] R.B. Bapat, $D_{1}AD_{2}$ theorems for multidimensional matrices, Linear Algebra Appl. 48 (1982), 437–442.
[3] R.B. Bapat and T.E.S. Raghavan, An extension of a theorem of Darroch and Ratcliff in loglinear models and its application to scaling multidimensional matrices, Linear Algebra Appl. 114/115 (1989), 705–715.
[4] R.A. Brualdi, Convex sets of nonnegative matrices, Canad. J. Math. 20 (1968), 144–157.
[5] R.A. Brualdi, S.V. Parter and H. Schneider, The diagonal equivalence of a nonnegative matrix to a stochastic matrix, J. Math. Anal. Appl. 16 (1966), 31–50.
[6] S. Bubeck, Y.T. Lee and M. Singh, A geometric alternative to Nesterov’s accelerated gradient descent, arXiv:1506.08187.
[7] J. Franklin and J. Lorenz, On the scaling of multidimensional matrices, Linear Algebra Appl. 114/115 (1989), 717–735.
[8] S. Friedland, Positive diagonal scaling of a nonnegative tensor to one with prescribed slice sums, Linear Algebra Appl. 434 (2011), 1615–1619.
