This paper introduces a partial minimization algorithm for strict convex functions, demonstrating geometric convergence and connecting it to classical methods like conjugate gradient and Sinkhorn scaling for matrices and tensors.
Contribution
It presents a novel partial minimization algorithm with proven convergence for convex functions and links tensor and matrix scaling to partial minimization of log-convex functions.
Findings
01
Algorithm converges geometrically to the unique minimum.
02
Connection established between Sinkhorn scaling and partial minimization.
03
Generalization of conjugate gradient method for quadratic polynomials.
Abstract
Assume that f is a strict convex function with a unique minimum in R^n. We divide the vector of n-variables to d groups of vector subvariables with d at least two. We assume that we can find the partial minimum of f with respect to each vector subvariable while other variables are fixed. We then describe an algorithm that partially minimizes each time on a specifically chosen vector subvariable. This algorithm converges geometrically to the unique minimum. The rate of convergence depends on the uniform bounds on the eigenvalues of the Hessian of f in the compact sublevel set f whose values are at most f(x_0), where x_0 is the starting point of the algorithm. In the case where f is a polynomial of degree two, with positive definite quadratic term, and d=n our method can be considered as a generalization of the classical conjugate gradient method. The main result of this paper is the…
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSparse and Compressive Sensing Techniques · Advanced Optimization Algorithms Research · Tensor decomposition and applications
Full text
Partial minimization of strict convex functions and tensor scaling
Shmuel Friedland
Department of Mathematics and Computer Science, University of Illinois at Chicago, Chicago, Illinois, 60607-7045, USA
Assume that f∈C2(Rn) is a strict convex function with a unique minimum. We divide the vector of n variables to d≥2 groups of vector subvariables. We assume that we can find the partial minimum of f with respect to each vector subvariable while other variables are fixed. We then describe an algorithm that partially minimizes each time on a specifically chosen vector subvariable. This algorithm converges geometrically to the unique minimum. The rate of convergence depends on the uniform bounds on the eigenvalues of the Hessian of f in the compact sublevel set f(x)≤f(x0), where x0 is the starting point of the algorithm. In the case where f(x)=x⊤Ax+b⊤x and d=n our method can be considered as a generalization of the classical conjugate gradient method.
The main result of this paper is the observation that the celebrated Sinkhorn diagonal scaling algorithm for matrices, and the corresponding diagonal scaling of tensors, can be viewed as partial minimization of certain logconvex functions.
Let f∈C2(Rn) is a strict convex function, that is, the Hessian H(f)(x) is positive definite for each x∈Rn. We assume that f has a minimum at x⋆∈Rn, which is necessary unique. It is well known that a necessary and sufficient condition for the existence of x⋆ is:
[TABLE]
See Lemma 2.1. We now recall the notion of partial minimization of f.
For m∈N denote [m]={1,…,m}⊂N.
Divide the vector x=(x1,…,xn)⊤ to d≥2 groups: x⊤=(x1⊤,…,xd⊤), where xi∈Rmi for i∈[d] and ∑i=1pmi=n. (Thus d∈[n]∖{1}.) View x as (xj,xj) where xj∈Rn−mj is obtained from x by deleting the vector coordinate xj. Denote by ∇jf(x)∈Rmj the vector of derivatives of f(x) with respect to the coordinates in xj.
Minimize f(x) with respect to the variable xj while keeping all other variable fixed:
[TABLE]
Our main assumption is that we can find xj(xj) either precisely, or with a prescribed accuracy. This assumption holds if f is a polynomial of degree 2f(x)=x⊤Ax+b⊤x+c, where A is a symmetric positive definite matrix and d=n. This is the classical case of the conjugate gradient [11].
The main point of this paper is to show that this assumption holds if we consider the classical scaling algorithm of Sinkhorn [19], or more general tensor scaling problem [2, 16, 7, 8]. Matrix scaling problems arise in several areas of applied and pure mathematics. There are many available algorithms to achieve the scaling. See [1] for a historical survey and for new suggested algorithms. The main purpose of this paper to show that matrix and tensor scaling could be efficiently implemented using our simple algorithm which ensures geometric convergence. While for matrices our algorithm reduces to alternating scaling, for tensors the algorithm chooses the order of scaling.
We now state briefly our algorithm:
Algorithm
Choose x0∈Rn.
for k:=0,1,2,…
j∈argmax{∥∇lf(xk)∥,l∈[d]}
xk+1=(xkj,xj(xkj))
end
We show that this algorithm converges geometrically to x⋆ with at least a factor (1−d−1βα), where α and β are the minimum and the maximum of the lowest and highest eigenvalues of H(f) respectively in the compact convex sublevel region {x,f(x)≤f(x0)}.
Note that if d=2, i.e., x=(x1,x2), then after one iteration the above minimization algorithm is an alternating minimization, as in the Sinkhorn algorithm. Instead of using the standard coordinates x=(x1,…,xn)⊤ we can use the coordinates x^=Px, where the n rows of P: p1⊤,…,pn⊤ are linearly independent.
In the conjugate gradient algorithm we need to choose the vectors p1,…,pn to be orthogonal with respect to A: pi⊤Apj=0 for i=j [11].
We now explain briefly why Sinkhorn scaling algorithm for matrices can be stated as a partial minimization of strict convex function. For simplicity of exposition ourselves mainly to positive rectangular matrices B=[bi,j]∈Rl×m. For u=(u1,…,ul)⊤∈Rl we denote by D(u)∈Rl×l the diagonal matrix with the diagonal entries eu1,…,eul. Let 1n=(1,…,1)⊤∈Rn and assume r=(r1,…,rl)⊤,c=(c1,…,cm)⊤ are given positive vectors satisfying 1l⊤r=1m⊤c. The scaling problem is finding u,v such that the matrix D(u)BD(v) has rows and column sums r and c respectively:
[TABLE]
for some u∈Rl,v∈Rm.
Clearly, this problem is equivalent to the scaling problem when we replace r,c with br,bc for some positive b>0. For a given nonzero vector w∈Rn denote by L(w)={x∈Rn,w⊤x=0}. The the dimension of L(w) is n−1 and we identify L(w) with Rn−1. Let
[TABLE]
Clearly, f(x),x=(u,v) is a convex function on Rl+m. We consider the restriction of f to L(r)×L(c). Since B>0 it follows that f(x) is strictly convex on L(r)×L(c) and the condition (1.1) holds, see Section 4. Let x⋆=(u⋆,v⋆)∈L(r)×L(c) be the minimum point of f∣L(r)×L(c). Use Lagrange multipliers to deduce that D(u⋆)BD(v⋆) has row and column sums br,bc for some b>0.
Fix v∈L(c) and find partial minimum of min{f(u,v),u∈L(r)}. Use Lagrange multipliers to deduce that this minimum is achieved at unique u(v) such that the row sums of D(u(v))BD(v) are of the form br.
We now give a simple formula for v. Observe first that the equality D(u)BD(v)1m=r is uniquely solvable by
u~i=logri−log(BD(v)1m)i for i∈[l]. Let u~(v)=(u~1,…,u~l)⊤. Note that u~(v) is the scaling part of Sinkhorn algorithm.
Then u(v)=u~(v)−a1l, where a=r⊤u~(v)/(r⊤1l).
Similarly, for a fixed u∈L(r) the minimum of f(u,v) for v∈L(c) is achieved for unique v(u) which can be obtained as follows. First by use Sinkhorn scaling to D(u)BD(v~(u)) to have the column sum c. Second let v(u)=v~(u)−(c⊤v~(u)/c⊤1m)1m.
Since d=2 the partial minimization algorithm is completely equivalent to Sinkhorn minimization algorithm. The geometric rate of convergence depends on the estimates of the eigenvalues of Hessian on the sublevel set f(x)≤f(x0) in L(r)×L(c).
In the case where B has some zero entires then the scaling problem is solvable if and only if there exist a nonnegative matrix C=[ci,j]∈Rl×m with the same [math] pattern as B, (bi,j=0⟺ci,j=0), and with the row and column sums r,c [14].
The existence of such C is a linear programming problem that can be solved in polynomial time [12, 13, 8]. If B can be scaled, it is possible to convert the scaling problem to partial minimization of f(x) on a corresponding subspace of L⊂L(r)×L(c).
We now summarize the contents of the paper. In Section 2 we show that our algorithm converges geometrically to x⋆: the unique minimum point of f. Denote by V(t)={x∈Rn,f(x)≤t}
the compact convex sublevel set corresponding to t≥t⋆=f(x⋆).
Let 0<α(t)≤β(t) be the minimum and the maximum of the smallest and the biggest eigenvalues of the Hessian H(f) in V(t). Let κ(t)=α(t)β(t). Set tk=f(xk), where xk are given by our algorithm. Then tk is a strictly decreasing sequence which converges to t⋆, unless the algorithm reaches x⋆ in a finite number of steps .
Theorem 2.4 shows that the rate of convergence of xk to x⋆ and tk to t⋆ is at least of order (1−(d−1)κ(t0)1)k−1. More precisely,
[TABLE]
In Section 3 we recall our results on tensor scaling [8]. Assume that B=[bi1,…,id]∈Rm1×…×Rmd is a given nonnegative d-mode tensor. Let x=(x1,…,xd), where xj=(xj,1,…,xd,mj)⊤∈Rmj. A scaling of B is the tensor B(x)=[ex1,i1+…+xd,idbi1,…,id]. Let sj be positive probability vectors in Rmj for j∈[d]. Then the scaling problem is to find x such that the j-th slice sum of B(x), obtained by summing on the indices i1,…,ij−1,ij+1,…,id, is sj for each j∈[d]. If B is positive then
such scaling exists. If B has zero entries then such scaling exists if and only if there exists a nonnegative tensor C with the same [math] pattern as B and with the sum slices s1,…,sd [3, 7, 8]. We show that if scaling of B exists then it can be achieved by finding the minimum of the strict convex function f on a subspace L⊂L(s1)×…×L(sd).
In Section 4 we discuss the application of our algorithm to tensor scaling. In the case where B positive, or more general, where the strict convex function
f is defined on the whole L(s1)×…×L(sd), our algorithm applies straightforward. For matrices, d=2 it is exactly the Sinkhorn scaling algorithm, which was explained above. In the case of tensors, d≥3, the algorithm chooses each time the scaling slice. In the case where f is strictly convex on a subspace L⊂L(s1)×…×L(sd), we describe a simple modification of our algorithm and justify its geometric convergence.
In Section 5 we show that our algorithm applies also to a generalized discrete Schrödinger’s bridge problem. (The discrete Schrödinger’s bridge problem is a scaling of a given column stochastic matrix to another column stochastic matrix B so that Ba=b, where a,b are two given positive probabiitiy vectors [10, 9].)
2. The convergence of the algorithm
Lemma 2.1**.**
Let f∈C2(Rn) be strictly convex. Then the following conditions are equivalent:
(1)
The function f has a unique minimum x⋆∈Rn.
2. (2)
The condition (1.1) holds.
Proof.
(1)⇒(2). Let Sn−1 be the n−1 dimensional sphere ∥y−x⋆∥=1.
Fix y∈Sn−1. Consider the strict convex function in one variable: gy(t)=f(x⋆+t(y−x⋆)). Then gy′(0)=0 and gy′(1)=∇f(y)⊤(y−x⋆)>0. Let ν=min{gy′(1),y∈Sn−1}. Clearly, ν>0. As gy′(t) increases for t>0 it follows that gy′(t)≥gy′(1) for t≥1. In particular,
[TABLE]
Hence f(x)≥f(x⋆)+ν(∥x∥−1) if ∥x−x⋆∥≥1. This inequality yields (1.1).
(2)⇒(1) Fix x0∈Rn. Then there exists r>0 such that min{f(x),∥x−x0∥=r}>f(x0). Let min{f(x),∥x−x0∥≤r}=f(x⋆). Clearly, ∥x⋆−x0∥<r. Therefore ∇f(x⋆)=0. As f(x) is convex we deduce that f(x)≥f(x0) for each x∈Rn. As f(x) is strictly convex x⋆ is the unique point of minimum of f.
∎
Note that the function f(x)=ex,x∈R is strictly convex on R but f(x) does not have a minimum on R.
In what follows we assume that f∈C2(Rn) is strictly convex and x⋆ is the unique minimum point of f. Then for each x∈Rn∖x⋆ the sublevel set
[TABLE]
is a compact strictly convex set, with a C2 boundary ∂V(t), with an interior containing x⋆. Let t⋆=f(x⋆). Then V(t⋆)={x⋆}.
Thus Rn∖{x⋆} is parametrized by ∂V(t),t>t⋆.
Fix t0=f(x0)>t⋆. Then f is uniformly strictly convex in V(t0): The eigenvalues of H(f)(x),x∈V(t0) are in a fixed interval [α(t0),β(t0)] for some 0<α(t0)≤β(t0).
Thus for each x,y∈V(t0) we have the inequalities:
[TABLE]
In particular, for x∈V(t0) we have
[TABLE]
Denote by B(x,R2) the closed ball {y∈Rn,∥x−y∥2≤R2}.
Let κ(t0)=α(t0)β(t0) and define
[TABLE]
In what follows we need the following lemma:
Lemma 2.2**.**
Assume that x∈V(t0). Let
[TABLE]
Then
(1)
f(xa)≤f(x).
2. (2)
[x,xa]⊂V(t0).
3. (3)
f(x)−f(x⋆)≥f(x)−f(x+)≥2β(t0)∥∇f(x)∥2.
Proof.
(1) If ∇f(x)=0, i.e., x=x⋆ the (1) trivially holds. Suppose that ∇f(x)=0 and assume to the contrary that f(xa)>f(x). Let h(t)=f(x−t∇f(x)). Then h′(0)=−∥∇f(x)∥2. Recall that h(t) is a strict convex function. Hence there exists t1∈(0,β(t0)2) such that h′(t1)=0 and h′(t)>0 for t>t1. Thus there exists t2∈(t1,β(t0)2) such that f(y)=f(x) for y=x−t2∇f(x)). Note that y∈V(t0). This contradicts the inequality (2.2).
(2) As f(xa)≤f(x)≤t0 the convexity of f yields that the interval [x,xa] is in V(t0).
(3) Clearly x+=21(x+xa)∈[x,xa]. Hence
[TABLE]
Therefore (3) holds.
∎
We now bring the following simple lemma which is basically in [6]:
Lemma 2.3**.**
Assume that f∈C2(Rn) is strictly convex and x⋆ is the unique minimum point of f. Fix x∈V(t0) and assume that x⋆∈B(x,R02). Then we can choose R0=R(x) and the following conditions hold:
[TABLE]
Proof.
As x∈V(t0) the left hand side of (2.3) yields that x⋆∈B(x,R(x)2), where R(x)2 is given by (2.7). Clearly
This proves the first part of (2.8). Hence the inequality in (2.7) holds.
Use part (3) of Lemma 2.2
to replace f(x) in the first part of (2.8) by a smaller quantity f(x+)+2β(t0)∥∇f(x)∥2 to obtain the second part of (2.8).
We now show that in our algorithm the sequences xk,f(xk),k∈N converge geometrically to x⋆,f(x⋆) respectively:
Theorem 2.4**.**
Assume that f∈C2(Rn) is a strict convex function which has a unique minimum point x⋆. Let x0∈Rn and xk,k∈N be given by our algorithm. Set tk=f(xk) for k∈Z+. Assume that the eigenvalues of H(f)(x),x∈V(tk) are in the minimal interval [α(tk),β(tk)], where 0<α(tk)≤β(tk). Denote κ(tk)=α(tk)β(tk).
(1)
If xk−1=x⋆ for some k∈N then tk−1>tk.
2. (2)
The sequences {tk},{β(tk)},{−α(tk)},{κ(tk)},k∈Z+ are nonincreasing sequences which converge to t⋆,β(t⋆),−α(t⋆),κ(t⋆) respectively.
3. (3)
For each k∈N the following inequalities hold:
[TABLE]
Proof.
Note
[TABLE]
(1) Clearly if xk−1=x⋆ then xp=x⋆ for p≥k.
Assume that xk−1=x⋆. Then ∥∇f(xk−1)∥>0.
Let jk−1∈argmax{∥∇lf(xk−1)∥,l∈[d]}. Then ∥∇jk−1f(xk−1)∥>0. Hence tk−1>tk.
(2) As {tk},k∈Z+ is a nonincreasing sequence we deduce that V(tk)⊆V(tk−1) for k∈N. Hence the sequence {α(tk)},k∈N is a nonincreasing, and the sequences {β(tk)},k∈N and {κ(tk)},k∈N
are nondecreasing. The equality limk→∞tk=t⋆ follows from (2.11). The inequality (2.13) yields
[TABLE]
(3) First we show the inequality (2.11) for k=1. Assume that j0∈argmax{∥∇lf(x0)∥,l∈[d]}. Hence ∥∇j0f(x0)∥≥d∥∇f(x0)∥. Let g(xj0)=f(x0j0,xj0), where x0=(x0j0,xj0,0). Thus g is a strictly convex function, whose Hessian is a submatrix of the Hessian of f.
Hence the eigenvalues of the Hessian of g are also in the interval [α(t0),β(t0)]. Recall that argming=xj0⋆=xj(x0j0). Then x1=(x0j0,xj0⋆). We now estimate from below g(xj0,0)−g(xj0⋆). The lower bound (3) of Lemma (2.2) yields:
[TABLE]
The inequality (2.7) yields f(x0)−f(x⋆)≤2α(t0)∥∇f(x0)∥2. Assuming that f(x0)>f(x⋆) we obtain
[TABLE]
This proves the first inequality in (2.11) for k=1.
Assume now that k=2. The definition of x1 yields that ∇j0f(x1)=0.
Hence max{∥∇lf(x1)∥,l∈[d]}≥d−1∥∇f(x1)∥.
Use the same arguments as above to show that f(x2)−f(x⋆)≤(f(x1)−f(x⋆))(1−(d−1)κ(t1)1). Hence (2.11) holds for k=2. Similarly, the inequality (2.11) holds for each k≥2.
Use the inequality (2.7) to deduce the inequality in (2.12).
As κ(tk)≤κ(t0) for each k∈N we deduce the inequality below (2.12).
According to Lemma 2.3x⋆∈B(xk,R2(xk)). Use (2.12) to deduce (2.13).
∎
Observe that our algorithm is an alternating algorithm for d=2 after the first step.
3. The tensor scaling problem
In this section we first recall briefly the results in [8] that we need.
For positive integers d,m1,…,md
denote by Rm1×…×md the linear space d-mode tensors
A=[ai1,i2,…,id],ij∈[mj],j∈[d]. Note that a 1-mode tensor is a vector, and a 2-mode tensor is a matrix.
Assume that d≥2. For a fixed ik∈[mk] the (d−1)-mode tensor [ai1,…,id],ij∈[mj,j∈[d]\{k} is called the (k,ik)slice of A.
For d=2 the (1,i) slice and the (2,j) slice are the i−th row and the j−th column of a given matrix. In the rest of the paper we assume:
[TABLE]
Let
[TABLE]
be the (k,ik)-slice sum. Denote
[TABLE]
the k-slice sum. Note that k-slice sums satisfy the compatibility conditions
[TABLE]
Two d-mode tensors A=[ai1,i2,…,id],B=[bi1,i2,…,id]∈Rm1×…×md
are called positive diagonally equivalent if there exist xk=(xk,1,…,xk,mk)⊤∈Rmk,k∈[d]
such that
ai1,…,id=ex1,i1+…+xd,idbi1,…,id for all ij∈[mj]
and j∈[d].
Denote by R+m1×…×md the cone of nonnegative,(entrywise), d-mode tensors.
We assume that B=[bi1,i2,…,id]∈R+m1×…×md is a given nonnegative tensor with no zero slice (k,ik).
Let sk∈R+mk,k∈[d] are given k positive vectors satisfying the conditions
(3.4).
Denote by R+m1×…×md(B,s1,…,sd) the set of all nonnegative
A=[ai1,i2,…,id]∈R+m1×…md having the same zero pattern as B, i.e. ai1,…,id=0⟺bi1,…,id=0
for all indices i1,…,id, and satisfying the condition (3.2).
We now recall the necessary and sufficient conditions on B so that
R+m1×…md(B,s1,…,sd)
contains a tensor A, which is positively diagonally equivalent to B. For matrices, i.e. d=2,
this problem was solved by
Menon [14] and Brualdi [4]. See also [15]. For the special case of positive
diagonal equivalence
to doubly stochastic matrices see [5] and [20].
The result of Menon was extended for tensors independently by Bapat-Raghavan [3] and
Franklin-Lorenz [7]. (See [2] and [16] for the special case where all the
entries of B are positive.) In [8] we gave necessary and sufficient
conditions for the solution of this problem:
Theorem 3.1**.**
Let B=[bi1,i2,…,id]∈R+m1×…×md,
(d≥2), be a
given nonnegative tensor with no (k,ik)-zero slice. Let sk∈R+mk,k=1,…,d be given
positive vectors satisfying (3.4).
Then there exists a nonnegative tensor A∈R+m1×…×md, which is positive
diagonally equivalent
to B and having each (k,ik)-slice sum equal to sk,ik, if and only the following conditions hold:
The system of the inequalities and equalities for xk=(xk,1,…,xk,mk)⊤∈Rmk,k=1,…,d,
[TABLE]
imply one of the following equivalent conditions
(1)
x1,i1+x2,i2+…+xd,id=0* if bi1,i2,…,id>0.*
2. (2)
∑bi1,i2,…,id>0x1,i1+x2,i2+…+xd,id=0.
In particular, there exists at most one tensor A∈R+m1×…×md with (k,ik)-slice
sum sk,ik for all k,ik, which is positive diagonally equivalent to B.
The above yields the following corollary.
Corollary 3.2**.**
Let B∈R+m1×…×md, (d≥2), be a
given nonnegative tensor with no (k,ik)-zero slice. Let sk∈R+mk,k=1,…,d be given
positive vectors. Then there exists a nonnegative tensor C∈R+m1×…×md,
which is positive diagonally equivalent
to B and each (k,ik)-sum slice equal to sk,ik, if and only if there exists a nonnegative tensor
A=[ai1,i2,…,id]∈R+m1×…×md,
having the same zero pattern as B, which satisfies (3.2).
For matrices, i.e. d=2, the above corollary is due Menon [14].
For d=3 this result is due to [3, Thm 3] and for d≥3 [7].
Brualdi in [4] gave a nice and simple characterization for the set of nonnegative matrices, with
prescribed zero pattern and with given positive row and column sums, to be not empty.
It is an open problem to find an analog of Brualdi’s results for d-mode tensors, where d≥3.
Note that the conditions of Theorem 3.1 are stated as a
linear programming problem. Hence the existence of a positive diagonally equivalent tensor A can
be determined in polynomial time [12, 13]. If such A exists, it is shown in [8] that A can be found
by computing the unique minimal point of certain strictly convex functions f.
Note that B>0 is always scalable as the tensor A=bs1⊗⋯⊗sd satisfies (3.2) for b=(1m1⊤s1)d−1.
Identify Rm1×Rm2×…×Rmd with Rn+d, where n+d=∑k=1dmk.
We view x∈Rn+d as a vector (x1⊤,…,xd⊤)⊤=(x1,…,xd), where xk∈Rmk,k∈[d].
Let ∥x∥:=x⊤x. Define
[TABLE]
Clearly, f^ is a convex function on Rn+d.
Denote by U(s1,…,sd)⊂Rn+d the subspace of vectors (x1,…,xd) satisfying the equalities (3.6). Thus U(s1,…,sd)=L(s1)×⋯×L(sd)≡Rn.
In [8] we showed the following lemma:
Lemma 3.3**.**
Let B=[bi1,i2,…,id]∈R+m1×…×md,
(d≥2), be a given nonnegative tensor with no (k,ik)-zero slice. Let sk∈R+mk,k=1,…,d
be given positive vectors satisfying (3.4). Then there exists a nonnegative tensor
A∈R+m1×…×md,
which is positive diagonally equivalent to B and having each (k,ik)-slice sum equal to sk,ik, if
and only the restriction of f^ to the subspace U(s1,…,sd), denoted as f~, has a critical point.
Denote by V(s1,…,sd) the subspace of all vectors
(x1,…,xd)
satisfying the condition 1 of Theorem 3.1.
Clearly, for each x∈Rn+d the function f^ has a constant value f^(x) on the affine set x+V(s1,…,sd). Let V0(s1,…,sd)=V(s1,…,sd)∩U(s1,…,sd) Hence, if \mbox{\boldmath{\eta}}\in\mathbf{U}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d}) is a critical point of f~
then any point in \mbox{\boldmath{\eta}}+\mathbf{V}_{0}(\mathbf{s}_{1},\ldots,\mathbf{s}_{d}) is also a critical of f~.
Denote by V(s1,…,sd)⊥⊂Rn+d the orthogonal complement of V(s1,…,sd) in Rn+d, and by
V0(s1,…,sd)⊥, the orthogonal complement of
V0(s1,…,sd) in U(s1,…,sd). In [8] we showed:
Lemma 3.4**.**
Let B=[bi1,i2,…,id]∈R+m1×…×md,
(d≥2), be a given nonnegative tensor with no (k,ik)-zero slice. Let sk∈R+mk,k=1,…,d
be given positive vectors satisfying (3.4). Let U(s1,…,sd),V0(s1,…,sd),V0(s1,…,sd)⊥ be defined as above.
Then the restriction of f~ to V0(s1,…,sd)⊥, denoted as f, is strictly convex. That is, H(f)
has positive eigenvalues at each point of V0(s1,…,sd)⊥.
Theorem 3.5**.**
Let B=[bi1,i2,…,id]∈R+m1×…×md,
(d≥2), be a given nonnegative tensor with no (k,ik)-zero slice. Let sk∈R+mk,k=1,…,d
be given positive vectors satisfying (3.4). Then the following conditions are equivalent.
(1)
f~* has a global minimum.*
2. (2)
f~* has a critical point.*
3. (3)
limf(xl)=∞* for any sequence xl∈V0(s1,…,sk)⊥
such that lim∥xl∥=∞.*
4. (4)
The only x=(x1,…,xd)∈V0(s1,…,sk)⊥
that satisfies (3.5) is x=0n.
4. The scaling algorithm for tensors
In this section we assume that a given B=[bi1,…,id]∈R+m1×⋯×R+md satisfies one of the equivalent conditions of Theorem 3.5. Let B(x)=[bi1,…,idex1,i1+⋯+xd,id].
Hence f has a unique minimum point x⋆∈V0(s1,…,sd)⊥.
We now describe our algorithm for finding x⋆.
We first consider the case where V0(s1,…,sd)={0}. That is, the system of linear equations given by (3.6) and by the conditions (1) of Theorem 3.5 has only the trivial solution x1=⋯=xd=0.
This condition is satisfied if all the entries of B are positive. Indeed, assume that B>0. Sum up the equations in condition (1) on i2,…,id to deduce that x1=t11m1. Similarly, we deduce that xj=tj1mj for all j∈[d]. Furthermore the M=∏j=1dmj equations of (1) are equivalent to one equaiton:
t1+⋯+td=0. The conditions (3.6) yield that t1=⋯=td=0.
In this case f~=f is a function defined on U(s1,…,sd). We identify
U(s1,…,sd) with Rn=Rm1−1×⋯Rmd−1.
Then our algorithm is applied straightforward as in the case d=2, which is described in Section 1:
Fix xj and find the unique x~j=x~j(xj) which satsfies the condition
[TABLE]
Let xj(xj)=x~j−sj⊤1mjsj⊤x~j1mj. Clearly,
xj(xj)∈L(sj). Hence xj(xj) is the critical point of the strict convex function gxj(xj)=f(xj,xj) on L(sj)≡Rmj−1.
We now can apply Theorem 2.4. Our algorithm will converge to a unique minimal point x⋆∈U(s1,…,sd). The tensor B(x⋆) will have its d sum slices of the form bs1,…,bsd for some b>0.
We now discuss the case where V0(s1,…,sd) is a nontrivial subspace of U(s1,…,sd). In that case we claim that our algorithm applies with a suitable modification. First observe that
[TABLE]
Let
[TABLE]
be the orthogonal projection on V(s1,…,sd)⊥ and V0(s1,…,sd)⊥ respectively. Then
[TABLE]
If x∈Rn then y=Px∈V(s1,…,sd)⊥,z=(I−P)x∈V(s1,…,sd). If x∈U(s1,…,sd) then y=P0x∈V0(s1,…,sd)⊥
and z=(I−P0)x∈V0(s1,…,sd).
Observe that f^(x+z)=f^(x) for x∈Rn and z∈V(s1,…,sd). Hence
[TABLE]
Similarly f~(x+z)=f~(x) for x∈U(s1,…,sd) and z∈V0(s1,…,sd). Furthermore
[TABLE]
(The simplest way to show these identities is by considering an orthonormal basis in in U(s1,…,sd) consisting of vectors in orthonormal bases of V0(s1,…,sd)⊤ and V0(s1,…,sd). Then change to a basis of U(s1,…,sd)=L(s1)×⋯×L(sd) which is a union of orthonormal bases of L(sj) for j∈[d].)
Observe that for x∈U(s1,…,sd) the gradient ∇f~(x) is a subvector of ∇f^(x), when we choose the corresponding coordinates in Rm1×⋯Rmd. Similarly, for x∈V0(s1,…,sd)⊥ the gradient ∇f(x) is a subvector of ∇f~(x), if we choose the coordinates using the orthonormal bases in V0(s1,…,sd)⊥ and V0(s1,…,sd) respectively. Moreover, the coordinates of ∇f~(x) corresponding to the chosen orthonormal basis in V0(s1,…,sd) are zero.
Hence for x∈V0(s1,…,sd)⊥ the gradient ∇f(x) is obtained by deleting the zero coordinates of ∇f~(x) corresponding to the chosen orthonormal basis of V0(s1,…,sd).
In particular we have the equality
[TABLE]
Lemma 4.1**.**
Assume that a given B=[bi1,…,id]∈R+m1×⋯×R+md satisfies the assumptions of Theorem 3.5 and one of its equivalent conditions .
Suppose furthermore that dimV0(s1,…,sd)>0. For j∈[d] let Wj be the following subspace of V0(s1,…,sd)⊥: {w=P0(0,xj),xj∈L(sj)}. Then
(1)
The dimension of Wj is mj−1 for j∈[d].
2. (2)
W1+⋯+Wd=V0(s1,…,sd)⊥. Furthermore, V0(s1,…,sd)⊥ is not a direct sum of W1,…,Wd.
3. (3)
Let x∈V0(s1,…,sd)⊥ and j∈[d]. Choose an orthonormal basis in Wj and denote by ∇fWj(x) the gradient of f with respect to the chosen orthonormal basis of the subspace Wj. Then
[TABLE]
Proof.
(1) In view of the assumption (3.1) it follows that dimL(sj)=mj−1≥1. Assume to the contrary that dimWj<mj−1, Then there exists xj∈L(sj)∖{0} such that P0(0,xj)=0. Use the first equality
of(4.1) to deduce that f~((0,txj))=f(P0(0,txj))=f(0) for each t∈R.
As B is a nonnegative tensor with no (k,ik)-zero slice it follows that
[TABLE]
As xj=0 the above function of t can’t be a constant function. Hence dimWj=mj−1.
(2) Let x=(x1,…,xd)∈L(s1)×⋯×L(sd). Then x=∑j=1d(0,xj). Hence P0x=∑j=1dP0(0,xj). Therefore W1+⋯+Wd=V0(s1,…,sd)⊥. Clearly
[TABLE]
Hence V0(s1,…,sd)⊥ is not a direct sum of W1,…,Wd.
(3) If ∇jf~(x)=0 the inequality (4.3) trivially holds. Assume that ∇jf~(x)=0. Let wj=∥∇jf~(x)∥1∇jf~(x).
Then ∥∇jf~(x)∥=∇f~(x)⊤(0,wj). Let
[TABLE]
As ∇f~(x)⊤vj=0 we deduce that
[TABLE]
Use (4.2) and the above inequalities to deduce (4.3).
∎
We now give the modified algorithm:
Modified algorithm
Choose x0∈V0(s1,…,sd)⊥.
for k:=0,1,2,…
j∈argmax{∥∇Wlf(xk)∥,l∈[d]}
xk+1=P0(xkj,xj(xkj))
end
We explain and justify the modified algorithm.
View x0 as a point in U(s1,…,sd).
Then xj(x0j) is a critical point of the strict convex function gx0j(xj)=f~(x0j,xj),xj∈L(sj) as in the beginning of this section. Let x1′=x0+(0,xj(x0j)−xj,0). Clearly x1′∈U(s1,…,sd). Note that ∇jf~(x1′)=0. Hence f~(x1+(0,xj))≥f(x1) for each xj∈L(sj).
Let x1=P0x1′=x0+P0(0,xj(x0j)−xj,0). The first equality of (4.1) yields:
[TABLE]
Hence ∇Wjf(x1)=0. Therefore x1 is the minimum of f on the affine space x0+Wj. Inequality (4.3) yields that ∥∇Wjf(x0)∥≥d∥f(x0)∥, as in the case of the original algorithm. As ∇Wjf(x1)=0 we deduce from (4.3) that ∥∇Wjf(xk)≥d−1∥f(xk)∥ for k=1. Same inequality holds for all k≥1. Hence Theorem 2.4 applies in this case too.
5. A generalization of discrete Schrödinger’s bridge problem
The classical Schrödinger bridge problem, studied by Schrödinger in [17, 18], seeks the most likely probability law for a diffusion process, in path space, that matches marginals at two end points in time.
The discrete version of Schrödinger’s bridge problem for Markov chains can be stated as follows [10, 9]:
Problem 5.1**.**
Let A∈R+n×n be a column stochastic matrix. Assume that a,b are two positive probability vectors. Does there exists a scaling of A, denoted as B, such that B is column stochastic and Ba=b?
We give a necessary and sufficient condition for a solution to generalized Schrödinger’s bridge problem:
Theorem 5.2**.**
Let A∈R+m×n be a given matrix.
Assume that b∈Rm,a,c∈Rn be given positive vectors that satisfy c⊤a=1m⊤b. Then there exists a scaling of A, denoted as B, such that
[TABLE]
if and only if the following conditions holds:
There exists C∈R+m×n with the same [math]-pattern as B that satisfies (5.1). If this condition holds then B is unique and can be found by the modified algorithm.
Proof.
Assume that a=(a1,…,an)⊤,c=(c1,…,cn)⊤.
Denote by D(a)∈Rn×n the diagonal matrix whose diagonal entries are the coordinates of a. Let A~=AD(a) and consider the scaling of B=D1A~D2 with the row sum b and column sum c∘a=(c1a1,…,cnan)⊤. Note that condition 1n⊤(c∘a)=1m⊤b is the condition c⊤a=1m⊤b. Next observe that this scaling of A~ is equivalent to the scaling of A which satisfies (5.1). The result of [14] yields that B exists if and only if there exists C∈R+m×n with the same [math]-pattern as B that satisfies (5.1). Use the modifed algorithm to find the scaling of A~.
∎
Bibliography20
The reference list from the paper itself. Each links out to its DOI / PubMed record.
1[1] Z. Allen-Zhu, Y. Li, R. Oliveira and A. Wigderson, Much faster algorithms for matrix scaling, ar Xiv:1704.02315.
2[2] R.B. Bapat D 1 A D 2 subscript 𝐷 1 𝐴 subscript 𝐷 2 D_{1}AD_{2} theorems for multidimensional matrices, Linear Algebra Appl. 48 (1982), 437–442.
3[3] R.B. Bapat and T.E.S. Raghavan, An extension of a theorem of Darroch and Ratcliff in loglinear models and its application to scaling multidimensional matrices, Linear Algebra Appl. 114/115 (1989), 705-715.
4[4] R.A. Brualdi, Convex sets of nonnegative matrices, Canad. J. Math 20 (1968), 144-157.
5[5] R.A. Brualdi, S.V. Parter and H. Schneider, The diagonal equivalence of a nonnegative matrix to a stochastic matrix, J. Math. Anal. Appl. 16 (1966), 31–50.
6[6] S. Bubeck, Y. T. Lee, M. Singh, A geometric alternative to Nesterov’s accelerated gradient descent, ar Xiv:1506.08187.
7[7] J. Franklin and J. Lorenz, On the scaling of multidimensional matrices, Linear Algebra Appl. 114/115 (1989), 717-735.
8[8] S. Friedland, Positive diagonal scaling of a nonnegative tensor to one with prescribed slice sums, Linear Algebra Appl. , vol. 434 (2011), 1615-1619.