Globally convergent Jacobi-type algorithms for simultaneous orthogonal symmetric tensor diagonalization

Jianze Li; Konstantin Usevich; Pierre Comon

arXiv:1702.03750·math.NA·July 28, 2017


TL;DR

This paper introduces and proves the global convergence of new Jacobi-type algorithms for the simultaneous orthogonal diagonalization of symmetric tensors and matrices, advancing tensor decomposition methods.

Contribution

It extends Jacobi algorithms to symmetric tensors and establishes their global convergence, including a new algorithm for smooth functions.

Findings

01

Proved global convergence of an existing Jacobi algorithm for symmetric matrices and third-order tensors.

02

Developed a new Jacobi-based algorithm with proven convergence for smooth functions.

03

Enhanced tensor diagonalization techniques for symmetric tensors.

Abstract

In this paper, we consider a family of Jacobi-type algorithms for the simultaneous orthogonal diagonalization problem of symmetric tensors. For the Jacobi-based algorithm of [SIAM J. Matrix Anal. Appl., 34(2):651–672, 2013], we prove its global convergence for simultaneous orthogonal diagonalization of symmetric matrices and 3rd-order tensors. We also propose a new Jacobi-based algorithm in the general setting and prove its global convergence for sufficiently smooth functions.


Equations

(\boldsymbol{\mathcal{T}}\mathop{\bullet_{k}}\boldsymbol{M})_{i_{1},\ldots,i_{d}}\stackrel{\sf def}{=}\sum_{j=1}^{n_{k}}\mathcal{T}_{i_{1},\ldots,i_{k-1},j,i_{k+1},\ldots,i_{d}}M_{i_{k},j}

\operatorname{diag}\{\boldsymbol{\mathcal{T}}\}\stackrel{\sf def}{=}\begin{bmatrix}\mathcal{T}_{1\ldots 1}&\cdots&\mathcal{T}_{n\ldots n}\end{bmatrix}^{\sf T}.

\boldsymbol{Q}_{*}=\arg\max_{\boldsymbol{Q}\in\mathscr{SO}_{n}}\sum_{\ell=1}^{m}\|\operatorname{diag}\{\boldsymbol{\mathcal{W}}^{(\ell)}\}\|^{2},

\boldsymbol{Q}^{*}=\arg\max_{\boldsymbol{Q}\in\mathscr{SO}_{n}}f(\boldsymbol{Q}).

\boldsymbol{Q}_{*}=\arg\min_{\boldsymbol{Q}\in\mathscr{SO}_{n}}\sum_{\ell=1}^{m}\|\operatorname{offdiag}\{\boldsymbol{\mathcal{W}}^{(\ell)}\}\|^{2},

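f(\boldsymbol{Q})=\|\boldsymbol{\mathcal{A}}\mathop{\bullet_{1}}(\boldsymbol{M}\boldsymbol{Q}^{\sf T})\mathop{\bullet_{2}}(\boldsymbol{M}\boldsymbol{Q}^{\sf T})\mathop{\bullet_{3}}(\boldsymbol{M}\boldsymbol{Q}^{\sf T})\|^{2}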

\boldsymbol{G}^{(i,j,\theta)}=\begin{bmatrix}1&&&&&&\\ &\ddots&&&&\boldsymbol{0}&\\ &&\cos\theta&&-\sin\theta&&\\ &&&\ddots&&&\\ &&\sin\theta&&\cos\theta&&\\ &\boldsymbol{0}&&&&\ddots&\\ &&&&&&1\end{bmatrix},

(\boldsymbol{G}^{(i,j,\theta)})_{k,l}=\begin{cases}1,&k=l,\ k\notin\{i,j\},\\ \cos\theta,&k=l,\ k\in\{i,j\},\\ \sin\theta,&(k,l)=(j,i),\\ -\sin\theta,&(k,l)=(i,j),\\ 0,&\text{otherwise}\end{cases}

h_{k}(\theta)\stackrel{\sf def}{=}f(\boldsymbol{Q}_{k-1}\boldsymbol{G}^{(i_{k},j_{k},\theta)}).

\boldsymbol{Q}_{k}=\boldsymbol{U}_{1}\cdots\boldsymbol{U}_{k-1}\boldsymbol{U}_{k},

(1, 2) \to (1, 3) \to \dots \to (1, n) \to (2, 3) \to \dots \to (2, n) \to \dots \to (n - 1, n) \to (1, 2) \to (1, 3) \to \dots .


\delta_{i,j}=\frac{d}{d\theta}\boldsymbol{G}^{(i,j,\theta)}\Big{|}_{\theta=0}=\begin{bmatrix}0&&&&&&\\ &\ddots&&&&\boldsymbol{0}&\\ &&0&&-1&&\\ &&&\ddots&&&\\ &&1&&0&&\\ &\boldsymbol{0}&&&&\ddots&\\ &&&&&&0\end{bmatrix}


d_{i,j}(\boldsymbol{Q})\stackrel{\sf def}{=}\boldsymbol{Q}\,\delta_{i,j}

\operatorname{Proj}\nabla f(\boldsymbol{Q})\stackrel{\sf def}{=}\boldsymbol{Q}\,\boldsymbol{\Lambda}(\boldsymbol{Q}),

\boldsymbol{\Lambda}(\boldsymbol{Q})\stackrel{\sf def}{=}\frac{\boldsymbol{Q}^{\sf T}\nabla f(\boldsymbol{Q})-(\nabla f(\boldsymbol{Q}))^{\sf T}\boldsymbol{Q}}{2}

|\langle\operatorname{Proj}\nabla f(\boldsymbol{Q}),d_{i,j}(\boldsymbol{Q})\rangle|\geq\varepsilon\|\operatorname{Proj}\nabla f(\boldsymbol{Q})\|,

|h_{k}'(0)|=|\langle\operatorname{Proj}\nabla f(\boldsymbol{Q}_{k-1}),d_{i_{k},j_{k}}(\boldsymbol{Q}_{k-1})\rangle|.

2|\Lambda_{i,j}(\boldsymbol{Q})|\geq\varepsilon\|\boldsymbol{\Lambda}(\boldsymbol{Q})\|.

(i,j)=\arg\max_{1\leq k,l\leq n}|\Lambda_{k,l}(\boldsymbol{Q})|,

2|\Lambda_{i,j}(\boldsymbol{Q})|=2\max_{1\leq k,l\leq n}|\Lambda_{k,l}(\boldsymbol{Q})|\geq\frac{2}{n}\|\boldsymbol{\Lambda}(\boldsymbol{Q})\|\geq\varepsilon\|\boldsymbol{\Lambda}(\boldsymbol{Q})\|.

f(\boldsymbol{Q})=\sum_{j=1}^{n}\mathcal{W}_{jj\ldots j}^{2}=\sum_{j=1}^{n}\Big(\sum_{p_{1},p_{2},\ldots,p_{d}}\mathcal{A}_{p_{1},p_{2},\ldots,p_{d}}Q_{p_{1},j}Q_{p_{2},j}\cdots Q_{p_{d},j}\Big)^{2}.

\begin{aligned}
\frac{\partial f}{\partial Q_{i,j}}&=2\mathcal{W}_{jj\ldots j}\Big[\cdots+q\binom{d}{q}Q_{i,j}^{q-1}\Big(\sum_{k_{1},\ldots,k_{d-q}\neq i}Q_{k_{1},j}\cdots Q_{k_{d-q},j}\mathcal{A}_{i,\ldots,i,k_{1},\ldots,k_{d-q}}\Big)+\cdots\\
&\phantom{=2\mathcal{W}_{jj\ldots j}\Big[}+d\sum_{k_{1},\ldots,k_{d-1}\neq i}Q_{k_{1},j}\cdots Q_{k_{d-1},j}\mathcal{A}_{i,k_{1},\ldots,k_{d-1}}\Big]\\
&=2d\,\mathcal{W}_{jj\ldots j}\sum_{k_{1},\ldots,k_{d-1}=1}^{n}Q_{k_{1},j}\cdots Q_{k_{d-1},j}\mathcal{A}_{i,k_{1},\ldots,k_{d-1}}=2d\,\mathcal{W}_{jj\ldots j}\mathcal{V}_{ij\ldots j},
\end{aligned}

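\nabla f(\boldsymbol{Q})=2d\,\boldsymbol{Q}\begin{bmatrix}\mathcal{W}_{11\ldots 1}&\mathcal{W}_{12\ldots 2}&\cdots&\mathcal{W}_{1n\ldots n}\\ \mathcal{W}_{21\ldots 1}&\mathcal{W}_{22\ldots 2}&\cdots&\mathcal{W}_{2n\ldots n}\\ \vdots&\vdots&&\vdots\\ \mathcal{W}_{n1\ldots 1}&\mathcal{W}_{n2\ldots 2}&\cdots&\mathcal{W}_{nn\ldots n}\end{bmatrix}\begin{bmatrix}\mathcal{W}_{1\ldots 1}&0&\cdots&0\\ 0&\ddots&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ 0&\cdots&0&\mathcal{W}_{n\ldots n}\end{bmatrix}.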

\operatorname{Proj}\nabla f(\boldsymbol{Q})=\boldsymbol{Q}\,\boldsymbol{\Lambda}(\boldsymbol{Q}),

\Lambda_{k,l}(\boldsymbol{Q})=\begin{cases}0,&k=l;\\ d(\mathcal{W}_{kl\ldots l}\mathcal{W}_{l\ldots l}-\mathcal{W}_{k\ldots k}\mathcal{W}_{k\ldots kl}),&k<l;\\ -\Lambda_{l,k}(\boldsymbol{Q}),&k>l.\end{cases}

3\boldsymbol{Q}\begin{bmatrix}0&\mathcal{W}_{122}\mathcal{W}_{222}-\mathcal{W}_{111}\mathcal{W}_{112}&\mathcal{W}_{133}\mathcal{W}_{333}-\mathcal{W}_{111}\mathcal{W}_{113}\\ \mathcal{W}_{111}\mathcal{W}_{112}-\mathcal{W}_{122}\mathcal{W}_{222}&0&\mathcal{W}_{233}\mathcal{W}_{333}-\mathcal{W}_{222}\mathcal{W}_{223}\\ \mathcal{W}_{111}\mathcal{W}_{113}-\mathcal{W}_{133}\mathcal{W}_{333}&\mathcal{W}_{222}\mathcal{W}_{223}-\mathcal{W}_{233}\mathcal{W}_{333}&0\end{bmatrix}.

\boldsymbol{\mathcal{W}}=\boldsymbol{\mathcal{A}}\mathop{\bullet_{1}}\boldsymbol{Q}^{\sf T}\mathop{\bullet_{2}}\boldsymbol{Q}^{\sf T}\cdots\mathop{\bullet_{d}}\boldsymbol{Q}^{\sf T},

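\boldsymbol{\mathcal{T}}=\boldsymbol{\mathcal{T}}(\theta)=\boldsymbol{\mathcal{W}}\mathop{\bullet_{1}}\boldsymbol{G}^{\sf T}(\theta)\mathop{\bullet_{2}}\boldsymbol{G}^{\sf T}(\theta)\cdots\mathop{\bullet_{d}}\boldsymbol{G}^{\sf T}(\theta),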


Taxonomy

Topics: Tensor decomposition and applications · Matrix Theory and Algorithms · Sparse and Compressive Sensing Techniques

Full text


Funding: This work was supported by the ERC project “DECODA” no. 320594, in the frame of the European program FP7/2007-2013. The first author was partially supported by the National Natural Science Foundation of China (No. 11601371).

Jianze Li, School of Mathematics, Tianjin University, Tianjin 300072, China.

Konstantin Usevich, GIPSA-Lab, CNRS and Univ. Grenoble Alpes, France.

Pierre Comon, GIPSA-Lab, CNRS and Univ. Grenoble Alpes, France.


Keywords: orthogonal tensor diagonalization, Jacobi rotation, global convergence, Łojasiewicz gradient inequality, proximal algorithm

AMS subject classifications: 15A69, 49M30, 65F99, 90C30

1 Introduction

Higher-order tensor decompositions have attracted a lot of attention in the last two decades because of their applications in various disciplines, including signal processing, numerical linear algebra and data analysis [5, 9, 20]. The most popular decompositions include the Canonical Polyadic and Tucker decompositions, where additional constraints are often imposed, such as symmetry [10], nonnegativity [22] or orthogonality [19].

An important class of tensor approximation problems is approximation with orthogonality constraints. In particular, the orthogonal symmetric tensor diagonalization problem for 3rd and 4th-order cumulant tensors is at the core of Independent Component Analysis [6, 7, 8], and is a popular way to solve blind source separation problems in signal processing [11]. In the same context, simultaneous orthogonal matrix diagonalization [4] is widely used; simultaneous orthogonal tensor diagonalization for slices of 4th-order cumulants is also used to solve source separation problems [16].

Main notations

Let $\mathbb{R}^{n_{1}\times\cdots\times n_{d}}\stackrel{\sf def}{=}\mathbb{R}^{n_{1}}\otimes\cdots\otimes\mathbb{R}^{n_{d}}$ denote the space of $d$th-order tensors. In the paper, tensors are typeset in a bold calligraphic font (e.g., $\boldsymbol{\mathcal{T}}$, $\boldsymbol{\mathcal{W}}$), and matrices are in bold (e.g., $\boldsymbol{Q}$, $\boldsymbol{U}$); the elements of tensors or matrices are typeset as $\mathcal{T}_{ijk}$ (or sometimes $\mathcal{T}_{i,j,k}$) or $Q_{ij}$, respectively.

For a tensor $\boldsymbol{\mathcal{T}}\in\mathbb{R}^{n_{1}\times\cdots\times n_{d}}$ and a matrix $\boldsymbol{M}\in\mathbb{R}^{m\times n_{k}}$ , their $k$ -mode product [20, subsection 2.5] is the tensor $(\boldsymbol{\mathcal{T}}{\mathop{\bullet_{k}}}\boldsymbol{M})\in\mathbb{R}^{n_{1}\times\cdots\times n_{k-1}\times{m}\times n_{k+1}\times\cdots\times n_{d}}$ defined as

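$$(\boldsymbol{\mathcal{T}}\mathop{\bullet_{k}}\boldsymbol{M})_{i_{1},\ldots,i_{d}}\stackrel{\sf def}{=}\sum_{j=1}^{n_{k}}\mathcal{T}_{i_{1},\ldots,i_{k-1},j,i_{k+1},\ldots,i_{d}}M_{i_{k},j}.$$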
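As an illustration, the $k$-mode product can be sketched in NumPy by contracting the $k$th index of the tensor with the second index of the matrix. The function name `mode_product` and the toy shapes are assumptions made for this example only.

```python
import numpy as np

def mode_product(T, M, k):
    """k-mode product of tensor T with matrix M (k is 1-based, as in the text).

    The k-th index of T is contracted with the column index of M, so the
    k-th dimension changes from n_k to m = M.shape[0].
    """
    axis = k - 1
    # tensordot appends the row index of M as the last axis; move it back to position k.
    out = np.tensordot(T, M, axes=([axis], [1]))
    return np.moveaxis(out, -1, axis)

# Toy shapes (illustrative): T is 2 x 3 x 4, M is 5 x 3, product along mode 2.
rng = np.random.default_rng(0)
T = rng.standard_normal((2, 3, 4))
M = rng.standard_normal((5, 3))
P = mode_product(T, M, 2)
print(P.shape)
```

Using `tensordot` followed by `moveaxis` keeps the sketch valid for any order $d$ without writing a separate contraction string per order.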

A tensor $\boldsymbol{\mathcal{T}}\in\mathbb{R}^{n\times\cdots\times n}$ is called symmetric if its entries do not change under any permutation of indices; its diagonal is, by definition, the vector

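$$\operatorname{diag}\{\boldsymbol{\mathcal{T}}\}\stackrel{\sf def}{=}\begin{bmatrix}\mathcal{T}_{1\ldots 1}&\cdots&\mathcal{T}_{n\ldots n}\end{bmatrix}^{\sf T}.$$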

We will always denote by $\|\cdot\|$ the Frobenius norm of a tensor or a matrix, or the Euclidean norm of a vector. Finally, let $\mathscr{O}_{n}\subset\mathbb{R}^{n\times n}$ denote the orthogonal group, that is, the set of orthogonal matrices. Let $\mathscr{SO}_{n}\subset\mathscr{O}_{n}$ denote the special orthogonal group, the set of orthogonal matrices with determinant $1$ .

Problem statement

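In this paper, we consider the following simultaneous orthogonal symmetric tensor diagonalization problem. Let $\{\boldsymbol{\mathcal{A}}^{(\ell)}:1\leq\ell\leq m\}\subset\mathbb{R}^{n\times\cdots\times n}$ be a set of symmetric tensors. We wish to find

$$\boldsymbol{Q}_{*}=\arg\max_{\boldsymbol{Q}\in\mathscr{SO}_{n}}\sum_{\ell=1}^{m}\|\operatorname{diag}\{\boldsymbol{\mathcal{W}}^{(\ell)}\}\|^{2},\tag{1}$$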

where $\boldsymbol{\mathcal{W}}^{(\ell)}=\boldsymbol{\mathcal{A}}^{(\ell)}\mathop{\bullet_{1}}\boldsymbol{Q}^{{\sf T}}\cdots\mathop{\bullet_{d}}\boldsymbol{Q}^{{\sf T}}$ for $1\leq\ell\leq m$ . Problem Eq. 1 has the following well-known problems as special cases:

• orthogonal tensor diagonalization problem if $m=1$ and $d>2$;

• simultaneous orthogonal matrix diagonalization problem if $m>1$ and $d=2$.
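The cost in problem Eq. 1 can be evaluated directly from its definition. The sketch below is a minimal NumPy version; the helper names `rotate`, `diag_of` and `cost`, and the toy data (two symmetric matrices built to share the same eigenvectors), are assumptions made for illustration.

```python
import numpy as np

def rotate(A, Q):
    """W = A •_1 Q^T ... •_d Q^T, i.e. W[i1..id] = sum_p A[p1..pd] Q[p1,i1]...Q[pd,id]."""
    W = A
    for axis in range(A.ndim):
        # Contract the current axis with the row index of Q, then restore the index order.
        W = np.moveaxis(np.tensordot(W, Q, axes=([axis], [0])), -1, axis)
    return W

def diag_of(W):
    """Vector of diagonal entries W[j, j, ..., j]."""
    return np.array([W[(j,) * W.ndim] for j in range(W.shape[0])])

def cost(tensors, Q):
    """Sum over l of ||diag{W^(l)}||^2."""
    return float(sum(np.sum(diag_of(rotate(A, Q)) ** 2) for A in tensors))

# Toy instance (d = 2, m = 2): two symmetric matrices with common eigenvectors U.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
if np.linalg.det(U) < 0:
    U[:, 0] *= -1  # make U a rotation (determinant +1)
mats = [U @ np.diag(rng.standard_normal(4)) @ U.T for _ in range(2)]
total = sum(np.linalg.norm(A) ** 2 for A in mats)
print(cost(mats, U), total)  # the two numbers agree up to rounding
```

For this jointly diagonalizable toy set, the cost at $\boldsymbol{Q}=\boldsymbol{U}$ equals the total squared norm, which is the largest value the cost can take.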

Several algorithms have been proposed in the literature to solve special cases of problem Eq. 1. The first were the Jacobi CoM (Contrast Maximization) algorithm for orthogonal diagonalization of 3rd and 4th-order symmetric tensors [6, 7, 8] and the JADE (Joint Approximate Diagonalization of Eigenmatrices) algorithm for simultaneous orthogonal matrix diagonalization [4]. An algorithm for simultaneous orthogonal 3rd-order tensor diagonalization was proposed in [14]. These Jacobi-type algorithms have been very widely used in applications [11], and have the advantage that Jacobi rotations can be computed by finding roots of low-order polynomials. Nevertheless, to our knowledge, the convergence of these methods had not been proved, although it was often observed in practice [8, 13].

Contribution

In this paper, we consider several Jacobi-type algorithms to solve problem Eq. 1, and study their global convergence properties. By global convergence, we mean that, for any starting point, the whole sequence of iterates produced by the algorithm always converges to a limit point (this was also called single-point convergence in [26]). Note that global convergence does not guarantee convergence to a global maximum, since the cost function is multimodal.

First, we consider the Jacobi-based algorithm proposed in [17] for best low multilinear rank approximation of 3rd-order symmetric tensors. The algorithm uses a gradient-based order of Jacobi rotations, hence we call this algorithm Jacobi-G in our paper. For the Jacobi-G algorithm, the convergence of subsequences of iterates to stationary points was established in [17]. We prove that, for problem Eq. 1, the Jacobi-G algorithm converges globally and the limit point is always a stationary point in the cases $d=2,3$. The proof is based on a variant of the Łojasiewicz theorem developed in [25] for analytic submanifolds of Euclidean space.

Second, we propose a new Jacobi-based algorithm inspired by proximal methods [24], which is called the Jacobi-PC algorithm in this paper. We show that the Jacobi-PC algorithm always converges globally to a stationary point in the general setting. In particular, for the cases $d=3,4$ of problem Eq. 1, the Jacobi-PC algorithm allows for a simple algebraic solution to find the optimal Jacobi rotation, as in the Jacobi CoM algorithm [6, 7, 8]. In addition, this algorithm does not need the order of Jacobi rotations introduced in the Jacobi-G algorithm. Finally, the global convergence of the Jacobi-PC algorithm is proved for sufficiently smooth functions. Therefore, these results may be applied to other tensor approximation problems (e.g., nonsymmetric orthogonal diagonalization [23], or Tucker approximation of symmetric [17] or antisymmetric [3] tensors).

Organization

The paper is organized as follows. In Sections 2 to 4, we present known results and their easy extensions. In Sections 5 to 7, we present the main results of the paper. In Section 2, we introduce the abstract optimization problem on $\mathscr{SO}_{n}$ and the general Jacobi algorithm to solve this abstract problem. In Section 3, we recall the Jacobi-G algorithm proposed in [17] and its local convergence properties in the general setting. In Section 4, we list some properties that are specific to the orthogonal tensor diagonalization problem. In Section 5, we prove the global convergence of the Jacobi-G algorithm for simultaneous orthogonal diagonalization of matrices and 3rd-order tensors. In Section 6, we propose the Jacobi-PC algorithm and prove its global convergence. We derive formulas for optimal Jacobi rotations in the case of 3rd and 4th-order tensors. In Section 7, we provide some numerical experiments.

2 Problem statement and Jacobi rotations

2.1 Abstract optimization problem

Let $\textit{f}:\mathscr{SO}_{n}\to\mathbb{R}$ be a function. The abstract optimization problem is to find $\boldsymbol{Q}^{*}\in\mathscr{SO}_{n}$ such that

$$\boldsymbol{Q}^{*}=\arg\max_{\boldsymbol{Q}\in\mathscr{SO}_{n}}f(\boldsymbol{Q}).\tag{2}$$

Example 1.

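Problem Eq. 1 in Section 1 is a special case of problem Eq. 2. Because $\|\boldsymbol{\mathcal{A}}^{(\ell)}\|=\|\boldsymbol{\mathcal{W}}^{(\ell)}\|$ for any $1\leq\ell\leq m$, problem Eq. 1 is equivalent to finding

$$\boldsymbol{Q}_{*}=\arg\min_{\boldsymbol{Q}\in\mathscr{SO}_{n}}\sum_{\ell=1}^{m}\|\operatorname{offdiag}\{\boldsymbol{\mathcal{W}}^{(\ell)}\}\|^{2},$$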

where $\operatorname{offdiag}\{\boldsymbol{\mathcal{W}}^{(\ell)}\}$ is the vector of elements of $\boldsymbol{\mathcal{W}}^{(\ell)}$ except the diagonal elements.
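The norm invariance used in Example 1 is easy to check numerically. The following sketch, for a random symmetric 3rd-order tensor (toy data chosen for illustration), verifies that an orthogonal change of basis preserves the Frobenius norm, so that maximizing the diagonal part is the same as minimizing the off-diagonal part.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(1)
n = 4
X = rng.standard_normal((n, n, n))
# Symmetrize by averaging over all 6 permutations of the three indices.
A = sum(np.transpose(X, p) for p in permutations(range(3))) / 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal matrix

# W = A •_1 Q^T •_2 Q^T •_3 Q^T written as an explicit index contraction.
W = np.einsum('pqr,pi,qj,rk->ijk', A, Q, Q, Q)

diag_sq = float(sum(W[j, j, j] ** 2 for j in range(n)))
offdiag_sq = float(np.sum(W ** 2) - diag_sq)
print(np.linalg.norm(A) ** 2, diag_sq + offdiag_sq)  # equal up to rounding
```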

Example 2.

In [17], the best low multilinear rank approximation problem for 3rd order symmetric tensors was formulated as a special case of problem Eq. 2. In fact, based on [15, Theorem 4.1], the cost function has the form

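$$f(\boldsymbol{Q})=\|\boldsymbol{\mathcal{A}}\mathop{\bullet_{1}}(\boldsymbol{M}\boldsymbol{Q}^{\sf T})\mathop{\bullet_{2}}(\boldsymbol{M}\boldsymbol{Q}^{\sf T})\mathop{\bullet_{3}}(\boldsymbol{M}\boldsymbol{Q}^{\sf T})\|^{2},\tag{3}$$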

where $r<n$ and $\boldsymbol{M}=\begin{bmatrix}I_{r}&0\\ 0&0\end{bmatrix}\in\mathbb{R}^{n\times n}$.

Example 3.

Under some assumptions, it was shown that the orthogonal tensor diagonalization problem could be solved in the sense of maximization of the trace of a tensor [12], and is also a special case of problem Eq. 2.

2.2 Givens rotations and the general Jacobi algorithm

Let $\theta\in\mathbb{R}$ be an angle and $(i,j)$ be a pair of indices with $1\leq i<j\leq n$ . We denote the Givens rotation matrix by

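$$\boldsymbol{G}^{(i,j,\theta)}=\begin{bmatrix}1&&&&&&\\ &\ddots&&&&\boldsymbol{0}&\\ &&\cos\theta&&-\sin\theta&&\\ &&&\ddots&&&\\ &&\sin\theta&&\cos\theta&&\\ &\boldsymbol{0}&&&&\ddots&\\ &&&&&&1\end{bmatrix},$$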

i.e., the matrix defined by

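$$(\boldsymbol{G}^{(i,j,\theta)})_{k,l}=\begin{cases}1,&k=l,\ k\notin\{i,j\},\\ \cos\theta,&k=l,\ k\in\{i,j\},\\ \sin\theta,&(k,l)=(j,i),\\ -\sin\theta,&(k,l)=(i,j),\\ 0,&\text{otherwise}\end{cases}$$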

for $1\leq k,l\leq n$ . We summarize the general Jacobi algorithm in Algorithm 1.
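A Givens rotation is the identity matrix with four modified entries: $\cos\theta$ at positions $(i,i)$ and $(j,j)$, $\sin\theta$ at $(j,i)$, and $-\sin\theta$ at $(i,j)$. A minimal NumPy sketch follows; the function name `givens` is an assumption, and indices are 1-based to match the text.

```python
import numpy as np

def givens(n, i, j, theta):
    """Givens rotation G^(i,j,theta) with 1-based indices i < j, as in the text."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i - 1, i - 1] = c
    G[j - 1, j - 1] = c
    G[j - 1, i - 1] = s    # entry (j, i)
    G[i - 1, j - 1] = -s   # entry (i, j)
    return G

G = givens(4, 1, 3, 0.3)
print(np.round(G, 3))
```

The matrix is orthogonal with determinant $1$, so right-multiplying an element of $\mathscr{SO}_{n}$ by it stays in $\mathscr{SO}_{n}$.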

Note that in Algorithm 1 (and all the following algorithms, unless mentioned explicitly) we do not specify a stopping criterion, since our goal is to analyse the whole sequence of iterations $\{\boldsymbol{Q}_{k}\}_{k\geq 1}$ produced by the algorithm. The stopping criterion applies only to software implementation of algorithms.

Algorithm 1 generates the sequence of iterates $\boldsymbol{Q}_{k}$ by multiplicative updates:

$$\boldsymbol{Q}_{k}=\boldsymbol{U}_{1}\cdots\boldsymbol{U}_{k-1}\boldsymbol{U}_{k},\tag{5}$$

where each $\boldsymbol{U}_{k}$ is an elementary rotation. The advantage of this Jacobi-type algorithm is that each update is a one-dimensional optimization problem, which can be solved efficiently.

Remark 2.1.

Note that the choice of the pair $(i_{k},j_{k})$ at every iteration is not specified in Algorithm 1. One of the most natural rules is to choose the pairs in cyclic fashion, as follows.

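$$(1,2)\to(1,3)\to\cdots\to(1,n)\to(2,3)\to\cdots\to(2,n)\to\cdots\to(n-1,n)\to(1,2)\to(1,3)\to\cdots.\tag{6}$$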

We call the Jacobi algorithm with this cyclic-by-row rule the Jacobi-C algorithm.

In the Jacobi-C algorithm, the choice of pairs is periodic with period $n(n-1)/2$. Each set of $n(n-1)/2$ iterations is called a sweep. The pair selection rule Eq. 6 was used in the Jacobi CoM algorithm [6, 7, 8] and the JADE algorithm [4].
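To make the cyclic-by-row scheme concrete, here is a hedged sketch of a Jacobi-C style loop for a single symmetric matrix ($d=2$, $m=1$). It is not the authors' implementation: the one-dimensional maximization is done by a plain grid search over $[-\pi/4,\pi/4]$ instead of an algebraic root-finding step, and the names `givens`, `jacobi_c`, the number of sweeps and the grid size are all assumptions.

```python
import numpy as np
from itertools import combinations

def givens(n, i, j, theta):
    """Givens rotation in the (i, j) plane, 0-based indices with i < j."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[j, i], G[i, j] = s, -s
    return G

def jacobi_c(f, n, sweeps=6, grid=401):
    """Cyclic-by-row Jacobi sketch: maximize f one rotation plane at a time."""
    Q = np.eye(n)
    thetas = np.linspace(-np.pi / 4, np.pi / 4, grid)
    history = [f(Q)]
    for _ in range(sweeps):
        # combinations yields (0,1), (0,2), ..., (n-2,n-1): the cyclic-by-row order.
        for i, j in combinations(range(n), 2):
            values = [f(Q @ givens(n, i, j, t)) for t in thetas]
            Q = Q @ givens(n, i, j, thetas[int(np.argmax(values))])
            history.append(f(Q))
    return Q, history

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2  # random symmetric matrix (toy data)
cost = lambda Q: float(np.sum(np.diag(Q.T @ A @ Q) ** 2))
Q, history = jacobi_c(cost, 4)
print(history[0], history[-1], np.linalg.norm(A) ** 2)
```

Because the grid contains an angle (numerically) equal to zero, the cost never decreases from one iteration to the next, up to rounding. The grid search limits the attainable accuracy to roughly the grid spacing.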

Remark 2.2.

If several equivalent maximizers are present in Eq. 4, here and in all the following algorithms, we choose the one with the angle of smallest magnitude.

3 Jacobi-G algorithm

In this section, we recall the Jacobi-based algorithm in [17] and its properties. For simplicity, this algorithm is called the Jacobi-G (gradient-based Jacobi-type) algorithm in our paper.

3.1 Technical definitions

Define the matrix

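$$\delta_{i,j}=\frac{d}{d\theta}\boldsymbol{G}^{(i,j,\theta)}\Big|_{\theta=0}=\begin{bmatrix}0&&&&&&\\ &\ddots&&&&\boldsymbol{0}&\\ &&0&&-1&&\\ &&&\ddots&&&\\ &&1&&0&&\\ &\boldsymbol{0}&&&&\ddots&\\ &&&&&&0\end{bmatrix}$$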

and introduce the notation

$$d_{i,j}(\boldsymbol{Q})\stackrel{\sf def}{=}\boldsymbol{Q}\,\delta_{i,j}$$

for $\boldsymbol{Q}\in\mathscr{O}_{n}$. Next, for a differentiable function $f:\mathscr{O}_{n}\to\mathbb{R}$, we define the projected gradient [17, Lemma 5.1] as

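$$\operatorname{Proj}\nabla f(\boldsymbol{Q})\stackrel{\sf def}{=}\boldsymbol{Q}\,\boldsymbol{\Lambda}(\boldsymbol{Q}),\tag{7}$$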

where

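$$\boldsymbol{\Lambda}(\boldsymbol{Q})\stackrel{\sf def}{=}\frac{\boldsymbol{Q}^{\sf T}\nabla f(\boldsymbol{Q})-(\nabla f(\boldsymbol{Q}))^{\sf T}\boldsymbol{Q}}{2}\tag{8}$$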

and $\nabla f(\boldsymbol{Q})$ is the Euclidean gradient of $f$ as a function of the matrix argument. The projected gradient $\operatorname{Proj}\nabla f(\boldsymbol{Q})$ is exactly the Riemannian gradient if $\mathscr{O}_{n}$ is viewed as an embedded submanifold of $\mathbb{R}^{n\times n}$ [2].

3.2 Jacobi-G algorithm

In [17], a modification of Algorithm 1 was proposed that chooses, at each iteration, a pair $(i,j)$ that satisfies

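$$|\langle\operatorname{Proj}\nabla f(\boldsymbol{Q}),d_{i,j}(\boldsymbol{Q})\rangle|\geq\varepsilon\|\operatorname{Proj}\nabla f(\boldsymbol{Q})\|,\tag{9}$$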

where $\varepsilon$ is a small positive constant.

Note that, as shown in [17, Proof of Lemma 5.3], the left-hand side of Eq. 9 can also be written in terms of the function $h_{k}$:

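$$|h_{k}'(0)|=|\langle\operatorname{Proj}\nabla f(\boldsymbol{Q}_{k-1}),d_{i_{k},j_{k}}(\boldsymbol{Q}_{k-1})\rangle|.\tag{10}$$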
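The link between the directional derivative $h_{k}'(0)$ and the skew-symmetric matrix $\boldsymbol{\Lambda}(\boldsymbol{Q})$ can be checked numerically. The sketch below does so for the matrix diagonalization cost $f(\boldsymbol{Q})=\|\operatorname{diag}\{\boldsymbol{Q}^{\sf T}\boldsymbol{A}\boldsymbol{Q}\}\|^{2}$, using a central finite difference for $h_{k}'(0)$ and the fact, used in the proof of Lemma 3.1, that the inner product in Eq. 9 has magnitude $2|\Lambda_{i,j}(\boldsymbol{Q})|$. The function names, the test matrix and the step size are assumptions for illustration.

```python
import numpy as np

def givens(n, i, j, theta):
    """Givens rotation in the (i, j) plane, 0-based indices with i < j."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[j, i], G[i, j] = s, -s
    return G

def f(Q, A):
    """Sum of squared diagonal entries of W = Q^T A Q (case d = 2, m = 1)."""
    return float(np.sum(np.diag(Q.T @ A @ Q) ** 2))

def euclid_grad(Q, A):
    """Euclidean gradient: column j of A Q scaled by 4 * W_jj."""
    w = np.diag(Q.T @ A @ Q)
    return 4.0 * (A @ Q) * w

def Lambda(Q, grad):
    """Skew-symmetric part of Q^T grad, as in the definition of the projected gradient."""
    X = Q.T @ grad
    return (X - X.T) / 2

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))
A = (B + B.T) / 2
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
Lam = Lambda(Q, euclid_grad(Q, A))

i, j, eps = 0, 2, 1e-6
h = lambda t: f(Q @ givens(n, i, j, t), A)
dh0 = (h(eps) - h(-eps)) / (2 * eps)  # central difference for h'(0)
print(abs(dh0), 2 * abs(Lam[i, j]))
```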

The following lemma (an easy generalisation of [17, Lemma 5.2]) shows that it is always possible to choose such a pair $(i_{k},j_{k})$ .

Lemma 3.1.

For any differentiable function $f:\mathscr{SO}_{n}\to\mathbb{R}$, $\boldsymbol{Q}\in\mathscr{SO}_{n}$ and $0<\varepsilon\leq 2/n$, it is always possible to find $(i,j)$, with $i<j$, such that (9) holds.

Proof 3.2.

First of all, thanks to the representation (7), rotation invariance of the Euclidean norm and $\boldsymbol{\Lambda}(\boldsymbol{Q})$ being skew-symmetric, we have that the condition (9) is equivalent to

$$2|\Lambda_{i,j}(\boldsymbol{Q})|\geq\varepsilon\|\boldsymbol{\Lambda}(\boldsymbol{Q})\|.\tag{11}$$

Now we choose the pair $(i,j)$ that maximizes the left-hand side of (11):

$$(i,j)=\arg\max_{1\leq k,l\leq n}|\Lambda_{k,l}(\boldsymbol{Q})|.$$

Since $\boldsymbol{\Lambda}(\boldsymbol{Q})$ is skew-symmetric, we can choose such a pair with $i<j$ . Finally, by the well known inequality between matrix norms,

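$$2|\Lambda_{i,j}(\boldsymbol{Q})|=2\max_{1\leq k,l\leq n}|\Lambda_{k,l}(\boldsymbol{Q})|\geq\frac{2}{n}\|\boldsymbol{\Lambda}(\boldsymbol{Q})\|\geq\varepsilon\|\boldsymbol{\Lambda}(\boldsymbol{Q})\|.$$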
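The pair selection used in this proof can be sketched as follows: pick the entry of largest magnitude in the strict upper triangle of the skew-symmetric matrix $\boldsymbol{\Lambda}(\boldsymbol{Q})$, then check the bound $2|\Lambda_{i,j}|\geq\frac{2}{n}\|\boldsymbol{\Lambda}\|$. The function name `select_pair` and the random test matrix are assumptions; indices are 0-based in the code.

```python
import numpy as np

def select_pair(Lam):
    """Return (i, j) with i < j (0-based) maximizing |Lam[i, j]| for skew-symmetric Lam."""
    rows, cols = np.triu_indices(Lam.shape[0], k=1)
    idx = int(np.argmax(np.abs(Lam[rows, cols])))
    return int(rows[idx]), int(cols[idx])

rng = np.random.default_rng(3)
n = 6
X = rng.standard_normal((n, n))
Lam = (X - X.T) / 2  # random skew-symmetric matrix standing in for Lambda(Q)

i, j = select_pair(Lam)
lhs = 2 * abs(Lam[i, j])
rhs = (2 / n) * np.linalg.norm(Lam)  # Frobenius norm
print(i, j, lhs >= rhs)
```

Restricting the search to the upper triangle is enough because $|\Lambda_{k,l}|=|\Lambda_{l,k}|$ and the diagonal is zero.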

Remark 3.3.

Lemma 3.1 was proved in [17, Lemma 5.2] only for the cost function Eq. 3, although it is valid for any differentiable function. Hence, the convergence results of [17] are valid for a wide class of functions (see the following theorem).

Theorem 3.4 ([17, Theorem 5.4] combined with Lemma 3.1).

Let $f$ be a $C^{\infty}$ function. Then every accumulation point $\boldsymbol{Q}_{*}$ (i.e., the limit of every convergent subsequence) of the sequence $\{\boldsymbol{Q}_{k}\}_{k\geq 1}$ produced by Algorithm 2 is a stationary point of the function $f$ (i.e., $\operatorname{Proj}\nabla f(\boldsymbol{Q}_{*})=0$).

3.3 Variants of the Jacobi-G algorithm

Note that the description of Algorithm 2 does not say precisely how to select the pairs $(i_{k},j_{k})$ that satisfy Eq. 9. The first option is suggested by the proof of Lemma 3.1: set $\varepsilon=\frac{2}{n}$ and select the element of maximal magnitude in Eq. 8. We summarize this option in Algorithm 3.

However, choosing the maximal element in Eq. 8 requires a search over all the elements, which may take additional time. In [17], it was suggested to take $\varepsilon\ll 1$ . If we choose $(i_{k},j_{k})$ in the cyclic order and $\varepsilon$ is very small, then it is natural to expect that the inequality Eq. 9 will be often satisfied, and thus the behavior of the algorithm will be very close to the behavior of the Jacobi-C algorithm.

To make this idea more rigorous, we construct a modification of Jacobi-C algorithm that skips the rotations if the magnitude of the directional derivative is below a given threshold. The modification is described in Algorithm 4.

Note that Algorithm 4 differs from the other algorithms in its output, because the rule of skipping rotations also yields a well-defined stopping criterion. Also note that, in contrast to Jacobi-G, the inequality Eq. 9 is not needed. However, Algorithm 4 can be viewed as a special case of Algorithm 2, as shown by the following remark.

Remark 3.5.

There exists $\varepsilon=\varepsilon(\delta)$ such that the iterates $\boldsymbol{Q}_{k}$ produced by Algorithm 4 are the first elements of the sequence produced by a Jacobi-G algorithm (Algorithm 2). Indeed, consider $c=\max_{\boldsymbol{Q}\in\mathscr{SO}_{n}}\|\operatorname{Proj}\nabla f(\boldsymbol{Q})\|$, which is finite if $f$ is smooth (since $\mathscr{SO}_{n}$ is compact). Hence $\varepsilon=\frac{\delta}{nc}$ is the required value of $\varepsilon$.

Also note that Algorithm 4 always terminates if the Jacobi-G algorithm converges to a stationary point.

4 Derivatives in the orthogonal tensor diagonalization problems

4.1 Projected gradient: the case of a single tensor ( $m=1$ )

In this subsection, we derive a concrete form of the projected gradient Eq. 7, which will be used in Section 5. Let $\boldsymbol{\mathcal{A}}\in\mathbb{R}^{n\times\cdots\times n}$ be a $d$th-order symmetric tensor and $\boldsymbol{Q}\in\mathscr{O}_{n}$ be an orthogonal matrix. Let $\boldsymbol{\mathcal{W}}=\boldsymbol{\mathcal{A}}\mathop{\bullet_{1}}\boldsymbol{Q}^{{\sf T}}\cdots\mathop{\bullet_{d}}\boldsymbol{Q}^{{\sf T}}$ and

$$f(\boldsymbol{Q})=\sum_{j=1}^{n}\mathcal{W}_{jj\ldots j}^{2}=\sum_{j=1}^{n}\Big(\sum_{p_{1},p_{2},\ldots,p_{d}}\mathcal{A}_{p_{1},p_{2},\ldots,p_{d}}Q_{p_{1},j}Q_{p_{2},j}\cdots Q_{p_{d},j}\Big)^{2}.\tag{12}$$

First, we calculate the Euclidean gradient of $f$ at $\boldsymbol{Q}$. Let us fix $i$ and $j$. Then

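$$\begin{aligned}
\frac{\partial f}{\partial Q_{i,j}}&=2\mathcal{W}_{jj\ldots j}\Big[\cdots+q\binom{d}{q}Q_{i,j}^{q-1}\Big(\sum_{k_{1},\ldots,k_{d-q}\neq i}Q_{k_{1},j}\cdots Q_{k_{d-q},j}\mathcal{A}_{i,\ldots,i,k_{1},\ldots,k_{d-q}}\Big)+\cdots\\
&\phantom{=2\mathcal{W}_{jj\ldots j}\Big[}+d\sum_{k_{1},\ldots,k_{d-1}\neq i}Q_{k_{1},j}\cdots Q_{k_{d-1},j}\mathcal{A}_{i,k_{1},\ldots,k_{d-1}}\Big]\\
&=2d\,\mathcal{W}_{jj\ldots j}\sum_{k_{1},\ldots,k_{d-1}=1}^{n}Q_{k_{1},j}\cdots Q_{k_{d-1},j}\mathcal{A}_{i,k_{1},\ldots,k_{d-1}}=2d\,\mathcal{W}_{jj\ldots j}\mathcal{V}_{ij\ldots j},
\end{aligned}$$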

where $\boldsymbol{\mathcal{V}}=\boldsymbol{\mathcal{A}}\mathop{\bullet_{2}}\boldsymbol{Q}^{{\sf T}}\cdots\mathop{\bullet_{d}}\boldsymbol{Q}^{{\sf T}}$ . Noting that $\boldsymbol{\mathcal{V}}=\boldsymbol{\mathcal{W}}\mathop{\bullet_{1}}\boldsymbol{Q}$ , we get that

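$$\nabla f(\boldsymbol{Q})=2d\,\boldsymbol{Q}\begin{bmatrix}\mathcal{W}_{11\ldots 1}&\mathcal{W}_{12\ldots 2}&\cdots&\mathcal{W}_{1n\ldots n}\\ \mathcal{W}_{21\ldots 1}&\mathcal{W}_{22\ldots 2}&\cdots&\mathcal{W}_{2n\ldots n}\\ \vdots&\vdots&&\vdots\\ \mathcal{W}_{n1\ldots 1}&\mathcal{W}_{n2\ldots 2}&\cdots&\mathcal{W}_{nn\ldots n}\end{bmatrix}\begin{bmatrix}\mathcal{W}_{1\ldots 1}&0&\cdots&0\\ 0&\ddots&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ 0&\cdots&0&\mathcal{W}_{n\ldots n}\end{bmatrix}.$$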

After projecting $\nabla\textit{f}(\boldsymbol{Q})$ onto the tangent space at $\boldsymbol{Q}$ to the manifold $\mathscr{O}_{n}$ , we get

$$\operatorname{Proj}\nabla f(\boldsymbol{Q})=\boldsymbol{Q}\,\boldsymbol{\Lambda}(\boldsymbol{Q}),$$

where $\boldsymbol{\Lambda}(\boldsymbol{Q})$ is the matrix with

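$$\Lambda_{k,l}(\boldsymbol{Q})=\begin{cases}0,&k=l;\\ d(\mathcal{W}_{kl\ldots l}\mathcal{W}_{l\ldots l}-\mathcal{W}_{k\ldots k}\mathcal{W}_{k\ldots kl}),&k<l;\\ -\Lambda_{l,k}(\boldsymbol{Q}),&k>l.\end{cases}\tag{13}$$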

for any $1\leq k,l\leq n$ . This is a special case of Eq. 8 for function Eq. 12.
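As a sanity check of the closed form for $\boldsymbol{\Lambda}(\boldsymbol{Q})$ in the case $d=3$, where $\Lambda_{k,l}=3(\mathcal{W}_{kll}\mathcal{W}_{lll}-\mathcal{W}_{kkk}\mathcal{W}_{kkl})$ for $k<l$, the sketch below compares it with the general definition Eq. 8 evaluated from a finite-difference Euclidean gradient. The helper names, the random data and the step size are assumptions for illustration.

```python
import numpy as np
from itertools import permutations

def rotate3(A, Q):
    """W = A •_1 Q^T •_2 Q^T •_3 Q^T for a 3rd-order tensor."""
    return np.einsum('pqr,pi,qj,rk->ijk', A, Q, Q, Q)

def f(Q, A):
    """Sum of squared diagonal entries of W, as a polynomial in the entries of Q."""
    W = rotate3(A, Q)
    return float(sum(W[j, j, j] ** 2 for j in range(A.shape[0])))

def Lambda_closed_form(W):
    """Closed form for d = 3, filled in skew-symmetrically."""
    n = W.shape[0]
    Lam = np.zeros((n, n))
    for k in range(n):
        for l in range(k + 1, n):
            Lam[k, l] = 3 * (W[k, l, l] * W[l, l, l] - W[k, k, k] * W[k, k, l])
            Lam[l, k] = -Lam[k, l]
    return Lam

def Lambda_numeric(Q, A, eps=1e-6):
    """Lambda(Q) = (Q^T grad - grad^T Q) / 2 with a central-difference gradient."""
    n = Q.shape[0]
    grad = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            E = np.zeros((n, n))
            E[a, b] = eps
            grad[a, b] = (f(Q + E, A) - f(Q - E, A)) / (2 * eps)
    X = Q.T @ grad
    return (X - X.T) / 2

rng = np.random.default_rng(4)
n = 3
X = rng.standard_normal((n, n, n))
A = sum(np.transpose(X, p) for p in permutations(range(3))) / 6  # symmetric tensor
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

L_closed = Lambda_closed_form(rotate3(A, Q))
L_num = Lambda_numeric(Q, A)
print(np.max(np.abs(L_closed - L_num)))
```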

Remark 4.1.

Let $\boldsymbol{\mathcal{A}}$ be a 3rd order symmetric tensor. Then $\mathop{{\operator@font Proj}\nabla}f(\boldsymbol{Q})=$

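$$3\boldsymbol{Q}\begin{bmatrix}0&\mathcal{W}_{122}\mathcal{W}_{222}-\mathcal{W}_{111}\mathcal{W}_{112}&\mathcal{W}_{133}\mathcal{W}_{333}-\mathcal{W}_{111}\mathcal{W}_{113}\\ \mathcal{W}_{111}\mathcal{W}_{112}-\mathcal{W}_{122}\mathcal{W}_{222}&0&\mathcal{W}_{233}\mathcal{W}_{333}-\mathcal{W}_{222}\mathcal{W}_{223}\\ \mathcal{W}_{111}\mathcal{W}_{113}-\mathcal{W}_{133}\mathcal{W}_{333}&\mathcal{W}_{222}\mathcal{W}_{223}-\mathcal{W}_{233}\mathcal{W}_{333}&0\end{bmatrix}.$$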

4.2 The cost function at each iteration

Now consider a single iteration of Algorithm 1 for the cost function (12). For simplicity, we consider only the case $(i_{k},j_{k})=(1,2)$ (other cases follow by substitution of indices), and denote $\boldsymbol{G}(\theta)\stackrel{\sf def}{=}\boldsymbol{G}^{(1,2,\theta)}$. Then $\boldsymbol{Q}_{k}=\boldsymbol{Q}_{k-1}\boldsymbol{G}(\theta^{*}_{k})$.

In this subsection, we take $\boldsymbol{Q}=\boldsymbol{Q}_{k-1}$ , so that

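$$\boldsymbol{\mathcal{W}}=\boldsymbol{\mathcal{A}}\mathop{\bullet_{1}}\boldsymbol{Q}^{\sf T}\mathop{\bullet_{2}}\boldsymbol{Q}^{\sf T}\cdots\mathop{\bullet_{d}}\boldsymbol{Q}^{\sf T}$$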

is the rotated tensor before the $k$th iteration. We also use the notation $\boldsymbol{\mathcal{T}}$ for the candidate tensors after the $k$th iteration, i.e.,

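$$\boldsymbol{\mathcal{T}}=\boldsymbol{\mathcal{T}}(\theta)=\boldsymbol{\mathcal{W}}\mathop{\bullet_{1}}\boldsymbol{G}^{\sf T}(\theta)\mathop{\bullet_{2}}\boldsymbol{G}^{\sf T}(\theta)\cdots\mathop{\bullet_{d}}\boldsymbol{G}^{\sf T}(\theta),$$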

so that $h_{k}(\theta)=\|\operatorname{diag}\{\boldsymbol{\mathcal{T}}(\theta)\}\|^{2}$. Note that we omit $k$ in the notation for $\boldsymbol{\mathcal{W}}$ and $\boldsymbol{\mathcal{T}}$, but this will not lead to confusion within one iteration.

Although $h_{k}(\theta)$ is defined on the whole real line, it is periodic. Apart from the obvious period $2\pi$, it has a smaller period $\pi/2$. Indeed,

[TABLE]

Hence, the tensor $\boldsymbol{\mathcal{T}}(\theta+\pi/2)$ differs from $\boldsymbol{\mathcal{T}}(\theta)$ by permutations of the indices $1$ and $2$ and changes of signs, which does not change the cost function. Hence, due to Remark 2.2, we are, in fact, maximizing $h_{k}(\theta)$ on the interval $[-\pi/4,\pi/4]$. In fact, in all the algorithms we choose $\theta^{*}_{k}\in[-\pi/4,\pi/4]$ with $h_{k}(\theta^{*}_{k})=\max\limits_{\theta\in\mathbb{R}}h_{k}(\theta)$.
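The $\pi/2$-periodicity of $h_{k}$ can be observed numerically. The sketch below evaluates $h(\theta)=\|\operatorname{diag}\{\boldsymbol{\mathcal{T}}(\theta)\}\|^{2}$ for a random symmetric 3rd-order tensor and the rotation plane $(1,2)$, and compares $h(\theta+\pi/2)$ with $h(\theta)$ at a few angles. The names `givens12` and `h` and the sample angles are assumptions for illustration.

```python
import numpy as np
from itertools import permutations

def givens12(n, theta):
    """Givens rotation G^(1,2,theta) (rotation in the plane of the first two axes)."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[0, 0] = G[1, 1] = c
    G[1, 0], G[0, 1] = s, -s
    return G

def h(theta, W):
    """Squared norm of the diagonal of T(theta) = W •_1 G^T •_2 G^T •_3 G^T."""
    G = givens12(W.shape[0], theta)
    T = np.einsum('pqr,pi,qj,rk->ijk', W, G, G, G)
    return float(sum(T[j, j, j] ** 2 for j in range(W.shape[0])))

rng = np.random.default_rng(5)
X = rng.standard_normal((3, 3, 3))
W = sum(np.transpose(X, p) for p in permutations(range(3))) / 6  # symmetric tensor

thetas = np.linspace(-1.0, 1.0, 7)
gaps = [abs(h(t + np.pi / 2, W) - h(t, W)) for t in thetas]
print(max(gaps))  # at the level of rounding error
```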

It is often convenient to rewrite the function $h_{k}(\theta)$ in Eq. 17 in a polynomial form. Consider the change of variables $\theta=\arctan(x)$. Then maximization of $h_{k}(\theta)$ on $[-\pi/4,\pi/4]$ is equivalent to maximization of

[TABLE]

on $[-1,1]$ . After the change of variables, since $x=\tan\theta$ , we have

[TABLE]

Hence, the cost function is a rational function in $x$

[TABLE]

where $\rho(x)$ is a polynomial of degree $2d$ (note that we dropped the index $k$ for simplicity).

In [8, eqn. (22)-(23)] it was shown that finding critical points of $\tau_{k}(x)$ can be reduced to finding roots of a quadratic polynomial if $d=3$ or a quartic polynomial if $d=4$ . We do not provide these expressions in this section, but their generalizations can be found in Section 6.2.

4.3 Derivatives for matrices and third-order tensors

We recall the expressions for derivatives of the cost function (12) in the case $d\in\{2,3\}$ from [8], but give proofs for completeness.

Lemma 4.2 ([8, eqn. (22)-(23)]).

In the notation of Section 4.2, the derivatives of $h_{k}$ have the following form.

• For $d=3$:

[TABLE]

• For $d=2$:

[TABLE]

The expressions for $h'_{k}(0)$ and $h''_{k}(0)$ can be found by replacing $\boldsymbol{\mathcal{T}}$ with $\boldsymbol{\mathcal{W}}$ in the above expressions.

Proof 4.3.

From (13) we have

[TABLE]

where $\Lambda$ is defined in Eq. 13.

After straightforward differentiation, we have that for $d=3$

[TABLE]

and for $d=2$

[TABLE]

The equation for $h''_{k}(\theta)$ follows by substitution.

4.4 The general case ( $m>1$ )

In the general case, the cost function in problem Eq. 1 is

[TABLE]

where $f^{(\ell)}(\boldsymbol{Q})=\|\operatorname{diag}\{\boldsymbol{\mathcal{W}}^{(\ell)}\}\|^{2}$ for any $1\leq\ell\leq m$.

By linearity, the cost function $h_{k}(\theta)$ in Eq. 4 in every iteration can be conveniently written as

[TABLE]

where

[TABLE]

and $\boldsymbol{\mathcal{T}}^{(\ell)}(\theta)$ is defined in the same way as $\boldsymbol{\mathcal{T}}$ in Section 4.2. Therefore, the derivatives of $h_{k}^{(\ell)}$ can be obtained in the same way as in Section 4.3.

Finally, as in Section 4.2, we can use a change of variables

[TABLE]

which leads to

[TABLE]

where, for $1\leq\ell\leq m$ ,

[TABLE]

5 Global convergence of Jacobi-G algorithm for symmetric low-order tensors

5.1 Łojasiewicz gradient inequality

In this subsection, we recall some important results about the convergence of iterative algorithms. The discrete-time analogue of the classical Łojasiewicz theorem was proposed in [1], and makes it possible to prove the convergence of many algorithms [26, 27]. In [25], to prove the convergence of projected line-search methods on the real-algebraic variety of real $m\times n$ matrices of rank at most $k$, the optimization problem

[TABLE]

on a closed subset $\mathcal{M}\subseteq\mathbb{R}^{n}$ was considered. Suppose that the tangent cone $T_{x}\mathcal{M}$ at $x\in\mathcal{M}$ is a linear space. Let $\operatorname{Proj}\nabla f(x)$ be the projection of the Euclidean gradient $\nabla f(x)$ onto the tangent space at $x$. We first introduce the definition of an analytic submanifold in $\mathbb{R}^{n}$.

Definition 5.1 ([21, Def. 2.7.1]).

A set $\mathcal{M}\subseteq\mathbb{R}^{n}$ is called an $m$ -dimensional real analytic submanifold if, for each $p\in\mathcal{M}$ , there exists an open subset $\mathcal{U}\subseteq\mathbb{R}^{m}$ and a real analytic function $f:\mathcal{U}\rightarrow\mathbb{R}^{n}$ which maps open subsets of $\mathcal{U}$ onto relatively open subsets of $\mathcal{M}$ and which is such that

[TABLE]

where $\boldsymbol{J}_{f}(u)$ is the Jacobian matrix of $f$ at $u$.

The following results were proved in [25].

Lemma 5.2.

Let $\mathcal{M}\subseteq\mathbb{R}^{n}$ be an analytic submanifold. Then any point $x\in\mathcal{M}$ satisfies a Łojasiewicz inequality for $\operatorname{Proj}\nabla f(x)$, that is, there exist $\delta>0$, $\sigma>0$ and $\zeta\in(0,1/2]$ such that, for all $y\in\mathcal{M}$ with $\|y-x\|<\delta$, it holds that

[TABLE]

Theorem 5.3 ([25, Theorem 2.3]).

*Let $\mathcal{M}\subseteq\mathbb{R}^{n}$ be an analytic submanifold and $\{x_{k}:k\in\mathbb{N}\}\subset\mathcal{M}$ be a sequence. Suppose that $f$ is real analytic and, for large enough $k$ ,

(i) there exists $\sigma>0$ such that*

[TABLE]

*(ii) $\operatorname{Proj}\nabla f(x_{k})=0$ implies that $x_{k+1}=x_{k}$ .

Then any accumulation point of $\{x_{k}:k\in\mathbb{N}\}\subseteq\mathcal{M}$ is the only limit point. *

Now we apply Theorem 5.3 to the compact orthogonal group $\mathscr{O}_{n}\subset\mathbb{R}^{n\times n}$ and obtain Corollary 5.4, which will allow us to prove the global convergence of the Jacobi-G algorithm in Section 5.2.

Corollary 5.4.

*Let $f$ be real analytic in Algorithm 1. Suppose that, for large enough $k$ ,

(i) there exists $\sigma>0$ such that*

[TABLE]

*(ii) $\operatorname{Proj}\nabla f(\boldsymbol{Q}_{k-1})=0$ implies that $\boldsymbol{Q}_{k}=\boldsymbol{Q}_{k-1}$ .

Then the iterations $\{\boldsymbol{Q}_{k}:k\in\mathbb{N}\}$ converge to a point $\boldsymbol{Q}_{*}\in\mathscr{O}_{n}$ . *

Remark 5.5.

*Under the same assumptions as in Corollary 5.4, Theorem 3.4 tells us that Algorithm 2 converges to a stationary point. *
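As a numerical illustration of the quantity $\operatorname{Proj}\nabla f$ used in Corollary 5.4, the following Python/NumPy sketch projects a Euclidean gradient onto the tangent space of $\mathscr{O}_{n}$ at a point $\boldsymbol{Q}$ . It relies on the standard fact that this tangent space is $\{\boldsymbol{Q}\boldsymbol{\Omega}:\boldsymbol{\Omega}^{T}=-\boldsymbol{\Omega}\}$ . The function name `proj_grad` and the random test data are assumptions made for illustration only and are not taken from the paper.

```python
import numpy as np

def proj_grad(Q, G):
    """Orthogonal projection of a Euclidean gradient G onto the tangent
    space {Q @ Omega : Omega skew-symmetric} of the orthogonal group at Q."""
    S = Q.T @ G
    return Q @ (S - S.T) / 2.0

rng = np.random.default_rng(0)
n = 4
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # some orthogonal matrix
G = rng.standard_normal((n, n))                   # stand-in for a Euclidean gradient
P = proj_grad(Q, G)

# P is tangent at Q: Q^T P must be skew-symmetric.
QtP = Q.T @ P
skew_defect = np.linalg.norm(QtP + QtP.T)

# The residual G - P is orthogonal to an arbitrary tangent direction Q @ Omega.
B = rng.standard_normal((n, n))
Omega = B - B.T
residual_inner = np.sum((G - P) * (Q @ Omega))
```

Both `skew_defect` and `residual_inner` should vanish up to rounding error, which is what characterizes an orthogonal projection onto the tangent space.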

5.2 Global convergence of Jacobi-G algorithm for matrices and 3rd-order tensors

In this section, we consider the case $d\in\{2,3\}$ , that is, one of the following options:

•

for a set of 3rd-order symmetric tensors $\{\boldsymbol{\mathcal{A}}^{(\ell)}:1\leq\ell\leq m\}\subseteq\mathbb{R}^{n\times n\times n}$ , the cost function is

[TABLE]

•

for a set $\{\boldsymbol{A}^{(\ell)}:1\leq\ell\leq m\}\subseteq\mathbb{R}^{n\times n}$ of symmetric matrices, the cost function is

[TABLE]

Theorem 5.6.

*For the cost function Eq. 20 or Eq. 21, Algorithm 2 converges to a stationary point of $f$ in $\mathscr{O}_{n}$ , for any starting point $\boldsymbol{Q}_{0}$ . *

Remark 5.7.

*In the case $m=1$ and $d=3$ , this is the Jacobi-G algorithm for orthogonal diagonalization of 3rd order symmetric tensors. Theorem 5.6 shows the global convergence of this algorithm. *

Before giving the proof of Theorem 5.6, we formulate several lemmas.

Lemma 5.8.

In the case $d\in\{2,3\}$ , for the cost function $\tau_{k}(x)$ defined in Eq. 18, the following identities hold true

[TABLE]

Proof 5.9.

First, by linearity of the expressions Eq. 22 and Eq. 23, and from Section 4.4 we can prove the identities only for the case of a single tensor or matrix (i.e., the cost function Eq. 12). Second, the equality Eq. 23 follows from Eq. 22 by straightforward differentiation and the fact that

[TABLE]

Hence, we are left to prove Eq. 22 for the cost function Eq. 12.

Recall the notation of Section 4.2, and consider the case $d=3$ . Using the substitution Eq. 14, and since the rotation affects only the first two elements on the diagonal, we get that

[TABLE]

where the last equality follows from Lemma 4.2.

The case $d=2$ is analogous: we have

[TABLE]

*where the last equality follows again from Lemma 4.2. *

Lemma 5.10.

In each iteration of Algorithm 2 for the cost function Eq. 20 or Eq. 21, the following inequality holds true

[TABLE]

*for any $k\in\mathbb{N}$ . *

Proof 5.11.

At each iteration, from Eq. 9 and Eq. 10 we have

[TABLE]

Next, for an optimal angle $\theta_{*}=\theta^{*}_{k}$ , we have

[TABLE]

Note that the tangent $x_{*}=\tan(\theta_{*})$ must satisfy the equation $\tau^{\prime}_{k}(x_{*})=0$ .

If $x_{*}=0$ or $\pm 1$ , then $h^{\prime}_{k}(0)=0$ from Eq. 23, hence $\|\operatorname{Proj}\nabla f(\boldsymbol{Q}_{k-1})\|=0$ from Eq. 24 and the result is obvious. Consider the case $0<|x_{*}|<1$ . Then from Eq. 23 we get

[TABLE]

and thus

[TABLE]

by substituting Eq. 26 into Eq. 22. Finally, by combining Eqs. 24 to 27, we get

[TABLE]

Proof 5.12 (Proof of Theorem 5.6).

Lemma 5.10 guarantees that condition Eq. 19 in Corollary 5.4 holds true. Since the cost function is analytic, by Corollary 5.4, the sequence $\boldsymbol{Q}_{k}$ converges to a point $\boldsymbol{Q}_{*}$ . Finally, by Theorem 3.4, $\boldsymbol{Q}_{*}$ is a stationary point of $f$ in Eq. 20 or Eq. 21.

6 Jacobi-PC algorithm and its global convergence

The Jacobi-G algorithm has several disadvantages: the convergence for $4$ th-order tensors is currently unknown, and the parameter $\varepsilon$ needs to be chosen in a proper way. In this section, we propose a new Jacobi-based algorithm, which is inspired by proximal algorithms in convex [24] and nonconvex [Bolte14:Proximal] optimization.

6.1 Jacobi-PC algorithm and its global convergence

Suppose that we are given a twice continuously differentiable function $f:\mathscr{SO}_{n}\to\mathbb{R}$ , such that

[TABLE]

for any $\boldsymbol{Q}\in\mathscr{SO}_{n}$ and $1\leq i<j\leq n$ (i.e., it is $\pi/2$ -periodic along any geodesic). Then we propose the Jacobi-PC algorithm (Jacobi-C algorithm with a proximal term) in Algorithm 5.
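The $\pi/2$ -periodicity can be checked numerically in the simplest setting. The sketch below assumes the cost $f(\boldsymbol{Q})=\|\operatorname{diag}(\boldsymbol{Q}^{T}\boldsymbol{A}\boldsymbol{Q})\|^{2}$ for a single symmetric matrix $\boldsymbol{A}$ and one common sign convention for the Givens rotation; these conventions, the function names `givens` and `cost`, and the random data are assumptions for illustration and may differ from the paper's definitions.

```python
import numpy as np

def givens(n, i, j, theta):
    """n x n Givens rotation acting in the (i, j) coordinate plane."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = c
    G[j, j] = c
    G[i, j] = -s
    G[j, i] = s
    return G

def cost(A, Q):
    """Squared norm of the diagonal of Q^T A Q."""
    W = Q.T @ A @ Q
    return float(np.sum(np.diag(W) ** 2))

rng = np.random.default_rng(1)
n = 5
B = rng.standard_normal((n, n))
A = (B + B.T) / 2.0                                # random symmetric matrix
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix

theta = 0.3
f0 = cost(A, Q @ givens(n, 1, 3, theta))
f1 = cost(A, Q @ givens(n, 1, 3, theta + np.pi / 2))
```

A rotation by $\pi/2$ in the $(i,j)$ plane only swaps the two corresponding diagonal entries of $\boldsymbol{Q}^{T}\boldsymbol{A}\boldsymbol{Q}$ , so `f0` and `f1` agree up to rounding error.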

Remark 6.1.

*The periodicity condition Eq. 28 is not necessary for the global convergence of the algorithm in Theorem 6.2, but we add it due to its presence in the orthogonal tensor diagonalization problem. If the condition Eq. 28 does not hold, another proximal term $\gamma(\theta)$ may be needed. Finally, pair selection rules other than Eq. 6 can be used. *

Theorem 6.2.

*The sequence produced by Algorithm 5 converges to a stationary point $\boldsymbol{Q}_{*}\in\mathscr{O}_{n}$ for any starting point $\boldsymbol{Q}_{0}$ . *

Proof 6.3.

We first prove the convergence. Since

[TABLE]

we get that

[TABLE]

Note that $f(\boldsymbol{Q}_{k})$ is bounded since $\mathscr{O}_{n}$ is compact. Then $f(\boldsymbol{Q}_{k})\rightarrow c<+\infty$ and thus

[TABLE]

By Eq. 29, we have that $\gamma(\theta^{*}_{k})\rightarrow 0.$ Note that $\gamma(\theta)\geq 8|\theta|^{2}/\pi^{2}$ for $\theta\in[-\pi/4,\pi/4]$ . Then $\theta^{*}_{k}\to 0$ and thus there exists $\boldsymbol{Q}_{*}\in\mathscr{O}_{n}$ such that $\boldsymbol{Q}_{k}\to\boldsymbol{Q}_{*}$ .

Next we prove that $\tilde{h}^{\prime}_{k}(0)\rightarrow 0$ , that is $\Lambda_{i_{k},j_{k}}(\boldsymbol{Q}_{k-1})\rightarrow 0$ . Define

[TABLE]

for $\theta\in\mathbb{R}$ , $\boldsymbol{Q}\in\mathscr{O}_{n}$ and $1\leq i<j\leq n$ . Let

[TABLE]

Then $M_{1}<+\infty$ since $f$ is $C^{2}$ smooth, $\bar{h}$ is periodic with respect to $\theta$ and $\mathscr{O}_{n}$ is compact. Therefore, we have that

[TABLE]

for any $\boldsymbol{Q}_{k-1}\in\mathscr{O}_{n}$ , $\theta^{*}_{k}\in\mathbb{R}$ and $1\leq i_{k}<j_{k}\leq n$ , and thus $\tilde{h}^{\prime}_{k}(0)\rightarrow 0$ .

Finally we prove that $\boldsymbol{Q}_{*}\in\mathscr{O}_{n}$ is a stationary point of $f$ , that is $\boldsymbol{\Lambda}(\boldsymbol{Q}_{k-1})\rightarrow 0$ . We have proved that $\Lambda_{i_{k},j_{k}}(\boldsymbol{Q}_{k-1})\rightarrow 0$ in the above part. Now we prove that the other entries of $\boldsymbol{\Lambda}(\boldsymbol{Q}_{k-1})$ also converge to 0. For simplicity, take $(i_{k},j_{k})=(1,2)$ and $(i_{k+1},j_{k+1})=(1,3)$ for instance. It is enough to prove that $\Lambda_{1,2}(\boldsymbol{Q}_{k})\rightarrow 0$ . In fact, if we define

[TABLE]

then $\Lambda_{1,2}(\boldsymbol{Q}_{k})=\phi(\theta^{*}_{k})$ and $\Lambda_{1,2}(\boldsymbol{Q}_{k-1})=\phi(0)$ . Define

[TABLE]

for $\theta\in\mathbb{R}$ , $\boldsymbol{Q}\in\mathscr{O}_{n}$ and $1\leq i<j\leq n$ . Let

[TABLE]

Then $M_{2}<+\infty$ since $\bar{\phi}$ is smooth and periodic. Therefore

[TABLE]

*and thus $\Lambda_{1,2}(\boldsymbol{Q}_{k})\rightarrow 0$ . *

6.2 Elementary rotations for orthogonal tensor diagonalization

The cost function Eq. 16 in simultaneous orthogonal tensor diagonalization has the property Eq. 28, hence the Jacobi-PC algorithm is guaranteed to converge. Moreover, it allows for finding the update using an algebraic algorithm in the cases $d=3,4$ .

Let us show how to find $\theta^{*}_{k}$ in every iteration of Algorithm 5. Let $\widetilde{\tau}_{k}(x)=\widetilde{h}_{k}(\arctan x)$ be as in Section 4.2. Then we obtain that

[TABLE]

where $\rho(x)$ is the polynomial defined in Eq. 15. Then $\widetilde{\tau}^{\prime}_{k}(x)=0$ is equivalent to $\omega(x)=0$ , where

[TABLE]

is a polynomial of degree $2d$ .

Note that from Section 4.2 and $\frac{\pi}{2}$ -periodicity of $\gamma$ , we have that $\widetilde{h}_{k}(\theta)=\widetilde{h}_{k}(\theta+\pi/2)$ for any $\theta\in\mathbb{R}$ , hence $\widetilde{\tau}_{k}(x)=\widetilde{\tau}_{k}(-1/x)$ . We now use this property to describe the roots of $\omega(x)$ : the equation $\omega(x)=0$ has the same solutions as $\omega(-1/x)=0$ , except for possible roots at the origin. Let $\xi=x-1/x$ . Then

[TABLE]

for some $0\leq d_{1},d_{2}\leq d$ . Now we have that $\omega(x)=0$ if and only if

[TABLE]

except for possible roots at the origin. If the roots $\xi_{j}$ can be computed algebraically, then the root pairs $(x_{j},-1/x_{j})$ can be obtained by solving the quadratic equations $x^{2}-\xi_{j}x-1=0$ .
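The last step, recovering a pair $(x,-1/x)$ from a root $\xi$ , can be sketched in a few lines of Python. The helper names below are hypothetical, and the rule of keeping the root with $|x|\leq 1$ (a rotation angle in $[-\pi/4,\pi/4]$ ) is an assumption made here for illustration; by the $\pi/2$ -periodicity both roots of a pair give the same cost value.

```python
import math

def tangents_from_xi(xi):
    """Return the two real roots of x**2 - xi*x - 1 = 0.

    The discriminant xi**2 + 4 is always positive, so both roots are real.
    Their product is -1, hence they form a pair (x, -1/x).
    """
    s = math.sqrt(xi * xi + 4.0)
    return (xi + s) / 2.0, (xi - s) / 2.0

def small_angle_tangent(xi):
    """Pick the root with |x| <= 1 (assumed selection rule)."""
    x1, x2 = tangents_from_xi(xi)
    return x1 if abs(x1) <= abs(x2) else x2

# Example: xi = 1.5 gives the pair (2, -1/2), and both satisfy x - 1/x = 1.5.
x1, x2 = tangents_from_xi(1.5)
x_small = small_angle_tangent(1.5)
```

For large $|\xi|$ the root of smaller magnitude is computed by subtracting nearly equal numbers; a more careful implementation could obtain it as $-1$ divided by the other root.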

Now we restrict ourselves to the case of a single tensor (i.e. the cost function Eq. 12). In fact, if $\boldsymbol{\mathcal{A}}$ is of 3rd or 4th order, it can be shown that the equation $\Omega(\xi)=0$ can be solved algebraically and thus $(x_{j},-1/x_{j})$ can be determined. The following Lemma 6.4 provides the specific form of $\Omega(\xi)$ in these cases, and is a direct generalisation of the results for the ordinary Jacobi algorithm in [8, Appendix].

Lemma 6.4.

(i) Let $\boldsymbol{\mathcal{A}}\in\mathbb{R}^{n\times n\times n}$ be a 3rd order symmetric tensor and

[TABLE]

Then

[TABLE]

(ii) Let $\boldsymbol{\mathcal{A}}\in\mathbb{R}^{n\times n\times n\times n}$ be a 4th order symmetric tensor and

[TABLE]

Then

[TABLE]

Note that if we set $\delta_{0}=0$ , we obtain exactly the expressions from [8, Appendix].

Remark 6.5.

*The expressions for $\Omega(\xi)$ in the case of the simultaneous orthogonal diagonalization problem can also be found easily in the same way as in Lemma 6.4, by exploiting the additivity of the corresponding expressions in Section 4.4. *

7 Numerical results

In this section, we present numerical experiments in order to compare the presented algorithms in the case of orthogonal diagonalization problems for symmetric tensors. The algorithms were implemented in MATLAB and the codes are available on request.

The setup of all the experiments is as follows:

•

A diagonal tensor $\boldsymbol{\mathcal{D}}$ is chosen. (For convenience, we choose the tensors such that $\|\boldsymbol{\mathcal{D}}\|=1$ .)

•

A random rotation matrix $\boldsymbol{Q}$ is applied to obtain

[TABLE]

•

The test tensor is constructed as $\boldsymbol{\mathcal{A}}=\boldsymbol{\mathcal{A}}_{0}+\boldsymbol{\mathcal{E}}$ , where $\boldsymbol{\mathcal{E}}$ is the symmetrization of a tensor containing realizations of i.i.d. Gaussian noise with variance $\sigma^{2}$ .
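The three steps above can be sketched in Python/NumPy as follows. This is not the authors' MATLAB code: the function names, the QR-based sampling of the rotation, and the convention of multiplying every mode by the same matrix $\boldsymbol{Q}$ (rather than its transpose) are assumptions made for illustration.

```python
import itertools
import numpy as np

def rotate_all_modes(T, M):
    """Contract every mode of T with M: out[i1..id] = sum_j T[j1..jd] * prod_k M[ik, jk]."""
    for mode in range(T.ndim):
        T = np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)
    return T

def symmetrize(T):
    """Average T over all permutations of its indices."""
    perms = list(itertools.permutations(range(T.ndim)))
    return sum(np.transpose(T, p) for p in perms) / len(perms)

def random_rotation(n, rng):
    """Random orthogonal matrix with determinant +1 (QR of a Gaussian matrix)."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    Q = Q * np.sign(np.diag(R))      # normalize the column signs
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]           # flip one column to get det = +1
    return Q

def make_test_tensor(diag_values, d, sigma, rng):
    """Return (A, Q): a noisy symmetric order-d tensor and the rotation used."""
    n = len(diag_values)
    D = np.zeros((n,) * d)
    D[(np.arange(n),) * d] = np.asarray(diag_values, dtype=float)
    D /= np.linalg.norm(D)           # normalize so that ||D|| = 1
    Q = random_rotation(n, rng)
    A0 = rotate_all_modes(D, Q)      # rotated diagonal tensor
    E = symmetrize(sigma * rng.standard_normal((n,) * d))
    return A0 + E, Q

A, Q = make_test_tensor(np.ones(5), d=3, sigma=1e-2, rng=np.random.default_rng(0))
```

With `sigma=0` the returned tensor is exactly orthogonally diagonalizable: applying `rotate_all_modes(A, Q.T)` recovers the normalized diagonal tensor.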

To each test example we apply the following algorithms:

•

Jacobi-C: Algorithm 1 with the order of pairs Eq. 6.

•

Jacobi-G-max: Algorithm 3.

•

Jacobi-G: Algorithm 2 for various values of $\varepsilon$ .

•

Jacobi-PC: Algorithm 5 for various values of $\delta_{0}$ (shortened to “Jacobi-P” in the plots).

The stopping criterion is chosen to be the maximum number of iterations.

In each of the plots, we plot $\|\boldsymbol{\mathcal{A}}\|^{2}-f(\boldsymbol{Q}_{k})$ , which is exactly the squared norm of the off-diagonal elements. In all the plots, the markers correspond to the places where the new sweep starts.
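The identity behind the plotted quantity can be checked numerically. The sketch below assumes that $f(\boldsymbol{Q})$ is the squared norm of the diagonal of the tensor obtained by contracting every mode of $\boldsymbol{\mathcal{A}}$ with $\boldsymbol{Q}^{T}$ ; this convention and all function names are assumptions for illustration. Since an orthogonal change of basis preserves the Frobenius norm, $\|\boldsymbol{\mathcal{A}}\|^{2}-f(\boldsymbol{Q})$ equals the sum of squares of the off-diagonal entries of the rotated tensor.

```python
import numpy as np

def rotate_all_modes(T, M):
    """Contract every mode of T with M: out[i1..id] = sum_j T[j1..jd] * prod_k M[ik, jk]."""
    for mode in range(T.ndim):
        T = np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)
    return T

def diag_entries(T):
    """Vector of diagonal entries T[i, ..., i]."""
    n = T.shape[0]
    return T[(np.arange(n),) * T.ndim]

def offdiag_sq_norm(A, Q):
    """||A||^2 - f(Q), with f(Q) the squared norm of the rotated diagonal."""
    W = rotate_all_modes(A, Q.T)
    f = np.sum(diag_entries(W) ** 2)
    return float(np.sum(A ** 2) - f)

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n, n))   # symmetry is not needed for this identity
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Direct computation: all squared entries of W minus the squared diagonal ones.
W = rotate_all_modes(A, Q.T)
direct = float(np.sum(W ** 2) - np.sum(diag_entries(W) ** 2))
gap = offdiag_sq_norm(A, Q)
```

The two values `gap` and `direct` agree up to rounding error.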

7.1 Test 1: equal values on the diagonal

In this subsection, we consider $10\times 10\times 10$ and $10\times 10\times 10\times 10$ tensors where the diagonal values are given by

[TABLE]

We plot the results in Figs. 1 and 2.

As we see in Figs. 1 and 2, in all the examples all the methods converge to the same cost function value. We observe that the behavior of the Jacobi-PC algorithm is not too different from the behavior of the Jacobi-C algorithm.

The convergence of Jacobi-G-max is the fastest, but the difference is marginal. Also, the Jacobi-G-max is typically slower in the beginning, but accelerates when the algorithm is closer to the local maximum. Finally, if $\varepsilon$ is small, the behavior of Jacobi-G is almost indistinguishable from Jacobi-C, as pointed out in Remark 3.5.

7.2 Test 2: different values on the diagonal

In this subsection, we consider $10\times 10\times 10$ and $10\times 10\times 10\times 10$ tensors where the diagonal values are given by

[TABLE]

We plot the results in Figs. 3 and 4.

In Figs. 3 and 4 we see that this scenario is less favorable for Jacobi-PC: if the value of $\delta_{0}$ is too high, then it slows down the convergence of the algorithm. We also see that typically the Jacobi-G algorithms are the fastest, but the difference with Jacobi-C is again not significant. Also for small values of $\varepsilon$ , the behavior of Jacobi-G resembles the behavior of Jacobi-C.

7.3 High noise and local minima

In this subsection, we consider the case of high noise. We repeat only the $4$ th-order experiments (for a single tensor) from Section 7.2 except with $\sigma=10^{-1}$ . We take two different realizations of $\boldsymbol{\mathcal{E}}$ and plot the results in Fig. 5.

In Fig. 5, we see that the behavior of the algorithms is more erratic, and they may converge to different cost function values. This is explained by the non-convexity of the problem and the presence of different local minima, which is typical for tensor approximation problems [18]. Next, the Jacobi-G-max algorithm here has the worst performance. This is also explained well by the non-convexity of the problem, because the compatibility of the Jacobi rotation with the gradient (Eq. 9) may not be optimal.

7.4 Simultaneous diagonalization

We conclude the section with a small example of simultaneous diagonalization. We take a 4th-order $10\times 10\times 10\times 10$ tensor $\boldsymbol{\mathcal{A}}$ generated as in Sections 7.1 to 7.2 (for the noise level $\sigma=10^{-2}$ ), and consider its $10$ slices $\boldsymbol{\mathcal{B}}^{(1)},\ldots,\boldsymbol{\mathcal{B}}^{(10)}\in\mathbb{R}^{n\times n\times n}$ along the last dimension, i.e.

[TABLE]

Then, we perform the joint diagonalization of tensors $\boldsymbol{\mathcal{B}}^{(1)},\ldots,\boldsymbol{\mathcal{B}}^{(m)}$ (for $m=10$ ) and run the same algorithms as in the previous experiments, but for the cost function in the case of simultaneous diagonalization. The results are plotted in Fig. 6.
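Extracting the slices is a one-line operation in array notation. The sketch below uses a small random symmetric tensor as a stand-in for $\boldsymbol{\mathcal{A}}$ (the size $n=3$ and the helper names are assumptions for illustration) and checks that each slice along the last dimension is itself a symmetric 3rd-order tensor, so that it is a valid input for the joint diagonalization.

```python
import itertools
import numpy as np

def symmetrize(T):
    """Average T over all permutations of its indices."""
    perms = list(itertools.permutations(range(T.ndim)))
    return sum(np.transpose(T, p) for p in perms) / len(perms)

def is_symmetric(T):
    """True if T is invariant under every permutation of its indices."""
    return all(np.allclose(T, np.transpose(T, p))
               for p in itertools.permutations(range(T.ndim)))

rng = np.random.default_rng(3)
n = 3
A = symmetrize(rng.standard_normal((n, n, n, n)))   # stand-in symmetric 4th-order tensor

# Slices along the last dimension: B[l][i, j, k] = A[i, j, k, l].
B = [A[:, :, :, l] for l in range(n)]
all_slices_symmetric = all(is_symmetric(Bl) for Bl in B)
```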

The results in Fig. 6 exhibit a similar behavior to the results in Sections 7.1 to 7.2. When comparing with the results of single tensor diagonalization for the same tensors (see Fig. 2 and Fig. 4, subfigures (b)), we can see that the results are comparable, and the simultaneous diagonalization may even yield a slightly higher cost function value. However, the cost function in this case is different because the tensor is not rotated along the last mode.

8 Conclusions

We showed that by modifying the well-known Jacobi CoM algorithm [6, 8] for the orthogonal symmetric tensor diagonalization problem, it is possible to prove its global convergence. The global convergence of the Jacobi-G algorithm [17] is proved for the case of simultaneous orthogonal symmetric matrix (or 3rd-order tensor) diagonalization. The global convergence for the 4th-order case is still unknown. Our new proximal-type algorithm Jacobi-PC is globally convergent for a wide range of optimization problems, and shows good performance in the numerical experiments.

Bibliography

[1] P. A. Absil, R. Mahony, and B. Andrews, Convergence of the iterates of descent methods for analytic cost functions, SIAM Journal on Optimization, 16 (2005), pp. 531–547.
[2] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds, Princeton University Press, Princeton, NJ, 2008.
[3] E. Begovic and D. Kressner, Structure-preserving low multilinear rank approximation of antisymmetric tensors, ArXiv e-prints, (2016), https://arxiv.org/abs/1603.05010.
[4] J. Cardoso and A. Souloumiac, Blind beamforming for non-gaussian signals, IEE Proceedings F (Radar and Signal Processing), 6 (1993), pp. 362–370.
[5] A. Cichocki, D. Mandic, L. D. Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. Phan, Tensor decompositions for signal processing applications: From two-way to multiway component analysis, IEEE Signal Processing Magazine, 32 (2015), pp. 145–163.
[6] P. Comon, Independent Component Analysis, in Higher Order Statistics, J.-L. Lacoume, ed., Elsevier, Amsterdam, London, 1992, pp. 29–38.
[7] P. Comon, Independent component analysis, a new concept?, Signal Processing, 36 (1994), pp. 287–314.
[8] P. Comon, Tensor Diagonalization, A useful Tool in Signal Processing, in 10th IFAC Symposium on System Identification (IFAC-SYSID), M. Blanke and T. Soderstrom, eds., vol. 1, Copenhagen, Denmark, July 1994, IEEE, pp. 77–82.
