On the Rotational Invariant $L_1$-Norm PCA

Sebastian Neumayer; Max Nimmer; Simon Setzer; Gabriele Steidl

arXiv:1902.03840·math.NA·May 27, 2019


TL;DR

This paper reinterprets the rotational invariant $L_1$-norm PCA as a gradient-based optimization on Grassmannian manifolds, proving convergence to critical points using the Kurdyka-Łojasiewicz property.

Contribution

It provides a novel interpretation of robust $L_1$-norm PCA as a gradient descent on Grassmannian manifolds and establishes convergence results for the entire iterative process.

Findings

01 Reinterpreted $L_1$-norm PCA as a gradient descent on Grassmannian manifolds.

02 Proved convergence of all iterates to a critical point using the Kurdyka-Łojasiewicz property.

03 Showed that the algorithm of Ding et al., the conditional gradient algorithm, and a gradient descent algorithm on Grassmannians coincide.

Abstract

Principal component analysis (PCA) is a powerful tool for dimensionality reduction. Unfortunately, it is sensitive to outliers, so that various robust PCA variants were proposed in the literature. Among them the so-called rotational invariant $L_{1}$-norm PCA is rather popular. In this paper, we reinterpret this robust method as a conditional gradient algorithm and show moreover that it coincides with a gradient descent algorithm on Grassmannian manifolds. Based on this point of view, we prove for the first time convergence of the whole series of iterates to a critical point using the Kurdyka-Łojasiewicz property of the energy functional.


Equations

$$(\hat{A},\hat{b})\in\operatorname*{arg\,min}_{A\in\mathbb{R}^{d,K},\,b\in\mathbb{R}^{d}}\sum_{i=1}^{N}\min_{t\in\mathbb{R}^{K}}\|At+b-x_{i}\|^{2},$$

$$\hat{A}\in\operatorname*{arg\,min}_{A\in\mathbb{R}^{d,K},\,A^{\mathrm{T}}A=I_{K}}\sum_{i=1}^{N}\|P_{A}^{\perp}y_{i}\|^{2},$$

$$P_{A}^{\perp}\coloneqq I_{d}-AA^{\mathrm{T}}$$

$$(\hat{A},\hat{b})\in\operatorname*{arg\,min}_{A\in\mathbb{R}^{d,K},\,b\in\mathbb{R}^{d}}\sum_{i=1}^{N}\min_{t\in\mathbb{R}^{K}}\|At+b-x_{i}\|$$

$$\hat{A}\in\operatorname*{arg\,min}_{A\in\mathbb{R}^{d,K},\,A^{\mathrm{T}}A=I_{K}}\sum_{i=1}^{N}\|P_{A}^{\perp}y_{i}\|.$$

$$S_{d,K}\coloneqq\{A\in\mathbb{R}^{d,K}\colon A^{\mathrm{T}}A=I_{K}\}.$$

$$T_{A}S_{d,K}=\{H\in\mathbb{R}^{d,K}\colon H=AX+A_{\perp}Z,\ X\in\mathbb{R}^{K,K}\ \text{skew symmetric},\ Z\in\mathbb{R}^{d-K,K}\},$$

$$T_{[A]}G_{d,K}\coloneqq\{A_{\perp}Z\colon Z\in\mathbb{R}^{d-K,K}\}.$$

$$\langle H_{1},H_{2}\rangle_{[A]}\coloneqq\langle H_{1},H_{2}\rangle_{A}=\operatorname{tr}\Bigl(H_{1}^{\mathrm{T}}\bigl(I_{d}-\tfrac{1}{2}AA^{\mathrm{T}}\bigr)H_{2}\Bigr)=\operatorname{tr}\bigl(H_{1}^{\mathrm{T}}H_{2}\bigr)=\langle H_{1},H_{2}\rangle_{F}.$$

$$d_{G_{d,K}}\bigl([A_{1}],[A_{2}]\bigr)\coloneqq\bigl\|A_{1}A_{1}^{\mathrm{T}}-A_{2}A_{2}^{\mathrm{T}}\bigr\|_{2},$$

$$\Pi_{S_{d,K}}(M)=\operatorname*{arg\,min}_{O\in S_{d,K}}\|M-O\|^{2}=\operatorname*{arg\,max}_{O\in S_{d,K}}\langle O,M\rangle.$$

$$\Pi_{S_{d,K}}(M)=\operatorname{Polar}(M).$$

$$E(A)=\sum_{i=1}^{N}E_{i}(A)\coloneqq\sum_{i=1}^{N}\|P_{A}^{\perp}y_{i}\|=\sum_{i=1}^{N}\bigl\|(I_{d}-AA^{\mathrm{T}})y_{i}\bigr\|,$$

$$F(A)=\sum_{i=1}^{N}F_{i}(A)\coloneqq\sum_{i=1}^{N}\sqrt{y_{i}^{\mathrm{T}}P_{A}^{\perp}y_{i}}=\sum_{i=1}^{N}\sqrt{\|y_{i}\|^{2}-\|A^{\mathrm{T}}y_{i}\|^{2}}.$$

$$\mathcal{D}\coloneqq\bigcap_{i=1}^{N}\mathcal{D}_{i},\qquad\mathcal{D}_{i}\coloneqq\{A\in\mathbb{R}^{d,K}\colon\|y_{i}\|^{2}-\|A^{\mathrm{T}}y_{i}\|^{2}\geq 0\}$$

$$F(A)\coloneqq-\infty\quad\text{if}\quad A\notin\mathcal{D}.$$

$$\mathcal{A}\coloneqq\bigl\{A\in S_{d,K}\colon\|P_{A}^{\perp}y_{i}\|=0\ \text{for some}\ i\in\{1,\ldots,N\}\bigr\}$$

$$\begin{aligned}|E_{i}(A_{1})-E_{i}(A_{2})|&\leq\bigl\|A_{1}A_{1}^{\mathrm{T}}-A_{2}A_{2}^{\mathrm{T}}\bigr\|\,\|y_{i}\|\\&=\frac{1}{2}\bigl\|(A_{1}-A_{2})(A_{1}^{\mathrm{T}}+A_{2}^{\mathrm{T}})+(A_{1}+A_{2})(A_{1}^{\mathrm{T}}-A_{2}^{\mathrm{T}})\bigr\|\,\|y_{i}\|\\&\leq 2(\|A\|+\varepsilon)\|y_{i}\|\,\|A_{1}-A_{2}\|.\end{aligned}$$

$$\|y_{i}\|^{2}-\|A^{\mathrm{T}}y_{i}\|^{2}\geq 0$$

$$\begin{aligned}&\|y_{i}\|^{2}-\|(\lambda A_{1}+(1-\lambda)A_{2})^{\mathrm{T}}y_{i}\|^{2}\\&=\|y_{i}\|^{2}-\bigl(\lambda^{2}\|A_{1}^{\mathrm{T}}y_{i}\|^{2}+2\lambda(1-\lambda)\langle A_{1}^{\mathrm{T}}y_{i},A_{2}^{\mathrm{T}}y_{i}\rangle+(1-\lambda)^{2}\|A_{2}^{\mathrm{T}}y_{i}\|^{2}\bigr)\\&\geq\|y_{i}\|^{2}-\bigl(\lambda^{2}\|A_{1}^{\mathrm{T}}y_{i}\|^{2}+2\lambda(1-\lambda)\|A_{1}^{\mathrm{T}}y_{i}\|\,\|A_{2}^{\mathrm{T}}y_{i}\|+(1-\lambda)^{2}\|A_{2}^{\mathrm{T}}y_{i}\|^{2}\bigr)\\&\geq\|y_{i}\|^{2}-\bigl(\lambda^{2}\|y_{i}\|^{2}+2\lambda(1-\lambda)\|y_{i}\|^{2}+(1-\lambda)^{2}\|y_{i}\|^{2}\bigr)=0.\end{aligned}$$

$$F_{\varepsilon}(A)\coloneqq\sqrt{\|y_{i}\|^{2}-\|A^{\mathrm{T}}y_{i}\|^{2}+\varepsilon},$$

$$\nabla F_{\varepsilon}(A)=-\frac{1}{F_{\varepsilon}(A)}y_{i}y_{i}^{\mathrm{T}}A.$$

$$\nabla^{2}F_{\varepsilon}(A)[H]=-\frac{1}{F_{\varepsilon}(A)^{2}}\Bigl(y_{i}y_{i}^{\mathrm{T}}HF_{\varepsilon}(A)+\frac{1}{F_{\varepsilon}(A)}\langle y_{i}y_{i}^{\mathrm{T}}A,H\rangle y_{i}y_{i}^{\mathrm{T}}A\Bigr),$$

$$\bigl\langle\nabla^{2}F_{\varepsilon}(A)[H],H\bigr\rangle=-\frac{1}{F_{\varepsilon}(A)^{2}}\Bigl(F_{\varepsilon}(A)\|H^{\mathrm{T}}y_{i}\|^{2}+\frac{1}{F_{\varepsilon}(A)}\langle y_{i}y_{i}^{\mathrm{T}}A,H\rangle^{2}\Bigr)\leq 0.$$

$$F_{i}=\inf\{F_{\varepsilon}\colon\varepsilon>0\}$$

$$\partial F_{i}(A_{0})$$


Full text


On the Rotational Invariant $L_{1}$ -Norm PCA

Sebastian Neumayer (1), Max Nimmer (1, corresponding author), Simon Setzer (3), Gabriele Steidl (1, 2)

(1) Department of Mathematics, Technische Universität Kaiserslautern, Paul-Ehrlich-Str. 31, D-67663 Kaiserslautern, Germany, {neumayer,nimmer,steidl}@mathematik.uni-kl.de

(2) Fraunhofer ITWM, Fraunhofer-Platz 1, D-67663 Kaiserslautern, Germany

(3) Engineers Gate, London, United Kingdom

Abstract

Principal component analysis (PCA) is a powerful tool for dimensionality reduction. Unfortunately, it is sensitive to outliers, so that various robust PCA variants were proposed in the literature. One of the most frequently applied methods for high dimensional data reduction is the rotational invariant $L_{1}$ -norm PCA of Ding and coworkers. So far no convergence proof for this algorithm was available. The main topic of this paper is to fill this gap. We reinterpret this robust approach as a conditional gradient algorithm and show moreover that it coincides with a gradient descent algorithm on Grassmannian manifolds. Based on the latter point of view, we prove global convergence of the whole series of iterates to a critical point using the Kurdyka-Łojasiewicz property of the objective function, where we have to pay special attention to so-called anchor points, where the function is not differentiable.

Keywords: Principal component analysis, Dimensionality reduction, Robust subspace fitting, Conditional gradient algorithm, Frank-Wolfe algorithm, Optimization on Grassmannian manifolds

MSC: 58C05, 62H25, 65K10

1 Introduction

In exploratory data analysis, principal component analysis (PCA) [37] is still one of the most popular tools for dimensionality reduction. Given $N\geq d$ data points $x_{1},\ldots,x_{N}\in\mathbb{R}^{d}$, it finds a $K$-dimensional affine subspace $\{\hat{A}\,t+\hat{b}\colon t\in\mathbb{R}^{K}\}$, $1\leq K\leq d$, of $\mathbb{R}^{d}$ having smallest squared Euclidean distance from the data:

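$$(\hat{A},\hat{b})\in\operatorname*{arg\,min}_{A\in\mathbb{R}^{d,K},\,b\in\mathbb{R}^{d}}\sum_{i=1}^{N}\min_{t\in\mathbb{R}^{K}}\|At+b-x_{i}\|^{2}, \tag{1}$$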

where $\|\cdot\|$ denotes the Euclidean norm. While $\hat{b}$ in the above minimization problem is not unique, every minimizing affine subspace goes through the offset (bias) $\bar{b}\coloneqq\frac{1}{N}(x_{1}+\ldots+x_{N})$. Therefore, we restrict our attention to data points $y_{i}\coloneqq x_{i}-\bar{b}$, $i=1,\ldots,N$, and subspaces through the origin minimizing the squared Euclidean distances to the $y_{i}$, $i=1,\ldots,N$. Setting further the gradient with respect to $t\in\mathbb{R}^{K}$ to zero and allowing only orthonormal columns in $A$, the PCA problem becomes

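$$\hat{A}\in\operatorname*{arg\,min}_{A\in\mathbb{R}^{d,K},\,A^{\mathrm{T}}A=I_{K}}\sum_{i=1}^{N}\|P_{A}^{\perp}y_{i}\|^{2},$$

where

$$P_{A}^{\perp}\coloneqq I_{d}-AA^{\mathrm{T}}$$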

denotes the orthogonal projection onto $\mathcal{R}(A)^{\perp}=\mathcal{N}(A^{\mathrm{T}})$ and $I_{d}$ the $d\times d$ identity matrix. Here $\mathcal{R}(A)$ denotes the range and $\mathcal{N}(A)$ the kernel of $A$ . One convenient property of PCA is the nestedness of the PCA subspaces, i.e., for $K<\tilde{K}\leq d$ , the optimal $K$ -dimensional PCA subspace is contained in the $\tilde{K}$ -dimensional one. Furthermore, it is very fast, as the columns of an optimal $\hat{A}$ are the eigenvectors corresponding to the $K$ largest eigenvalues of the empirical covariance matrix $\frac{1}{N-1}\sum_{i=1}^{N}y_{i}y_{i}^{\mathrm{T}}$ which can be computed in polynomial time.
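The eigenvector characterization and the nestedness property can be illustrated with a short NumPy sketch. The function names `pca_subspace` and `squared_residual` and the synthetic data are assumptions made for this illustration only; they do not appear in the paper.

```python
import numpy as np

def pca_subspace(X, K):
    """Return the mean and a d x K orthonormal basis of the K-dimensional PCA subspace.

    X has shape (N, d), one data point per row.
    """
    b_bar = X.mean(axis=0)                   # offset shared by all optimal affine subspaces
    Y = X - b_bar                            # centered data y_i = x_i - b_bar
    cov = Y.T @ Y / (X.shape[0] - 1)         # empirical covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    A_hat = eigvecs[:, ::-1][:, :K]          # eigenvectors of the K largest eigenvalues
    return b_bar, A_hat

def squared_residual(Y, A):
    """Sum of squared distances of the rows of Y to the subspace spanned by A."""
    R = Y - Y @ A @ A.T                      # rows are the projections onto R(A)^perp
    return float(np.sum(R ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
b_bar, A2 = pca_subspace(X, 2)
_, A3 = pca_subspace(X, 3)
Y = X - b_bar

# Nestedness: projecting A2 onto the 3-dimensional PCA subspace leaves it unchanged.
nested = np.allclose(A3 @ A3.T @ A2, A2)
```

Here `nested` evaluates to `True`, since the two-dimensional basis consists of the first two columns of the three-dimensional one.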

Unfortunately, PCA is sensitive to outliers appearing quite often in real-world data sets, see Fig. 1. A lot of different methods in robust statistics [14, 24, 29] and optimization were proposed to make the dimensionality reduction more robust. One possibility consists of removing outliers before computing the principal components which has the serious drawback that outliers are difficult to identify and other data points are often falsely labeled as outliers. Another approach assigns different weights to data points based on their estimated relevancy, to get a weighted PCA, see, e.g. [18]. The RANSAC algorithm [10] repeatedly estimates the model parameters from a random subset of the data points until a satisfactory result is obtained as indicated by the number of data points within a certain error threshold. In a similar vein, least trimmed squares PCA models [38, 40] aim to exclude outliers from the squared error function, but in a deterministic way. The variational model in [6] decomposes the data matrix $Y=(y_{1}\ldots y_{N})$ into a low rank and a sparse part. Related approaches such as [31, 43] separate the low rank component from the column sparse one using different norms in the variational model. However, such a decomposition is not always realistic. From a statistical point of view, robust subspace recovery can be done via Tyler’s M-estimator [41, 44], a special case of M-estimators of covariance. But due to the large number of variables to be estimated for the scatter matrix, this approach is not feasible for high dimensional data. Another group of robust PCA approaches replaces the squared $L_{2}$ norm in the PCA by the $L_{1}$ norm. Then, the minimization of the energy functional can be addressed by linear programming, see, e.g., Ke and Kanade [16]. Unfortunately, this norm is not rotationally invariant meaning that in general for an orthogonal matrix $Q$ we have $\|Qx\|_{1}\neq\|x\|_{1}$ . 
Consequently, when rotating the centered data points $y_{i}$ , the minimizing subspace is not rotated in the same way.
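A two-dimensional example makes the missing rotational invariance concrete. The rotation angle and the test vector in the following sketch are arbitrary choices for illustration.

```python
import numpy as np

theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation by 45 degrees
x = np.array([1.0, 0.0])

# The L1 norm changes under the rotation, the Euclidean norm does not.
l1_before, l1_after = np.linalg.norm(x, 1), np.linalg.norm(Q @ x, 1)
l2_before, l2_after = np.linalg.norm(x, 2), np.linalg.norm(Q @ x, 2)
```

The rotated vector is $(1/\sqrt{2},1/\sqrt{2})^{\mathrm{T}}$, so its $L_{1}$ norm is $\sqrt{2}$ while both Euclidean norms equal $1$.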

In this paper, the focus lies on the model

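$$(\hat{A},\hat{b})\in\operatorname*{arg\,min}_{A\in\mathbb{R}^{d,K},\,b\in\mathbb{R}^{d}}\sum_{i=1}^{N}\min_{t\in\mathbb{R}^{K}}\|At+b-x_{i}\|, \tag{3}$$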

where in contrast to (1) we do not square the Euclidean norm. First of all, let us mention that the determination of the offset $b\in\mathbb{R}^{d}$ is not straightforward now. Frequently, the geometric median of the data is used as offset, which is in general not a minimizer of (3), see [33, Sect. 5]. However, in this paper it is assumed that $b\in\mathbb{R}^{d}$ is fixed and already subtracted from the data. Then, (3) reduces to

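$$\hat{A}\in\operatorname*{arg\,min}_{A\in\mathbb{R}^{d,K},\,A^{\mathrm{T}}A=I_{K}}\sum_{i=1}^{N}\|P_{A}^{\perp}y_{i}\|. \tag{4}$$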

A slightly different form of this model became popular under the name rotational invariant $L_{1}$ -norm PCA by a paper of Ding et al. [8].

It is important to note that in contrast to the classical PCA the hierarchical structure of the approach is lost. This is exemplified in Fig. 2. We mention that several models applying the deflation technique of standard PCA in a robust setting were also provided in the literature. These models are not of interest in this paper, but we refer to the book [13, p. 203] and the collection of papers [12, 17, 20, 23, 25, 31, 35] which is clearly not complete.

Despite the loss of the nestedness property and the non-smooth target function which is harder to optimize, the robustness makes (4) an attractive choice in practical problems. Examples of improved performance compared to standard PCA and other robust alternatives can be found in, e.g., [8, 22]. Furthermore, in [30] it was shown that under certain assumptions on the given so-called inlier and outlier data, by minimizing (4) we can recover the exact underlying subspace spanned by the inlier data points. Additionally, in contrast to convex relaxation methods, the approach is well suited for high dimensional data appearing, e.g., in image processing.

The minimization of (4) has been treated before. Ding et al. [8] suggest a constrained minimization based on a power method without convergence proof. In [30] the optimization problem is tackled by a geodesic gradient descent approach on the Grassmannian and under certain assumptions on the data, local convergence to the global minimizer is shown for an appropriate starting step size and initial iterate. A tight convex relaxation was studied in [23], where the projection matrix $AA^{\mathrm{T}}$ is estimated. Due to the much larger size of the projection matrix, this approach is not suitable for high-dimensional data. An approach based on iteratively reweighted least squares can be found in [21], where standard PCA is repeatedly applied to rescaled data points. For the special case of one-dimensional subspaces ( $K=1$ ), a Weiszfeld-like algorithm with convergence was given in [33].

In this paper, the iterative algorithm of Ding et al. [8] is first interpreted as a conditional gradient algorithm, also known as Frank-Wolfe algorithm, which only implies a certain convergence behavior of subsequences of the iterates. Recalling that we are not interested in the columns of the minimizer $\hat{A}$ in (4) itself, but just in the $K$ -dimensional subspace spanned by them, we show that the algorithm can be recast as a gradient descent algorithm on the Grassmannian manifold. This enables us to prove global convergence of the whole sequence of iterates under mild assumptions.

The paper is organized as follows: In Section 2, we recall preliminaries on Stiefel and Grassmannian manifolds. We discuss important properties of the robust PCA functional in Section 3. Section 4 shows the equivalence of the algorithm of Ding et al. [8], the conditional gradient algorithm and a gradient descent algorithm on Grassmannians. The proof of global convergence of the whole sequence of iterates under some restrictions on the so-called anchor points is given in Section 5. Finally, Section 6 addresses the topic of anchor points.

2 Preliminaries on Stiefel and Grassmannian Manifolds

In this section, we briefly provide the basic notation on Stiefel and Grassmannian manifolds which is required in our approach. Good references on the topic, in particular for optimization on these manifolds, are [1, 9].

Let $K\leq d$ . The (compact) Stiefel manifold is defined by

$$S_{d,K}\coloneqq\{A\in\mathbb{R}^{d,K}\colon A^{\mathrm{T}}A=I_{K}\}.$$

For $K=1$ , it coincides with the unit sphere $S_{d,1}=S^{d-1}$ in $\mathbb{R}^{d}$ and for $K=d$ with the orthogonal matrices $S_{d,d}=O(d)$ . The tangent space at $A\in S_{d,K}$ is given by

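$$T_{A}S_{d,K}=\{H\in\mathbb{R}^{d,K}\colon H=AX+A_{\perp}Z,\ X\in\mathbb{R}^{K,K}\ \text{skew symmetric},\ Z\in\mathbb{R}^{d-K,K}\},$$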

where $A_{\perp}$ denotes a matrix with orthonormal columns which are in addition orthogonal to the columns of $A$ . There are two common ways to define inner products on the tangent space such that $S_{d,K}$ becomes a Riemannian manifold, namely

i) the Frobenius inner product $\langle H_{1},H_{2}\rangle_{F}\coloneqq\operatorname{tr}(H_{1}^{\mathrm{T}}H_{2})$, or

ii) the canonical inner product $\langle H_{1},H_{2}\rangle_{A}\coloneqq\operatorname{tr}\bigl(H_{1}^{\mathrm{T}}(I_{d}-\frac{1}{2}AA^{\mathrm{T}})H_{2}\bigr)$.

In the rest of this paper $\|A\|=\operatorname{tr}(A^{\mathrm{T}}A)^{\frac{1}{2}}$ always denotes the Frobenius norm of $A$. The first inner product appears when considering $S_{d,K}$ as a submanifold of the Euclidean space $\mathbb{R}^{d,K}$, while the second one relies on the quotient structure $S_{d,K}=O(d)/O(d-K)$. We are mainly interested in the $K$-dimensional subspace spanned by the columns of $A\in S_{d,K}$, which does not change if we multiply $A$ from the right with an orthogonal matrix $Q\in O(K)$. This is captured by the Grassmannian manifold, or just Grassmannian, which can be defined as a quotient manifold of the Stiefel manifold $G_{d,K}\coloneqq S_{d,K}/O(K)$. The equivalence classes $[A]\coloneqq\{AQ\colon Q\in O(K)\}$ belonging to $G_{d,K}$ can be represented by elements $A$ of the Stiefel manifold. The tangent space of $G_{d,K}$ at $[A]$ can be identified with its horizontal lift at $A$,

$$T_{[A]}G_{d,K}\coloneqq\{A_{\perp}Z\colon Z\in\mathbb{R}^{d-K,K}\}.$$

Further, the Grassmannian becomes a Riemannian manifold by reducing the Riemannian metrics in i) or equivalently ii) to $T_{[A]}G_{d,K}$ , i.e., for any representative $A\in S_{d,K}$ and $H_{1},H_{2}\in T_{[A]}G_{d,K}$ ,

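$$\langle H_{1},H_{2}\rangle_{[A]}\coloneqq\langle H_{1},H_{2}\rangle_{A}=\operatorname{tr}\Bigl(H_{1}^{\mathrm{T}}\bigl(I_{d}-\tfrac{1}{2}AA^{\mathrm{T}}\bigr)H_{2}\Bigr)=\operatorname{tr}\bigl(H_{1}^{\mathrm{T}}H_{2}\bigr)=\langle H_{1},H_{2}\rangle_{F}.$$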
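The two inner products coincide on horizontal vectors $H=A_{\perp}Z$ because $A^{\mathrm{T}}H=0$ there. The following sketch checks this numerically and shows that they differ for a general Stiefel tangent vector. The helper names `frobenius_ip` and `canonical_ip` and the random test matrices are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 6, 2
Q_full, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal d x d matrix
A, A_perp = Q_full[:, :K], Q_full[:, K:]            # A in S_{d,K}; A_perp spans R(A)^perp

def frobenius_ip(H1, H2):
    """Frobenius inner product tr(H1^T H2)."""
    return float(np.trace(H1.T @ H2))

def canonical_ip(A, H1, H2):
    """Canonical inner product tr(H1^T (I_d - A A^T / 2) H2)."""
    d = A.shape[0]
    return float(np.trace(H1.T @ (np.eye(d) - 0.5 * A @ A.T) @ H2))

# Horizontal tangent vectors H = A_perp Z.
H1 = A_perp @ rng.normal(size=(d - K, K))
H2 = A_perp @ rng.normal(size=(d - K, K))

# General Stiefel tangent vector G = A X + A_perp Z with X skew-symmetric.
B = rng.normal(size=(K, K))
X = B - B.T
G = A @ X + A_perp @ rng.normal(size=(d - K, K))
```

For the horizontal pair `H1`, `H2` both inner products agree, whereas for `G` the canonical inner product is smaller than the Frobenius one by $\frac{1}{2}\|X\|^{2}$.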

A possible choice for a metric on the Grassmannian is given by

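$$d_{G_{d,K}}\bigl([A_{1}],[A_{2}]\bigr)\coloneqq\bigl\|A_{1}A_{1}^{\mathrm{T}}-A_{2}A_{2}^{\mathrm{T}}\bigr\|_{2},$$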

where $A_{1},A_{2}\in S_{d,K}$ and $\|\cdot\|_{2}$ is the spectral norm.
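Since this metric is the spectral norm of the difference of the projectors $A_{1}A_{1}^{\mathrm{T}}$ and $A_{2}A_{2}^{\mathrm{T}}$, it does not depend on the chosen representatives. A minimal sketch, with the hypothetical helper `grassmann_distance` and random test matrices chosen only for illustration:

```python
import numpy as np

def grassmann_distance(A1, A2):
    """Spectral norm of the difference of the orthogonal projectors onto R(A1) and R(A2)."""
    return float(np.linalg.norm(A1 @ A1.T - A2 @ A2.T, 2))

rng = np.random.default_rng(3)
d, K = 5, 2
A, _ = np.linalg.qr(rng.normal(size=(d, K)))   # representative in S_{d,K}
Q, _ = np.linalg.qr(rng.normal(size=(K, K)))   # element of O(K)

same_class = grassmann_distance(A, A @ Q)      # A and A Q span the same subspace

e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
orthogonal_lines = grassmann_distance(e1, e2)  # spectral norm of diag(1, -1)
```

Up to rounding, `same_class` is zero and `orthogonal_lines` equals one.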

In PCA we aim to find an optimal subspace, which means that we are interested in elements of Grassmannians. In practice, working with equivalence classes is difficult and hence calculations are performed with representatives on the Stiefel manifold.

The proposed optimization algorithms involve the orthogonal projection $\Pi_{S_{d,K}}\colon\mathbb{R}^{d,K}\rightarrow S_{d,K}$ , i.e.,

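$$\Pi_{S_{d,K}}(M)=\operatorname*{arg\,min}_{O\in S_{d,K}}\|M-O\|^{2}=\operatorname*{arg\,max}_{O\in S_{d,K}}\langle O,M\rangle.$$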

To this end, recall that the polar decomposition of a matrix $M\in\mathbb{R}^{d,K}$ is given by $M=QS$, where $Q\in S_{d,K}$ and $S\in\mathbb{R}^{K,K}$ is symmetric and positive semi-definite. Starting with the (economy-size) singular value decomposition $M=U\Sigma V^{\mathrm{T}}$, where $U\in S_{d,K}$, $\Sigma\in\mathbb{R}^{K,K}$ is a diagonal matrix and $V\in S_{K,K}$, the polar decomposition is determined by $Q\coloneqq\operatorname{Polar}(M)\coloneqq UV^{\mathrm{T}}$ and $S\coloneqq V\Sigma V^{\mathrm{T}}$. The following lemma can be found, e.g., in [15, 38].

Lemma 2.1.

The orthogonal projection $\Pi_{S_{d,K}}\colon\mathbb{R}^{d,K}\rightarrow S_{d,K}$ is given by

$$\Pi_{S_{d,K}}(M)=\operatorname{Polar}(M).$$

If $M$ has full rank, then $\operatorname{Polar}(M)=M(M^{\mathrm{T}}M)^{-\frac{1}{2}}$ .
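The polar factor can be computed from the economy-size SVD as described before the lemma. The following sketch, with an assumed helper name `polar` and a random full-rank test matrix, compares it with the closed form $M(M^{\mathrm{T}}M)^{-\frac{1}{2}}$.

```python
import numpy as np

def polar(M):
    """Orthogonal polar factor U V^T from the economy-size SVD M = U Sigma V^T."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(2)
M = rng.normal(size=(6, 3))       # full rank with probability one
Q = polar(M)

# Closed form for full-rank M, using the eigendecomposition of M^T M.
w, V = np.linalg.eigh(M.T @ M)
Q_closed = M @ (V @ np.diag(w ** -0.5) @ V.T)

# Symmetric positive semi-definite factor S with M = Q S.
S = Q.T @ M
```

Both routes give the same matrix up to rounding, and `S` is symmetric with nonnegative eigenvalues.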

3 Model Analysis

The main focus of this section lies on investigating the objective function in (4) and a related function with respect to differentiability and convexity. To be precise, we are actually only interested in minimizing over equivalence classes $[A]\coloneqq\{AQ\colon Q\in O(K)\}$. Besides the objective function

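$$E(A)=\sum_{i=1}^{N}E_{i}(A)\coloneqq\sum_{i=1}^{N}\|P_{A}^{\perp}y_{i}\|=\sum_{i=1}^{N}\bigl\|(I_{d}-AA^{\mathrm{T}})y_{i}\bigr\|, \tag{7}$$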

we also deal with the function

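$$F(A)=\sum_{i=1}^{N}F_{i}(A)\coloneqq\sum_{i=1}^{N}\sqrt{y_{i}^{\mathrm{T}}P_{A}^{\perp}y_{i}}=\sum_{i=1}^{N}\sqrt{\|y_{i}\|^{2}-\|A^{\mathrm{T}}y_{i}\|^{2}}. \tag{8}$$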

Clearly, these two functions take the same values on the Stiefel manifold $S_{d,K}$ . However, they have quite different properties as functions on $\mathbb{R}^{d,K}$ , but this was often neglected in existing approaches. While $E$ is well-defined on the whole $\mathbb{R}^{d,K}$ , the function $F$ is only well defined on the closed domain

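$$\mathcal{D}\coloneqq\bigcap_{i=1}^{N}\mathcal{D}_{i},\qquad\mathcal{D}_{i}\coloneqq\{A\in\mathbb{R}^{d,K}\colon\|y_{i}\|^{2}-\|A^{\mathrm{T}}y_{i}\|^{2}\geq 0\} \tag{9}$$

and therefore it is extended to the whole $\mathbb{R}^{d,K}$ by

$$F(A)\coloneqq-\infty\quad\text{if}\quad A\notin\mathcal{D}. \tag{10}$$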

For $A\in S_{d,K}$ and all $y\in\mathbb{R}^{d}$, it holds $\|A^{\mathrm{T}}y\|\leq\|y\|$ so that $S_{d,K}\subset{\mathcal{D}}$. Further, $A\in S_{d,K}\cap\partial{\mathcal{D}}$ if and only if $\|P_{A}^{\perp}y_{i}\|=0$ for some $i\in\{1,\ldots,N\}$. The compact subset of $\mathbb{R}^{d,K}$

$$\mathcal{A}\coloneqq\bigl\{A\in S_{d,K}\colon\|P_{A}^{\perp}y_{i}\|=0\ \text{for some}\ i\in\{1,\ldots,N\}\bigr\}$$

is called anchor set.

In the simple case $N=d=K=1$ and $y_{1}=1$ , the above functions read $E(A)=|1-A^{2}|$ and $F(A)=\sqrt{1-A^{2}}$ with $A\in\mathbb{R}$ . While the first function is locally Lipschitz continuous on $[-1,1]$ , the second one does not have this property at $A=\pm 1$ . The following two lemmata state properties of $E$ and $F$ .
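The relation between the two functions can be checked numerically: they agree on the Stiefel manifold and differ elsewhere. The sketch below assumes $E(A)=\sum_{i}\|(I_{d}-AA^{\mathrm{T}})y_{i}\|$ and $F(A)=\sum_{i}\sqrt{\|y_{i}\|^{2}-\|A^{\mathrm{T}}y_{i}\|^{2}}$ with $F=-\infty$ outside its domain; the random data and the function names are choices made for this illustration.

```python
import numpy as np

def E(A, Y):
    """Sum of ||(I - A A^T) y_i|| over the columns y_i of Y."""
    R = Y - A @ (A.T @ Y)
    return float(np.sum(np.linalg.norm(R, axis=0)))

def F(A, Y):
    """Sum of sqrt(||y_i||^2 - ||A^T y_i||^2), or -inf outside the domain D."""
    gap = np.sum(Y ** 2, axis=0) - np.sum((A.T @ Y) ** 2, axis=0)
    if np.any(gap < 0):
        return -np.inf
    return float(np.sum(np.sqrt(gap)))

rng = np.random.default_rng(4)
d, K, N = 4, 2, 10
Y = rng.normal(size=(d, N))
A, _ = np.linalg.qr(rng.normal(size=(d, K)))   # point on the Stiefel manifold

on_stiefel = (E(A, Y), F(A, Y))                # the two values agree up to rounding

# Scalar case N = d = K = 1, y_1 = 1: E(A) = |1 - A^2|, F(A) = sqrt(1 - A^2).
y1 = np.array([[1.0]])
scalar_inside = (E(np.array([[0.5]]), y1), F(np.array([[0.5]]), y1))
scalar_outside = (E(np.array([[2.0]]), y1), F(np.array([[2.0]]), y1))
```

At $A=0.5$ the values are $0.75$ and $\sqrt{0.75}$, and at $A=2$ the function $E$ equals $3$ while $F$ is $-\infty$.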

Lemma 3.1.

The function $E$ defined by (7) is locally Lipschitz continuous on $\mathbb{R}^{d,K}$ .

Proof.

It suffices to show the property for the summands $E_{i}$ . For an arbitrary fixed $A\in S_{d,K}$ , let $\|A-A_{i}\|\leq\varepsilon$ , $i=1,2$ . Then, we obtain

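$$\begin{aligned}|E_{i}(A_{1})-E_{i}(A_{2})|&\leq\bigl\|A_{1}A_{1}^{\mathrm{T}}-A_{2}A_{2}^{\mathrm{T}}\bigr\|\,\|y_{i}\|\\&=\frac{1}{2}\bigl\|(A_{1}-A_{2})(A_{1}^{\mathrm{T}}+A_{2}^{\mathrm{T}})+(A_{1}+A_{2})(A_{1}^{\mathrm{T}}-A_{2}^{\mathrm{T}})\bigr\|\,\|y_{i}\|\\&\leq 2(\|A\|+\varepsilon)\|y_{i}\|\,\|A_{1}-A_{2}\|.\end{aligned}$$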

∎

In general, the function $E$ is neither convex nor concave on ${\mathcal{D}}$ , see Fig. 3.

In contrast, $F$ is concave as the following lemma shows.

Lemma 3.2.

The function $F$ defined by (8) and (10) fulfills the following relations:

1. $\mathrm{dom}(-F)\coloneqq\{A\in\mathbb{R}^{d,K}\colon -F(A)<+\infty\}={\mathcal{D}}$ is convex.

2. $-F$ is convex.

3. The subdifferential of $-F$ is empty at the boundary of ${\mathcal{D}}$, i.e. at $A\in\mathbb{R}^{d,K}$ with $\|y_{i}\|^{2}-\|A^{\mathrm{T}}y_{i}\|^{2}=0$ for some $i\in\{1,\ldots,N\}$.

Proof.

1. It holds $A\in\mathrm{dom}(-F)$ if and only if

$$\|y_{i}\|^{2}-\|A^{\mathrm{T}}y_{i}\|^{2}\geq 0 \tag{12}$$

for all $i=1,\ldots,N$ . Since the intersection of convex sets is convex, it suffices to show convexity of $\mathrm{dom}(-F_{i})$ separately. Let $A_{1},A_{2}\in\mathrm{dom}(-F_{i})$ . Then, using (12), we obtain for $\lambda\in[0,1]$ that

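$$\begin{aligned}&\|y_{i}\|^{2}-\|(\lambda A_{1}+(1-\lambda)A_{2})^{\mathrm{T}}y_{i}\|^{2}\\&=\|y_{i}\|^{2}-\bigl(\lambda^{2}\|A_{1}^{\mathrm{T}}y_{i}\|^{2}+2\lambda(1-\lambda)\langle A_{1}^{\mathrm{T}}y_{i},A_{2}^{\mathrm{T}}y_{i}\rangle+(1-\lambda)^{2}\|A_{2}^{\mathrm{T}}y_{i}\|^{2}\bigr)\\&\geq\|y_{i}\|^{2}-\bigl(\lambda^{2}\|A_{1}^{\mathrm{T}}y_{i}\|^{2}+2\lambda(1-\lambda)\|A_{1}^{\mathrm{T}}y_{i}\|\,\|A_{2}^{\mathrm{T}}y_{i}\|+(1-\lambda)^{2}\|A_{2}^{\mathrm{T}}y_{i}\|^{2}\bigr)\\&\geq\|y_{i}\|^{2}-\bigl(\lambda^{2}\|y_{i}\|^{2}+2\lambda(1-\lambda)\|y_{i}\|^{2}+(1-\lambda)^{2}\|y_{i}\|^{2}\bigr)=0.\end{aligned}$$

Thus, $\lambda A_{1}+(1-\lambda)A_{2}\in\mathrm{dom}(-F_{i})$ and the claim follows.

2. Since the sum of concave functions is concave again, it suffices to consider the individual summands $F_{i}$ again. For $\varepsilon>0$, we define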

$$F_{\varepsilon}(A)\coloneqq\sqrt{\|y_{i}\|^{2}-\|A^{\mathrm{T}}y_{i}\|^{2}+\varepsilon},$$

which is differentiable on an open set containing ${\mathcal{D}}_{i}$ . By the chain rule and since $\frac{\partial}{\partial A}\operatorname{tr}(y_{i}^{\mathrm{T}}AA^{\mathrm{T}}y_{i})=2y_{i}y_{i}^{\mathrm{T}}A$ , the gradient of $F_{\varepsilon}$ is

$$\nabla F_{\varepsilon}(A)=-\frac{1}{F_{\varepsilon}(A)}y_{i}y_{i}^{\mathrm{T}}A.$$

Using the product rule and the chain rule, the Hessian is given by

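$$\nabla^{2}F_{\varepsilon}(A)[H]=-\frac{1}{F_{\varepsilon}(A)^{2}}\Bigl(y_{i}y_{i}^{\mathrm{T}}HF_{\varepsilon}(A)+\frac{1}{F_{\varepsilon}(A)}\langle y_{i}y_{i}^{\mathrm{T}}A,H\rangle y_{i}y_{i}^{\mathrm{T}}A\Bigr),$$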

for all $H\in\mathbb{R}^{d,K}$ so that

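$$\bigl\langle\nabla^{2}F_{\varepsilon}(A)[H],H\bigr\rangle=-\frac{1}{F_{\varepsilon}(A)^{2}}\Bigl(F_{\varepsilon}(A)\|H^{\mathrm{T}}y_{i}\|^{2}+\frac{1}{F_{\varepsilon}(A)}\langle y_{i}y_{i}^{\mathrm{T}}A,H\rangle^{2}\Bigr)\leq 0.$$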

Consequently, the Hessian is negative semidefinite and $F_{\varepsilon}$ concave in ${\mathcal{D}}_{i}$ for all $\varepsilon>0$ . Finally,

$$F_{i}=\inf\{F_{\varepsilon}\colon\varepsilon>0\}$$

is concave as the pointwise infimum of a family of concave functions.

3. For an arbitrary fixed $i\in\{1,\ldots,N\}$, let $A_{0}\in\mathbb{R}^{d,K}$ with $\|y_{i}\|^{2}-\|A_{0}^{\mathrm{T}}y_{i}\|^{2}=0$. We consider the subdifferential of $F_{i}$ at $A_{0}$ given by

[TABLE]

Choosing $A\coloneqq\alpha A_{0}$ with $\alpha\in[0,1]$ , a subgradient $P$ must fulfill

[TABLE]

which leads to a contradiction if $\alpha\rightarrow 1$ . Hence, the subdifferential is empty.

∎

For the algorithms the gradient and the Riemannian gradient on the Grassmannian of the functions $E$ and $F$ are required.

Lemma 3.3.

Let $E$ and $F$ be defined by (7) and (8), respectively. Then, the gradient $\nabla$ and the Riemannian gradient $\nabla_{A}$ on $S_{d,K}$ at $A\in S_{d,K}\backslash{\mathcal{A}}$ are given by

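$$\nabla F(A)=-C_{A}A,\qquad\nabla E(A)=\nabla_{A}E(A)=\nabla_{A}F(A)=-P_{A}^{\perp}C_{A}A,$$

where

$$C_{A}\coloneqq\sum_{i=1}^{N}\frac{1}{\|P_{A}^{\perp}y_{i}\|}\,y_{i}y_{i}^{\mathrm{T}}.$$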

Note that $-P_{A}^{\perp}\,C_{A}\,A$ is also the horizontal lift of the gradient $\nabla_{[A]}\tilde{E}([A])$ on the Grassmannian at $A$, where $E=\tilde{E}\circ\pi$ and $\pi$ is the projection from $S_{d,K}$ onto $G_{d,K}$.

Proof.

By straightforward computation we obtain for $A\in\mathbb{R}^{d,K}$ that

[TABLE]

For $A\in S_{d,K}$ the gradient of $E_{i}$ coincides with the Riemannian gradient of $E_{i}$ on $S_{d,K}$ , i.e.,

[TABLE]

This implies the assertion. ∎
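A finite-difference check can make these gradient formulas plausible. The sketch assumes, in line with the expression $\nabla_{A}F(A)=-P_{A}^{\perp}C_{A}A$ recalled in the proof of Corollary 4.2 and with the matrix $C_{A}=\sum_{i}\|P_{A}^{\perp}y_{i}\|^{-1}y_{i}y_{i}^{\mathrm{T}}$ from Remark 4.4, that $\nabla F(A)=-C_{A}A$ and $\nabla E(A)=-P_{A}^{\perp}C_{A}A$ at a non-anchor point $A\in S_{d,K}$. All function names and the random data are hypothetical.

```python
import numpy as np

def E(A, Y):
    """E(A) = sum_i ||(I - A A^T) y_i|| for data stored as columns of Y."""
    return float(np.sum(np.linalg.norm(Y - A @ (A.T @ Y), axis=0)))

def F(A, Y):
    """F(A) = sum_i sqrt(||y_i||^2 - ||A^T y_i||^2), evaluated inside its domain."""
    gap = np.sum(Y ** 2, axis=0) - np.sum((A.T @ Y) ** 2, axis=0)
    return float(np.sum(np.sqrt(gap)))

def weighted_scatter(A, Y):
    """C_A = sum_i y_i y_i^T / ||P_A^perp y_i|| (A must not be an anchor point)."""
    w = 1.0 / np.linalg.norm(Y - A @ (A.T @ Y), axis=0)
    return (Y * w) @ Y.T

def numerical_gradient(f, A, h=1e-6):
    """Central finite differences in the ambient space R^{d,K}."""
    G = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        step = np.zeros_like(A)
        step[idx] = h
        G[idx] = (f(A + step) - f(A - step)) / (2 * h)
    return G

rng = np.random.default_rng(5)
d, K, N = 4, 2, 12
Y = rng.normal(size=(d, N))
A, _ = np.linalg.qr(rng.normal(size=(d, K)))

C_A = weighted_scatter(A, Y)
grad_F = -C_A @ A                                  # Euclidean gradient of F
riem_grad = -(np.eye(d) - A @ A.T) @ C_A @ A       # gradient of E, Riemannian gradient of F

num_grad_F = numerical_gradient(lambda B: F(B, Y), A)
num_grad_E = numerical_gradient(lambda B: E(B, Y), A)
```

The numerical gradients match `grad_F` and `riem_grad` up to discretization error, and `A.T @ riem_grad` vanishes, so the Riemannian gradient is a horizontal vector.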

We call $A\in S_{d,K}\backslash\mathcal{A}$ a critical point of $F$ , resp. $E$ if

[TABLE]

4 Minimization Algorithm

In this section, we show that the constrained minimization algorithm of Ding et al. [8] can be interpreted as i) a conditional gradient algorithm, and ii) a gradient descent algorithm on Grassmannians. The conditional gradient algorithm, also known as the Frank-Wolfe algorithm, was originally proposed in 1956 in [11] for solving linearly constrained quadratic programs and was later adapted to other problems. For a good overview, we refer to [27] and the references therein. It then follows from general results on the algorithm that, under strong conditions on the anchor points, a subsequence of the iterates converges to a critical point of the functional.

Constrained Minimization Algorithm.

Ding et al. [8] consider the constrained optimization problem

$$\min_{A\in\mathbb{R}^{d,K}}F(A)\quad\text{subject to}\quad A^{\mathrm{T}}A=I_{K}. \tag{16}$$

The authors claimed without proof that the function $F$ is convex in $AA^{\mathrm{T}}$ and has a unique global minimizer. Neither statement is correct: for $N=K=d=1$ and $y_{1}=1$ it is easy to check that $F(A)=\sqrt{1-A^{2}}$ is concave in $A^{2}\in[0,1]$; for $N=2$, $K=1$, $d=2$ with centered data points $y_{1}=(-1/2,\sqrt{3}/2)^{\mathrm{T}}$ and $y_{2}=(1/2,\sqrt{3}/2)^{\mathrm{T}}$ the minimizers of $F(A)$ are given by $A=(-1/2,\sqrt{3}/2)^{\mathrm{T}}$ and $A=(1/2,\sqrt{3}/2)^{\mathrm{T}}$, which span different subspaces. Penalizing the constraint in (16) via a symmetric Lagrange multiplier $\Lambda\in\mathbb{R}^{K,K}$, setting the gradient of the resulting Lagrangian $L(A,\Lambda)\coloneqq F(A)+\langle\Lambda,A^{\mathrm{T}}A-I_{K}\rangle$ with respect to $A$ to zero and applying an orthogonalization procedure, see Lemma 2.1, the authors arrive at the following iteration scheme: if $A^{(r)}\not\in\mathcal{A}$,

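$$A^{(r+1)}\coloneqq\operatorname{Polar}\bigl(C_{A^{(r)}}A^{(r)}\bigr)=C_{A^{(r)}}A^{(r)}\bigl((A^{(r)})^{\mathrm{T}}C_{A^{(r)}}^{2}A^{(r)}\bigr)^{-\frac{1}{2}}. \tag{17}$$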
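The following sketch runs such an iteration on synthetic data. It assumes that one step has the form $A^{(r+1)}=\operatorname{Polar}(C_{A^{(r)}}A^{(r)})$ with $C_{A}=\sum_{i}\|P_{A}^{\perp}y_{i}\|^{-1}y_{i}y_{i}^{\mathrm{T}}$, which is consistent with the fixed-point condition $C_{A}A=A(A^{\mathrm{T}}C_{A}^{2}A)^{\frac{1}{2}}$ used in the proof of Corollary 4.2. The function names, the data model with inliers and outliers, and the fixed iteration count are assumptions made for this illustration; anchor points are not handled.

```python
import numpy as np

def polar(M):
    """Orthogonal polar factor U V^T from the economy-size SVD M = U Sigma V^T."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def objective(A, Y):
    """E(A) = sum_i ||(I - A A^T) y_i|| for data stored as columns of Y."""
    return float(np.sum(np.linalg.norm(Y - A @ (A.T @ Y), axis=0)))

def weighted_scatter(A, Y):
    """C_A = sum_i y_i y_i^T / ||P_A^perp y_i||; undefined at anchor points."""
    w = 1.0 / np.linalg.norm(Y - A @ (A.T @ Y), axis=0)
    return (Y * w) @ Y.T

def fixed_point_iteration(Y, A0, n_iter=50):
    """Repeat A <- Polar(C_A A) and record the objective after every step."""
    A, values = A0, [objective(A0, Y)]
    for _ in range(n_iter):
        A = polar(weighted_scatter(A, Y) @ A)
        values.append(objective(A, Y))
    return A, values

rng = np.random.default_rng(0)
d, K = 5, 2
basis, _ = np.linalg.qr(rng.normal(size=(d, K)))
inliers = basis @ rng.normal(size=(K, 50)) + 0.05 * rng.normal(size=(d, 50))
outliers = 5.0 * rng.normal(size=(d, 10))
Y = np.hstack([inliers, outliers])          # noisy inliers near a plane plus outliers

A0, _ = np.linalg.qr(rng.normal(size=(d, K)))
A_final, values = fixed_point_iteration(Y, A0)
```

The recorded objective values are non-increasing, which matches the monotonicity statement of Corollary 4.2 as long as no iterate is an anchor point.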

Conditional Gradient Algorithm.

The conditional gradient algorithm is commonly used to minimize a convex function over a compact set. However, as in [27], we apply it for maximizing the convex function $-F$ .

In general, for a proper convex function $f\colon\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ and a nonempty, compact set ${\mathcal{C}}\subset\mathrm{int}(\operatorname{dom}f)$ , the conditional gradient algorithm is the update scheme

[TABLE]

Note that according to [39, Corollary 32.4.1], the value $\hat{u}\in{\mathcal{C}}$ is a local maximizer of $f$ over ${\mathcal{C}}$ if for all $v\in{\mathcal{C}}$ ,

[TABLE]

By definition of the subdifferential we have

[TABLE]

where the last equation follows by choosing $v=u^{(r)}\in{\mathcal{C}}$ .

For finite convex functions $f\colon\mathbb{R}^{n}\rightarrow\mathbb{R}$ the following convergence result was proved in [27] based on [28]. The proof can be modified for $f$ with values in the extended real numbers and ${\mathcal{C}}\subset\mathrm{int}(\operatorname{dom}f)$ in a straightforward way.

Theorem 4.1.

Let $f\colon\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\}$ be a proper convex function and ${\mathcal{C}}\subset\mathrm{int}(\operatorname{dom}f)$ be a nonempty, compact set. Then the sequence $\{f(u^{(r)})\}_{r}$ generated by (18) is strictly increasing except when $\max_{u\in{\mathcal{C}}}\langle u-u^{(r)},p^{(r)}\rangle=0$, in which case it terminates at $u^{(r)}$ satisfying (19). If $f$ is continuously differentiable on $\mathrm{int}(\operatorname{dom}f)$, then every accumulation point $\hat{u}$ of the sequence $\{u^{(r)}\}_{r}$ fulfills (19).

We want to apply the scheme (18) for $f\coloneqq-F$ and ${\mathcal{C}}={\mathcal{C}}_{\varepsilon}\coloneqq S_{d,K}\backslash{\mathcal{A}}_{\varepsilon}\subset\mathrm{int}\left(\operatorname{dom}(-F)\right)$ , where

[TABLE]

denotes the set of matrices in $S_{d,K}$ having a distance smaller than some $\varepsilon>0$ from the anchor set. To this end, the maximization problem (18) has to be solved, but first we have to find a suitable $\varepsilon$ . For this purpose define the iteration

[TABLE]

for $A^{(r)}\notin\mathcal{A}$ , where we plugged $\partial F(A^{(r)})=\{\nabla F(A^{(r)})\}$ as given in Lemma 3.3 into (18). Now, assume that we can find $\varepsilon>0$ such that $A^{(r)}\in\mathcal{C}_{\varepsilon}$ for all $r\geq 0$ . In this case

[TABLE]

Note that we can always find such an $\varepsilon$ for $r$ large enough if all accumulation points of $A^{(r)}$ as in (21) are non-anchor points. Using Lemma 2.1 we obtain

[TABLE]

which is exactly the iteration scheme (17) proposed by Ding et al. Based on Theorem 4.1, we have the following corollary for our special setting.

Corollary 4.2.

Let $F$ be defined by (8)-(10). Assume that the sequence $\{A^{(r)}\}_{r}$ generated by (22) has no element in ${\mathcal{A}}$ and that the set of accumulation points has a positive distance from ${\mathcal{A}}$ . Then the sequence $\{F\left(A^{(r)}\right)\}_{r}$ is strictly decreasing except for iterates where

[TABLE]

in which case the iteration terminates at $A^{(r)}$ which is a critical point. Condition (23) is equivalent to $A^{(r+1)}=A^{(r)}$ , resp. to $\nabla_{A^{(r)}}F(A^{(r)})=0$ . If the iteration does not terminate after a finite number of steps, every accumulation point of $\{A^{(r)}\}_{r}$ is a critical point.

Proof.

We show that the three stopping criteria are equivalent. Let $A\coloneqq A^{(r)}\not\in\mathcal{A}$ and recall that $\nabla_{A}F(A)=-P_{A}^{\perp}C_{A}A$.

If $A^{(r+1)}=A$, then $C_{A}A=A(A^{\mathrm{T}}C_{A}^{2}A)^{\frac{1}{2}}$ and thus $P_{A}^{\perp}C_{A}A=0$. If $P_{A}^{\perp}C_{A}A=0$, then $C_{A}A=A(A^{\mathrm{T}}C_{A}A)$ and thus

[TABLE]

Further, this implies (23).

Assume now that (23) is fulfilled. Then, we have

[TABLE]

On the other hand, we have with $\operatorname{tr}(AB)=\operatorname{tr}(BA)$ that

[TABLE]

where $\lambda_{\max}$ denotes the largest eigenvalue of the matrix $(A^{\mathrm{T}}C_{A}^{2}A)^{\frac{1}{2}}+A^{\mathrm{T}}C_{A}A$. Hence, $P_{A}^{\perp}C_{A}A=0$.

It remains to show that all accumulation points of infinite sequences are critical points. Let $\hat{A}\notin\mathcal{A}$ be an accumulation point of such a sequence. As the set of accumulation points has positive distance from $\mathcal{A}$ , we can choose $\varepsilon$ small enough such that all iterates are in $\mathrm{int}(\mathcal{C}_{\varepsilon})$ for $r$ large enough. Then, by Theorem 4.1, the accumulation point $\hat{A}$ fulfills (19) and as $E$ is differentiable in $\hat{A}$ , this implies that $\hat{A}$ is a critical point. ∎

Remark 4.3.

Unfortunately, the function $-F$ has no subdifferential at the boundary of its domain and $S_{d,K}$ touches this boundary in the anchor set. A remedy would be to use instead of $F$ the function

[TABLE]

By the proof of Lemma 3.2, we conclude that $-F_{\varepsilon}$ is convex on an open set which contains $\mathcal{D}$ and therefore also $S_{d,K}$ . Thus, accumulation points of the sequence produced by the conditional gradient algorithm are critical points by Theorem 4.1. Another idea consists of switching to a function with summands $\varphi(\sqrt{\|y_{i}\|^{2}-\|A^{\mathrm{T}}y_{i}\|^{2}})$ , where $\varphi$ is e.g. the Huber function as proposed in [8]. This approach is not pursued any further, since we are more interested in finding an algorithm for the original function without an additional parameter.

Gradient Descent Algorithm on $G_{d,K}$ .

By (15), a matrix $A\in S_{d,K}\setminus\mathcal{A}$ is a critical point of $E$ , resp. $F$ , on $S_{d,K}$ if and only if $P^{\scriptscriptstyle\kern-1.0pt\bot\kern-1.0pt}_{A}C_{A}A=0$ . This can be rewritten as

[TABLE]

where $S_{A}\in\mathbb{R}^{K,K}$ is assumed to be invertible, which is the case under the reasonable assumption that ${\mathcal{R}}(A)\subset\mathrm{span}(Y)$ and $\mathrm{dim}(\mathrm{span}(Y))\geq K$ , where $Y\coloneqq(y_{1}\,\ldots\,y_{N})$ .

Remark 4.4.

Note that $-\nabla_{A}E(A)S_{A}^{-1}=P^{\scriptscriptstyle\kern-1.0pt\bot\kern-1.0pt}_{A}C_{A}AS_{A}^{-1}\in T_{[A]}G_{d,K}\subset T_{A}S_{d,K}$ . Let $S_{A}=Q\Lambda Q^{\mathrm{T}}$ be an eigenvalue decomposition with eigenvalues $\lambda_{1}\geq\ldots\geq\lambda_{K}>0$ of $S_{A}$ in the diagonal matrix $\Lambda\in\mathbb{R}^{K,K}$ . Plugging this into (29), multiplying with $Q$ from the right and substituting $\tilde{A}\coloneqq AQ\in S_{d,K}$ we get

[TABLE]

so that the columns of $\tilde{A}$ are eigenvectors of $C_{\tilde{A}}=\sum_{i=1}^{N}\frac{1}{\|P^{\scriptscriptstyle\kern-1.0pt\bot\kern-1.0pt}_{\tilde{A}}y_{i}\|}y_{i}y_{i}^{\mathrm{T}}$ with eigenvalues $\lambda_{k}>0$ .

Using the same relations in (31), we arrive at

[TABLE]

which allows the interpretation of $\frac{1}{\lambda_{k}}$ , $k=1,\ldots,K$ as column-wise step sizes in the gradient descent iteration (32).

Together with Lemma 2.1, the fixed point equation (31) gives rise to the following descent scheme on $S_{d,K}$ , resp. $G_{d,K}$ :

[TABLE]

Note the strong connection of this iterative scheme to the Weiszfeld algorithm [4, 42, 34] and majorize-minimize strategies [7].
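A short calculation shows why a gradient step with step-size matrix $S_{A}^{-1}$ and the fixed point form lead to the same matrix. It is a sketch under the assumption that $S_{A}\coloneqq A^{\mathrm{T}}C_{A}A$ , which is inferred here from the critical point condition $C_{A}A=AA^{\mathrm{T}}C_{A}A$ , and it uses $\nabla_{A}E(A)=-P^{\bot}_{A}C_{A}A$ with $P^{\bot}_{A}=I_{d}-AA^{\mathrm{T}}$ :

$$
\begin{aligned}
A-\nabla_{A}E(A)\,S_{A}^{-1}
&=A+(I_{d}-AA^{\mathrm{T}})\,C_{A}A\,S_{A}^{-1}\\
&=A+C_{A}A\,S_{A}^{-1}-A\,(A^{\mathrm{T}}C_{A}A)\,S_{A}^{-1}\\
&=C_{A}A\,S_{A}^{-1}.
\end{aligned}
$$

Under this assumption, projecting the gradient step onto $S_{d,K}$ gives $\Pi_{S_{d,K}}(C_{A}AS_{A}^{-1})$ , which is the expression that appears in the proof of Lemma 4.5.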

By the following lemma, the gradient descent iteration (32) coincides with those of the conditional gradient algorithm (22) on the Grassmannian $G_{d,K}$ .

Lemma 4.5.

For the same input matrix $A^{(0)}$ , the iterates generated by the schemes (32) and (22) coincide on $G_{d,K}$ , i.e., they span the same subspace.

Proof.

Since $C_{AQ}=C_{A}$ for $Q\in O(K)$ , we observe that the matrices produced by the update schemes (32) and (22) span the same subspace as they differ only by a multiplication with an invertible matrix from the right. Since both iterates are in the Stiefel manifold, they can only differ by an orthogonal matrix, i.e., $\Pi_{S_{d,K}}(C_{A^{(r)}}A^{(r)}S_{A^{(r)}}^{-1})=\Pi_{S_{d,K}}(C_{A^{(r)}}A^{(r)})Q$ for some $Q\in O(K)$ . ∎
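The statement of Lemma 4.5 can be illustrated numerically by comparing the orthogonal projectors onto the two ranges. This is a minimal sketch on hypothetical random data, assuming $S_{A}=A^{\mathrm{T}}C_{A}A$ and the polar factor as projection $\Pi_{S_{d,K}}$ . The names `step_22` and `step_32` are illustrative labels for the two schemes.

```python
import numpy as np

def polar_projection(M):
    # Orthogonal polar factor of a full-rank matrix M.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def weighted_covariance(A, Y):
    # C_A = sum_i y_i y_i^T / ||P_A^perp y_i||
    residual_norms = np.linalg.norm(Y - A @ (A.T @ Y), axis=0)
    return (Y / residual_norms) @ Y.T

def step_22(A, Y):
    # Conditional gradient update: Pi(C_A A).
    return polar_projection(weighted_covariance(A, Y) @ A)

def step_32(A, Y):
    # Gradient descent update: Pi(C_A A S_A^{-1}), with S_A = A^T C_A A (assumed).
    C = weighted_covariance(A, Y)
    S = A.T @ C @ A
    return polar_projection(C @ A @ np.linalg.inv(S))

rng = np.random.default_rng(1)
d, K, N = 6, 2, 50
Y = rng.standard_normal((d, N))
A0, _ = np.linalg.qr(rng.standard_normal((d, K)))

B22 = step_22(A0, Y)
B32 = step_32(A0, Y)

# Same subspace <=> same orthogonal projector B B^T.
projector_gap = np.linalg.norm(B22 @ B22.T - B32 @ B32.T)
# The two Stiefel representatives differ by Q = B22^T B32, which should be orthogonal.
Q = B22.T @ B32
print(projector_gap, np.linalg.norm(Q.T @ Q - np.eye(K)))
```

The matrices `B22` and `B32` themselves generally differ. Only the projectors, i.e. the points on $G_{d,K}$ , agree.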

The following lemma quantifies the relation from Corollary 4.2 that $\{E(A^{(r)})\}_{r}$ is decreasing.

Lemma 4.6.

If $A^{(r)}\notin\mathcal{A}$ , then $A^{(r+1)}$ generated by (32) satisfies

[TABLE]

Proof.

In order to shorten notation, we denote $\tilde{A}=A^{(r+1)}$ and $A=A^{(r)}$ . For $x\geq 0,y>0$ it holds $x-y\leq\frac{x^{2}-y^{2}}{2y}$ so that

[TABLE]

Using $\|u-v\|^{2}-\|w-v\|^{2}=2\langle u-w,u-v\rangle-\|u-w\|^{2}$ , this can be rewritten as

[TABLE]

The first sum can be simplified to

[TABLE]

so that

[TABLE]

∎
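The elementary inequality used at the start of this proof follows from completing the square. For $x\geq 0$ and $y>0$ ,

$$
\frac{x^{2}-y^{2}}{2y}-(x-y)=\frac{x^{2}-2xy+y^{2}}{2y}=\frac{(x-y)^{2}}{2y}\geq 0,
$$

with equality if and only if $x=y$ . In the proof it is applied with $x=\|P^{\bot}_{\tilde{A}}y_{i}\|$ and $y=\|P^{\bot}_{A}y_{i}\|$ , and the requirement $y>0$ is exactly the assumption $A\notin\mathcal{A}$ .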

5 Convergence Analysis

So far Corollary 4.2 only ensures convergence of a subsequence of the iterates to a critical point under a restrictive assumption on the anchor directions. In this section, we prove global convergence of the whole sequence of iterates generated by algorithm (32) on the Stiefel manifold (and thereby on the Grassmannian) under mild assumptions which are summarized at the end of this section.

The important observation is that both $E$ and $F$ are semi-algebraic functions. Such functions are typical examples of so-called Kurdyka-Łojasiewicz (KL) functions [2, 19, 26]. A function $f\colon\mathbb{R}^{d}\to\mathbb{R}\cup\{+\infty\}$ with Fréchet limiting subdifferential $\partial f$ , see [32], is said to have the Kurdyka–Łojasiewicz (KL) property at $u^{*}\in\operatorname{dom}\partial f$ if there exist $\eta\in(0,+\infty)$ , a neighborhood $U$ of $u^{*}$ and a continuous concave function $\phi\colon[0,\eta)\to\mathbb{R}_{\geq 0}$ such that

1. $\phi(0)=0$ ,

2. $\phi$ is $C^{1}$ on $(0,\eta)$ ,

3. for all $s\in(0,\eta)$ it holds $\phi^{\prime}(s)>0$ ,

4. for all $u\in U\cap[f(u^{*})<f<f(u^{*})+\eta]$ , the Kurdyka–Łojasiewicz inequality $\phi^{\prime}(f(u)-f(u^{*})){\mathrm{d}}(0,\partial f(u))\geq 1$ holds true.

A proper, lower semi-continuous (lsc) function which satisfies the KL property at each point of $\operatorname{dom}\partial f$ is called KL-function. For KL-functions, the following theorem [3, Theorem 2.9] holds true.
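As a one-dimensional illustration (an example chosen here, not taken from the paper), consider $f(u)=u^{2}$ on $\mathbb{R}$ with $u^{*}=0$ , $U=\mathbb{R}$ , any $\eta>0$ and the desingularizing function $\phi(s)=\sqrt{s}$ . Then $\phi$ is continuous and concave with $\phi(0)=0$ , it is $C^{1}$ on $(0,\eta)$ with $\phi^{\prime}(s)=\frac{1}{2\sqrt{s}}>0$ , and since $f$ is smooth we have $\mathrm{d}(0,\partial f(u))=|f^{\prime}(u)|=2|u|$ . For every $u$ with $0<f(u)<\eta$ this gives

$$
\phi^{\prime}\bigl(f(u)-f(u^{*})\bigr)\,\mathrm{d}\bigl(0,\partial f(u)\bigr)=\frac{1}{2|u|}\cdot 2|u|=1\geq 1,
$$

so the KL inequality holds at $u^{*}=0$ , here with equality.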

Theorem 5.1.

Let $f\colon\mathbb{R}^{d}\to\mathbb{R}\cup\{\infty\}$ be a KL function. Let $\{u^{(r)}\}_{r\in\mathbb{N}}$ be a sequence which fulfills the following conditions:

C1) There exists $K_{1}>0$ such that $f(u^{(r+1)})-f(u^{(r)})\leq-K_{1}\|u^{(r+1)}-u^{(r)}\|^{2}$ for every $r\in\mathbb{N}$ .

C2) There exists $K_{2}>0$ such that for every $r\in\mathbb{N}$ there exists $w_{r+1}\in\partial f(u^{(r+1)})$ with $\|w_{r+1}\|\leq K_{2}\|u^{(r+1)}-u^{(r)}\|$ , where $\partial f$ denotes the Fréchet limiting subdifferential of $f$ [32].

C3) There exists a convergent subsequence $\{u^{(r_{j})}\}_{j\in\mathbb{N}}$ with limit $\hat{u}$ and $f(u^{(r_{j})})\to f(\hat{u})$ .

Then the whole sequence $\{u^{(r)}\}_{r\in\mathbb{N}}$ converges to $\hat{u}$ and $\hat{u}$ is a critical point of $f$ in the sense that $0\in\partial f(\hat{u})$ . Moreover the sequence has finite length, i.e.,

$$\sum_{r=0}^{\infty}\|u^{(r+1)}-u^{(r)}\|<\infty.$$

Similar arguments as used in the proof of Theorem 5.1 lead to the next corollary, see [3, Corollary 2.7].

Corollary 5.2.

Let $f\colon\mathbb{R}^{d}\to\mathbb{R}\cup\{+\infty\}$ be a KL function. Denote by $U$ , $\eta$ and $\phi$ the objects appearing in the definition of the KL function. Let $\delta,\rho>0$ be such that $B(u^{*},\delta)\subset U$ with $\rho\in(0,\delta)$ . Consider a finite sequence $u^{(r)}$ , $r=0,\dots,n$ , which satisfies the Conditions C1 and C2 of Theorem 5.1 and additionally

C4) $f(u^{*})\leq f(u^{(0)})<f(u^{*})+\eta$ ,

C5) $\|u^{*}-u^{(0)}\|+2\sqrt{\frac{f(u^{(0)})-f(u^{*})}{K_{1}}}+\frac{K_{2}}{K_{1}}\phi(f(u^{(0)})-f(u^{*}))\leq\rho$ .

If for all $r=0,\dots,n$ it holds

[TABLE]

then $u^{(r)}\in B(u^{*},\rho)$ for all $r=0,\dots,n+1$ .

We start our convergence analysis by showing property C1).

Lemma 5.3.

Assume that $A^{(r)}\notin\mathcal{A}$ for all $r\geq 1$ generated by (32). Then, there exists $K_{1}>0$ such that

[TABLE]

Proof.

In order to shorten notation, we denote $\tilde{A}=A^{(r+1)}$ and $A=A^{(r)}$ . By Lemma 4.6 and since $\|P^{\scriptscriptstyle\kern-1.0pt\bot\kern-1.0pt}_{A}y_{i}\|\leq\|y_{i}\|\leq\max_{i=1,\dots,N}\|y_{i}\|=\vcentcolon\frac{1}{2C}$ , we estimate

[TABLE]

Next, we want to estimate the sum on the right hand side. To this end, note that

[TABLE]

with some unit vector $y\in\mathcal{R}(Y)$ as $\mathcal{R}(A)\subseteq\mathcal{R}(Y)$ . The latter follows directly from the fact that the columns of $C_{A}$ are in $\mathcal{R}(Y)$ . We can choose a basis of $\mathcal{R}(Y)$ from the data points and w.l.o.g., $y=\sum_{i=1}^{N_{1}}\alpha_{i}y_{i}$ , where $N_{1}=\dim(\mathcal{R}(Y))$ . Then, setting $Y_{N_{1}}\coloneqq(y_{1}\,\ldots\,y_{N_{1}})$ , the coefficients can be estimated by $|\alpha_{i}|\leq\alpha^{*}\coloneqq\|(Y_{N_{1}}^{\mathrm{T}}Y_{N_{1}})^{-1}Y_{N_{1}}^{\mathrm{T}}\|_{\infty}$ for $i=1,\ldots,N_{1}$ . Setting $\alpha_{i}=0$ for all $i>N_{1}$ , we obtain

[TABLE]

Using the equivalence of Frobenius and spectral norm, (38) now results in the estimate

[TABLE]

It remains to show that $\|\tilde{A}\tilde{A}^{\mathrm{T}}-AA^{\mathrm{T}}\|^{2}\geq\|\tilde{A}-A\|^{2}$ . Since $A\in S_{d,K}$ , we get

[TABLE]

where

[TABLE]

All eigenvalues of $B$ are in $(0,1)$ . On the other hand, since $A^{\mathrm{T}}\tilde{A}=B^{\frac{1}{2}}$ , it holds

[TABLE]

Finally, this implies with the smallest eigenvalue $\lambda_{\mathrm{min}}\geq 1$ of the matrix $I_{K}+B^{\frac{1}{2}}$ that

[TABLE]

∎
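The sufficient decrease property C1) can be observed empirically by tracking the quotient of the energy decrease and the squared step length along the iterates. The following is a rough sketch on hypothetical random data, assuming that scheme (32) has the form $\Pi_{S_{d,K}}(C_{A}AS_{A}^{-1})$ with $S_{A}=A^{\mathrm{T}}C_{A}A$ . The observed quotients are only lower-bounded by the constant $K_{1}$ of the lemma and do not compute it.

```python
import numpy as np

def polar_projection(M):
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def residual_norms(A, Y):
    # ||P_A^perp y_i|| for every column y_i of Y.
    return np.linalg.norm(Y - A @ (A.T @ Y), axis=0)

def energy(A, Y):
    return residual_norms(A, Y).sum()

def step_32(A, Y):
    # Assumed form of scheme (32).
    C = (Y / residual_norms(A, Y)) @ Y.T
    S = A.T @ C @ A
    return polar_projection(C @ A @ np.linalg.inv(S))

rng = np.random.default_rng(1)
d, K, N = 6, 2, 80
Y = rng.standard_normal((d, N))
A, _ = np.linalg.qr(rng.standard_normal((d, K)))

ratios = []
for _ in range(5):
    A_next = step_32(A, Y)
    step_sq = np.linalg.norm(A_next - A) ** 2
    if step_sq > 1e-8:  # skip steps that are stationary up to rounding
        ratios.append((energy(A, Y) - energy(A_next, Y)) / step_sq)
    A = A_next
print(ratios)
```

Each recorded quotient is an empirical value of $\bigl(E(A^{(r)})-E(A^{(r+1)})\bigr)/\|A^{(r+1)}-A^{(r)}\|^{2}$ . The lemma asserts that these values are bounded below by a positive constant as long as no iterate lies in $\mathcal{A}$ .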

Corollary 5.4.

Assume that $A^{(r)}\notin\mathcal{A}$ for all $r\geq 1$ generated by (32). Then, it holds

[TABLE]

The set of accumulation points of $\{A^{(r)}\}_{r}$ is compact and connected in $S_{d,K}$ . Every accumulation point $\hat{A}$ which is not an anchor point is a critical point of $E$ , i.e. $\nabla_{\hat{A}}E(\hat{A})=0$ . The same statements hold true for the corresponding equivalence classes in $G_{d,K}$ .

Proof.

1. Since $\{E(A^{(r)})\}_{r}$ is decreasing and bounded below by zero, we know that $\lim_{r\to\infty}E(A^{(r)})=\hat{E}$ for some $\hat{E}\geq 0$ . Multiplying (37) by $-1$ , summing and taking the limit yields

[TABLE]

Consequently, the series on the right-hand side converges and $\lim_{r\to\infty}\|A^{(r+1)}-A^{(r)}\|=0$ . Using the estimate

[TABLE]

the statement also holds on $G_{d,K}$ .

2. By the theorem of Ostrowski [36, p. 173], it follows that the set of accumulation points of $\{A^{(r)}\}_{r}$ is compact and connected both in $S_{d,K}$ and $G_{d,K}$ .

3. Since the sequence $\{A^{(r)}\}_{r}$ is bounded, it has a convergent subsequence. Let $A^{(r_{j})}$ be a subsequence converging to a non-anchor point $\hat{A}$ and $T$ be the update operator in (32), i.e.,

[TABLE]

Then $T$ is continuous for $A^{(r)}\notin\mathcal{A}$ which we can see as follows: The projection $\Pi_{S_{d,K}}$ is continuous on all full rank matrices. For $A\in S_{d,K}$ ,

[TABLE]

has only eigenvalues larger than $1$ , so that the argument of the projection has full rank. Furthermore, $C_{A}$ (and therefore $S_{A}$ ) depends continuously on $A$ except for $A\in\mathcal{A}$ .

Using the continuity of $T$ outside ${\mathcal{A}}$ , we have $\lim_{j\to\infty}A^{(r_{j}+1)}=\lim_{j\to\infty}T(A^{(r_{j})})=T(\hat{A})$ . By continuity of $E$ , this implies

[TABLE]

By Corollary 4.2, $E$ is strictly decreasing except for $\hat{A}=T(\hat{A})$ in which case $\nabla_{\hat{A}}E(\hat{A})=0$ . ∎
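The full-rank claim in the continuity argument can be checked numerically. This is a small sketch under the assumption that the argument of the projection is $M\coloneqq C_{A}AS_{A}^{-1}$ with $S_{A}=A^{\mathrm{T}}C_{A}A$ . In that case $A^{\mathrm{T}}M=I_{K}$ , hence $M^{\mathrm{T}}M=I_{K}+(P^{\bot}_{A}M)^{\mathrm{T}}(P^{\bot}_{A}M)$ and all singular values of $M$ are at least one. The data are hypothetical.

```python
import numpy as np

def weighted_covariance(A, Y):
    # C_A = sum_i y_i y_i^T / ||P_A^perp y_i||
    norms = np.linalg.norm(Y - A @ (A.T @ Y), axis=0)
    return (Y / norms) @ Y.T

rng = np.random.default_rng(2)
d, K, N = 7, 3, 60
Y = rng.standard_normal((d, N))
A, _ = np.linalg.qr(rng.standard_normal((d, K)))

C = weighted_covariance(A, Y)
S = A.T @ C @ A                      # assumed definition of S_A
M = C @ A @ np.linalg.inv(S)         # assumed argument of the projection
singular_values = np.linalg.svd(M, compute_uv=False)
print(singular_values)
```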

Lemma 5.5.

Assume that the elements of the sequence $\{A^{(r)}\}_{r}$ are generated by (32) and the accumulation points do not belong to the anchor set $\mathcal{A}$ . Then the sequence fulfills C2) in Theorem 5.1, i.e., there exists $K_{2}>0$ such that

[TABLE]

Proof.

By the assumptions and Corollary 5.4 the set of accumulation points has a positive distance $\varepsilon$ from $\mathcal{A}$ . Since $\lim_{r\to\infty}\|A^{(r+1)}-A^{(r)}\|=0$ , we have for $r$ large enough that all line segments $\overline{A^{(r)}A^{(r+1)}}$ connecting $A^{(r)}$ and $A^{(r+1)}$ are in a compact set $\Omega\coloneqq\overline{B(0,R)}\setminus{\mathcal{A}}_{\varepsilon/2}$ for some $R>0$ . Here $B(0,R)$ denotes the ball centered at $0$ with radius $R$ with respect to the Frobenius norm. Further, $E$ is smooth on an open set containing $\Omega$ so that the mean value theorem implies

[TABLE]

Hence, we can estimate

[TABLE]

Now, (39) implies

[TABLE]

For the second term, we can use the series expansion of the square root given by

[TABLE]

Plugging this in yields

[TABLE]

From Corollary 5.4 we know that at each accumulation point which is not an anchor point the gradient of $E$ is zero, so that $\lim_{r\to\infty}\|\nabla E(A^{(r)})\|=0$ . Furthermore, both $\lambda_{\mathrm{min}}\bigl{(}S_{A^{(r)}}^{-1}\bigr{)}$ and $\lambda_{\mathrm{max}}\bigl{(}S_{A^{(r)}}^{-1}\bigr{)}$ depend continuously on $A^{(r)}$ on the compact set $\Omega$ , so that they can be bounded by their positive minima and maxima on $\Omega$ , respectively. Consequently, $\lim_{r\to\infty}\bigl{\|}\nabla E(A^{(r)})\bigr{\|}^{2}\lambda_{\mathrm{max}}\bigl{(}S_{A^{(r)}}^{-1}\bigr{)}^{4}=0$ and we get

[TABLE]

for some $\tilde{C}>0$ and $r$ large enough. Plugging this into (45), we get for $r$ large enough

[TABLE]

∎

Theorem 5.6.

Assume that the set of iterates $\{A^{(r)}\}_{r}$ generated by (32) is infinite and fulfills $A^{(r)}\notin\mathcal{A}$ for all $r\geq 0$ . Suppose that there is an accumulation point which is not in $\mathcal{A}$ . Then the whole sequence $\{A^{(r)}\}_{r}$ converges to a critical point.

Proof.

We distinguish two cases.

1. If all accumulation points are non-anchor points, then the assertion follows by Theorem 5.1 together with Lemma 5.3, Corollary 5.4 and Lemma 5.5.

2. If the set of accumulation points consists of both anchor and non-anchor points, we will show convergence to a non-anchor point by applying Corollary 5.2. Let $\hat{A}$ be an accumulation point which is not in the anchor set $\mathcal{A}$ , i.e., $E_{i}(\hat{A})=\|P^{\scriptscriptstyle\kern-1.0pt\bot\kern-1.0pt}_{\hat{A}}y_{i}\|\neq 0$ for all $i=1,\ldots,N$ . Due to the continuity of $E_{i}$ we can find a ball $B(\hat{A},R)$ around $\hat{A}$ which has positive distance to all anchor points. Next, for all the iterates $A^{(r)}\in B(\hat{A},R/2)$ and $r$ large enough, C1) and C2) are fulfilled by Lemma 5.3 and Lemma 5.5. By the continuity of $E$ and $\phi$ , see also the proof of [3, Theorem 2.9], we can choose a ball $B(\hat{A},\delta)\subset B(\hat{A},R/2)\cap U$ (where $U$ is from the definition of the KL property), $\rho\in(0,\delta)$ and a starting iterate $A^{(r_{0})}\in B(\hat{A},\rho)$ which satisfies C4) and C5) from Corollary 5.2. Since $\lim_{r\to\infty}\|A^{(r+1)}-A^{(r)}\|=0$ and $\hat{A}$ is an accumulation point, we can choose $r_{0}$ such that

[TABLE]

for all $r\geq r_{0}$ . Either all iterates after $A^{(r_{0})}$ are in $B(\hat{A},\rho)$ or there is a finite sequence $A^{(r_{0})},A^{(r_{0}+1)},\ldots,A^{(r_{n})}$ such that $A^{(r_{n}+1)}$ is the first element outside $B(\hat{A},\rho)$ . But then, by Corollary 5.2, also the iterate $A^{(r_{n}+1)}$ is inside $B(\hat{A},\rho)$ and hence all iterates stay in $B(\hat{A},\rho)$ , which is a contradiction. As $\rho$ can be chosen arbitrarily small, the whole sequence converges to the non-anchor point $\hat{A}$ .

∎
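The finite length property behind Theorem 5.6 can be made visible by accumulating the step lengths $\|A^{(r+1)}-A^{(r)}\|$ over a run. The sketch below is only an illustration under stated assumptions: scheme (32) is taken to be $\Pi_{S_{d,K}}(C_{A}AS_{A}^{-1})$ with $S_{A}=A^{\mathrm{T}}C_{A}A$ , and the data set (noisy points near a $K$-dimensional subspace plus a few outliers) is synthetic. It does not verify the theorem.

```python
import numpy as np

def polar_projection(M):
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def residual_norms(A, Y):
    return np.linalg.norm(Y - A @ (A.T @ Y), axis=0)

def energy(A, Y):
    return residual_norms(A, Y).sum()

def step_32(A, Y):
    # Assumed form of scheme (32).
    C = (Y / residual_norms(A, Y)) @ Y.T
    S = A.T @ C @ A
    return polar_projection(C @ A @ np.linalg.inv(S))

def run(A, Y, n_iter=200):
    # Returns the last iterate, the energy history and the accumulated path length.
    energies = [energy(A, Y)]
    path_length = 0.0
    for _ in range(n_iter):
        A_next = step_32(A, Y)
        path_length += np.linalg.norm(A_next - A)
        A = A_next
        energies.append(energy(A, Y))
    return A, np.array(energies), path_length

rng = np.random.default_rng(4)
d, K, N, n_out = 8, 2, 120, 20
basis, _ = np.linalg.qr(rng.standard_normal((d, K)))
inliers = basis @ rng.standard_normal((K, N - n_out)) \
    + 0.05 * rng.standard_normal((d, N - n_out))
outliers = 3.0 * rng.standard_normal((d, n_out))
Y = np.hstack([inliers, outliers])

A0, _ = np.linalg.qr(rng.standard_normal((d, K)))
A_final, energies, path_length = run(A0, Y)
print(energies[0], energies[-1], path_length)
```

If the accumulated path length stabilizes as `n_iter` grows, this is consistent with convergence of the whole sequence on the Stiefel manifold rather than only of a subsequence.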

We can summarize our results under the condition that no iterate is in the anchor set as follows: If the sequence generated by (32) has at least one accumulation point where $E$ is differentiable, i.e., at least one accumulation point is not an anchor point, then it converges to that point on the Stiefel manifold and it is a critical point of $E$ . In this case, iteration (17) by Ding et al. converges on the Grassmannian as it coincides with (32) there. In particular, this implies local convergence near differentiable local minimizers of $E$ which are isolated on the Grassmannian. Only in the case that all accumulation points are anchor points, which means that they form a connected component and all have the same function value, we cannot prove convergence of the whole sequence on the Stiefel or Grassmannian manifold. In the next section we give partial results for the cases which were not fully treated up to now. We investigate an optimality condition for anchor points and a descent step in non-optimal anchor points.

6 Investigation of Anchor Points

While a local minimizer of $F$ (and $E$ ) on $S_{d,K}\cap\mathrm{int}(\mathcal{D})$ is characterized by the Riemannian gradient being zero, this is not possible for minimizers lying in the anchor set $\mathcal{A}$ , since $E$ is not differentiable and the subdifferential of $-F$ is empty there.

To formulate optimality conditions for matrices in the anchor set, we use the definition of one-sided directional derivatives. The one-sided directional derivative of a function $f\colon\mathbb{R}^{n}\rightarrow\mathbb{R}$ , $n\in\mathbb{N}$ , at a point $x\in\mathbb{R}^{n}$ in direction $h\in\mathbb{R}^{n}$ is defined by

$$\operatorname{D\!}f(x;h)\coloneqq\lim_{t\downarrow 0}\frac{f(x+th)-f(x)}{t}.$$

The following theorem gives necessary and sufficient conditions for (strict) local minimizers of (locally Lipschitz) continuous functions on $\mathbb{R}^{n}$ using the notion of one-sided derivatives, see [5, Theorems 2.1 & 3.1].

Theorem 6.1.

Let $f\colon\mathbb{R}^{n}\rightarrow\mathbb{R}$ be a function which is one-sided differentiable. Then the following holds true:

1. If $\hat{x}\in\mathbb{R}^{n}$ is a local minimizer of $f$ on $\mathbb{R}^{n}$ , then $\operatorname{D\!}f(\hat{x};h)\geq 0$ for all $h\in\mathbb{R}^{n}$ .

2. If $\operatorname{D\!}f(\hat{x};h)>0$ for all $h\in\mathbb{R}^{n}\setminus\{0\}$ and $f$ is locally Lipschitz continuous, then $\hat{x}$ is a strict local minimizer of $f$ on $\mathbb{R}^{n}$ .

Note that $\operatorname{D\!}f(\hat{x};h)\geq 0$ for all $h\in\mathbb{R}^{n}\setminus\{0\}$ does not imply that $\hat{x}$ is a local minimizer of $f$ on $\mathbb{R}^{n}$ . The authors of [5] gave an example that Lipschitz continuity in the second part cannot be weakened to just continuity.
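Two one-dimensional examples (chosen here for illustration, not taken from [5]) show how the two parts of Theorem 6.1 are used. For $f(x)=|x|$ and $\hat{x}=0$ , the one-sided derivative is

$$
\operatorname{D\!}f(0;h)=\lim_{t\downarrow 0}\frac{|th|-0}{t}=|h|>0\quad\text{for all }h\neq 0,
$$

and $f$ is Lipschitz continuous, so the second part yields that $0$ is a strict local minimizer although $f$ is not differentiable there. In contrast, for $f(x)=-x^{2}$ and $\hat{x}=0$ ,

$$
\operatorname{D\!}f(0;h)=\lim_{t\downarrow 0}\frac{-t^{2}h^{2}}{t}=0\geq 0\quad\text{for all }h,
$$

but $0$ is a strict local maximizer, which illustrates that the non-strict inequality is only necessary and not sufficient.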

The theorem can be applied for complete Riemannian manifolds $\mathcal{M}$ in the following way. For a function $f\colon\mathcal{M}\to\mathbb{R}$ , the point $\hat{x}\in\mathcal{M}$ is a local minimizer if and only if $0_{\hat{x}}$ is a local minimizer of $\tilde{f}_{\hat{x}}\colon T_{\hat{x}}\mathcal{M}\to\mathbb{R}$ with $\tilde{f}_{\hat{x}}=f\circ\exp_{\hat{x}}$ , where $\exp_{\hat{x}}\colon T_{\hat{x}}\mathcal{M}\to\mathcal{M}$ denotes the exponential map at $\hat{x}$ . The exponential map satisfies $\exp_{\hat{x}}(0_{\hat{x}})=\hat{x}$ and $\mathrm{d}\exp_{\hat{x}}(0_{\hat{x}})=\mathrm{Id}$ , where $\mathrm{d}F$ denotes the differential of a smooth mapping $F$ between manifolds, see [1, Section 5.4]. Hence, local minimizers can be checked with the following directional derivative on manifolds

[TABLE]

Now, we want to apply Theorem 6.1 for the locally Lipschitz continuous energy function $E$ . To this end, the norm on $\mathbb{R}^{n,m}$ is defined by

[TABLE]

Then, it is easy to check that the dual norm is given by

[TABLE]

Further, note that for $H\in T_{A}S_{d,K}$ the exponential map $\exp_{A}$ satisfies

[TABLE]

First, the one-sided derivative of $E$ at $A\in S_{d,K}$ in direction $H\in T_{A}S_{d,K}$ is computed.

Lemma 6.2.

The one-sided derivative of $E$ defined in (7) on $S_{d,K}$ reads for $A\in S_{d,K}$ and $H=AX+A_{\perp}Z\in T_{A}S_{d,K}$ as follows

[TABLE]

where $\mathcal{K}\coloneqq\{k\in\{1,\ldots,N\}\mathrel{\mathop{\mathchar 58\relax}}\|P^{\scriptscriptstyle\kern-1.0pt\bot\kern-1.0pt}_{A}y_{k}\|=0\}$ , $Y_{\mathcal{K}}\coloneqq(y_{k})_{k\in\mathcal{K}}$ and

[TABLE]

Proof.

First, we consider $k\in\mathcal{K}$ , i.e. $P^{\scriptscriptstyle\kern-1.0pt\bot\kern-1.0pt}_{A}y_{k}=0$ and $y_{k}=AA^{\mathrm{T}}y_{k}$ . Then, we obtain for $A\in S_{d,K}$ and $H\in T_{A}S_{d,K}$ that

[TABLE]

and with $g(B)\coloneqq(I_{d}-BB^{\mathrm{T}})y_{k}$ further by the chain rule and (49),

[TABLE]

Since $H=AX+A_{\perp}Z$ for some $X^{\mathrm{T}}=-X$ and $y_{k}\in\mathcal{R}(A)$ , this implies further

[TABLE]

For $k\not\in\mathcal{K}$ , the one-sided derivative in direction $H$ is the inner product of $\nabla E_{k}$ and $H$ so that

[TABLE]

and using the structure of $H$ again, this implies

[TABLE]

∎
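The anchor part of this computation can be checked with a finite difference. Following the steps of the proof, for $y_{k}\in\mathcal{R}(A)$ and $H=AX+A_{\perp}Z$ the terms involving $X$ cancel, which suggests the one-sided derivative $\|ZA^{\mathrm{T}}y_{k}\|$ for the summand $E_{k}$ . This value is derived here from the proof and is not copied from the displayed formula. The sketch below compares it with a difference quotient. As a simplification it uses the polar projection as retraction in place of the exponential map (both agree to first order at $t=0$ ), and it takes $X=0$ . All data are hypothetical.

```python
import numpy as np

def polar_projection(M):
    # Polar retraction onto the Stiefel manifold (stand-in for exp_A).
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def residual_norm(B, y):
    # E_k(B) = ||P_B^perp y|| for a single data point y.
    return np.linalg.norm(y - B @ (B.T @ y))

rng = np.random.default_rng(3)
d, K = 5, 2
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A, A_perp = Q[:, :K], Q[:, K:]       # A in S_{d,K} and an orthonormal complement

y = A @ rng.standard_normal(K)       # y lies in R(A): A is an anchor point for y
Z = rng.standard_normal((d - K, K))
H = A_perp @ Z                       # tangent direction with X = 0

t = 1e-6
quotient = (residual_norm(polar_projection(A + t * H), y) - residual_norm(A, y)) / t
predicted = np.linalg.norm(Z @ (A.T @ y))
print(quotient, predicted)
```

The positive value of the quotient reflects the kink of $E_{k}$ at anchor points: the summand grows linearly in every direction $Z$ with $ZA^{\mathrm{T}}y_{k}\neq 0$ .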

Under certain conditions, it is possible to formulate a minimality condition also for matrices in the anchor set.

Theorem 6.3.

Let $y_{i}\in\mathbb{R}^{d}$ , $i=1,\ldots,N$ and $A\in S_{d,K}$ , where $K\leq d$ . Let $\mathcal{K}\coloneqq\{k\in\{1,\ldots,N\}\mathrel{\mathop{\mathchar 58\relax}}\|P^{\scriptscriptstyle\kern-1.0pt\bot\kern-1.0pt}_{A}y_{k}\|=0\}$ have cardinality $\kappa\geq 1$ . Assume that the matrix $Y_{\mathcal{K}}\coloneqq(y_{k})_{k\in\mathcal{K}}\in\mathbb{R}^{d,\kappa}$ has rank $m\leq K$ , where the columns are ordered so that the first $m$ are linearly independent and denoted by $Y_{m}$ and the other ones are their multiples, i.e., $Y_{\mathcal{K}}=(Y_{m}\,|\,Y_{m}D)$ , where $D\in\mathbb{R}^{m,\kappa-m}$ is a matrix whose columns contain exactly one nonzero entry. Then $A\in S_{d,K}$ is a strict local minimizer of $E$ on $S_{d,K}$ if and only if the following two conditions are fulfilled

[TABLE]

where the absolute value of $D$ has to be understood componentwise, $1_{\kappa-m}$ denotes the vector with $\kappa-m$ entries one, and $(A^{\mathrm{T}}Y_{m})_{\perp}\in\mathbb{R}^{K,K-m}$ is any matrix of rank $K-m$ whose columns are orthogonal to those of $A^{\mathrm{T}}Y_{m}\in\mathbb{R}^{K,m}$ .

If $Y_{\mathcal{K}}$ contains only vectors which are multiples of $y_{1}\in\mathbb{R}^{d}$ , then $A\in S_{d,K}$ is a strict local minimizer of $E$ on $S_{d,K}$ if and only if the following conditions are fulfilled

[TABLE]

Proof.

By Theorem 6.1 and (50), $A$ is a strict local minimizer of $E$ on $S_{d,K}$ if and only if

[TABLE]

for all $Z\in\mathbb{R}^{d-K,K}$ . Replacing $Z$ by $Z\begin{pmatrix}Y_{m}^{\mathrm{T}}A\\ (Y_{m}^{\mathrm{T}}A)_{\perp}\end{pmatrix}$ , where $(Y_{m}^{\mathrm{T}}A)_{\perp}=(A^{\mathrm{T}}Y_{m})_{\perp}^{\mathrm{T}}$ , this is equivalent to

[TABLE]

for all $Z\in\mathbb{R}^{d-K,K}$ . Clearly, the condition is fulfilled if and only if

[TABLE]

for all $Z_{m}\in\mathbb{R}^{d-K,m}$ . The second part implies that the columns of $C_{A,\mathcal{K}}A(A^{\mathrm{T}}Y_{m})_{\perp}$ are in the range of $A$ which gives the second condition in (51).

Now, $\|P^{\scriptscriptstyle\kern-1.0pt\bot\kern-1.0pt}_{A}y_{k}\|=0$ implies $AA^{\mathrm{T}}y_{k}=y_{k}$ , $k\in\mathcal{K}$ so that the first condition becomes

[TABLE]

Using $Y_{\mathcal{K}}=(Y_{m}\,|\,Y_{m}D)$ and the definition of $\|\cdot\|_{2,1}$ , the right-hand side can be rewritten as

[TABLE]

so that the condition can be rewritten as

[TABLE]

for all $Z_{m}\in\mathbb{R}^{d-K,m}$ . By (48) this is fulfilled if and only if

[TABLE]

Using $\|P^{\scriptscriptstyle\kern-1.0pt\bot\kern-1.0pt}_{A}y\|=\|A_{\perp}^{\mathrm{T}}y\|$ for all $y\in\mathbb{R}^{d}$ , this gives the assertion (51).

Assume that the columns of $Y_{\mathcal{K}}$ are multiples of $y_{1}$ . Then the first condition of the simplification (52) follows immediately from

[TABLE]

Since for every $x\in\mathbb{R}^{K}$ , $x\not=0$ the columns of $I_{K}-xx^{\mathrm{T}}/\|x\|^{2}$ span the linear space orthogonal to $x$ , the second condition can be deduced using $P^{\scriptscriptstyle\kern-1.0pt\bot\kern-1.0pt}_{A}\,C_{A,\mathcal{K}}\,A\left(I_{K}-A^{\mathrm{T}}y_{1}y_{1}^{\mathrm{T}}A/\|y_{1}\|^{2}\right)=0_{d,K}$ and $AA^{\mathrm{T}}y_{1}=y_{1}$ . ∎

Remark 6.4.

For more general cases than those considered in Theorem 6.3 there is no simple optimality condition for an anchor point $A$ to be a local minimizer, since basically all possible descent directions $H$ need to be checked in (50). However, if

[TABLE]

then $H\coloneqq P^{\scriptscriptstyle\kern-1.0pt\bot\kern-1.0pt}_{A}C_{A,\mathcal{K}}A$ is a descent direction since

[TABLE]

Hence, we can use a line search method to find a next iterate in these anchor points. Note that in comparison to the gradient condition used in [30, Theorem 4], ours can be easily verified numerically.

Acknowledgments

Part of this research was performed while the author was visiting the Institute for Pure and Applied Mathematics (IPAM), which is supported by the National Science Foundation. Funding by the German Research Foundation (DFG) within the Research Training Group 1932, project area P3, is gratefully acknowledged. We thank the anonymous reviewer for pointing to the retraction approach in Section 6.

Bibliography

[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton and Oxford, Princeton University Press, 2008.
[2] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
[3] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1-2, Ser. A):91–129, 2013.
[4] A. Beck and S. Sabach. Weiszfeld’s method: old and new results. Journal of Optimization Theory and Applications, 164(1):1–40, 2015.
[5] A. Ben-Tal and J. Zowe. Directional derivatives in nonsmooth optimization. Journal of Optimization Theory and Applications, 47(4):483–490, 1985.
[6] E. J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.
[7] E. Chouzenoux, J. Idier, and S. Moussaoui. A majorize-minimize strategy for subspace optimization applied to image restoration. IEEE Transactions on Image Processing, 20:1517–1528, 2011.
[8] C. Ding, D. Zhou, X. He, and H. Zha. $R_{1}$-PCA: rotational invariant $L_{1}$-norm principal component analysis for robust subspace factorization. In Proceedings of the 23rd international conference on Machine learning, pages 281–288. ACM, 2006.
