On the Robust PCA and Weiszfeld's Algorithm

Sebastian Neumayer; Max Nimmer; Simon Setzer; Gabriele Steidl

arXiv:1902.04292 · math.NA · February 13, 2019


Open Access

TL;DR

This paper introduces a robust PCA method based on minimizing Euclidean distances to data points using a Weiszfeld-like algorithm, effectively handling outliers and ensuring convergence to critical points.

Contribution

It develops a novel Weiszfeld-like algorithm for robust PCA that carefully manages anchor directions and proves its global convergence.

Findings

1. Algorithm demonstrates excellent performance in numerical tests.

2. Handles anchor directions with careful mathematical treatment.

3. Proven convergence to critical points under the Kurdyka–Łojasiewicz property.

Abstract

Principal component analysis (PCA) is a powerful standard tool for reducing the dimensionality of data. Unfortunately, it is sensitive to outliers so that various robust PCA variants were proposed in the literature. This paper addresses the robust PCA by successively determining the directions of lines having minimal Euclidean distances from the data points. The corresponding energy functional is not differentiable at a finite number of directions which we call anchor directions. We derive a Weiszfeld-like algorithm for minimizing the energy functional which has several advantages over existing algorithms. Special attention is paid to the careful handling of the anchor directions, where we take the relation between local minima and one-sided derivatives of Lipschitz continuous functions on submanifolds of $R^{d}$ into account. Using ideas for stabilizing the classical Weiszfeld…


Equations227

(\hat{A},\hat{b})\in\operatorname*{arg\,min}_{A\in\mathbb{R}^{d,K},\,b\in\mathbb{R}^{d}}\sum_{i=1}^{N}\min_{t\in\mathbb{R}^{K}}\|At+b-x_{i}\|^{2}.

\hat{A}\in\operatorname*{arg\,min}_{A\in\mathbb{S}_{d,K}}\sum_{i=1}^{N}\|P_{A}y_{i}\|^{2},

\hat{a}_{k+1}

\hat{a}_{k+1}

\operatorname*{arg\,min}_{L,S}\|L\|_{*}+\lambda\|S\|_{1}\quad\mbox{subject to}\quad Y=L+S,

(\hat{A},\hat{b})\in\operatorname*{arg\,min}_{A\in\mathbb{R}^{d,K},\,b\in\mathbb{R}^{d}}\sum_{i=1}^{N}\min_{t\in\mathbb{R}^{K}}\|At+b-x_{i}\|.

\hat{A}\in\operatorname*{arg\,min}_{A\in\mathbb{S}_{d,K}}\sum_{i=1}^{N}\|P_{A}y_{i}\|.

\hat{a}_{k+1}

\hat{a}_{k+1}

\hat{x}:=\operatorname*{arg\,min}_{x}\mathcal{E}(x):=\operatorname*{arg\,min}_{x}\sum_{i=1}^{N}\|x-x_{i}\|.

0\in\partial\mathcal{E}(\hat{x})=\left\{\begin{array}{ll}\nabla\mathcal{E}(\hat{x})=\sum\limits_{i=1}^{N}\frac{\hat{x}-x_{i}}{\|\hat{x}-x_{i}\|}&\mbox{if}\;\hat{x}\not\in\mathcal{A},\\[4pt] \sum\limits_{{i=1\atop x_{i}\not=\hat{x}}}^{N}\frac{\hat{x}-x_{i}}{\|\hat{x}-x_{i}\|}+\overline{B_{1}(0)}&\mbox{if}\;\hat{x}\in\mathcal{A}.\end{array}\right.

\hat{x}=\hat{x}-\Big(\sum_{i=1}^{N}\frac{1}{\|\hat{x}-x_{i}\|}\Big)^{-1}\sum_{i=1}^{N}\frac{\hat{x}-x_{i}}{\|\hat{x}-x_{i}\|},

\Big\|\sum_{{i=1\atop x_{i}\not=\hat{x}}}^{N}\frac{\hat{x}-x_{i}}{\|\hat{x}-x_{i}\|}\Big\|\leq 1.

x^{(r+1)}=x^{(r)}-\underbrace{\Big(\sum_{i=1}^{N}\frac{1}{\|x^{(r)}-x_{i}\|}\Big)^{-1}}_{s_{r}^{-1}}\underbrace{\sum_{i=1}^{N}\frac{x^{(r)}-x_{i}}{\|x^{(r)}-x_{i}\|}}_{\nabla\mathcal{E}(x^{(r)})}.

x^{(r+1)}:=x^{(r)}-\Big(\sum\limits_{i=1\atop i\not=k}^{N}\frac{1}{\|x_{k}-x_{i}\|}\Big)^{-1}\left(1-\frac{1}{\|G_{k}\|}\right)G_{k}

\operatorname*{arg\,min}_{\|a\|=1}\sum_{i=1}^{N}\varphi\big(\|P_{a}z_{i}\|^{2}\big)\;\in\;\mathrm{span}\{z_{i}:i=1,\ldots,N\}

a=\frac{\tilde{a}+\tilde{a}_{\perp}}{\|\tilde{a}+\tilde{a}_{\perp}\|},

\|P_{a}z\|^{2}=\|z\|^{2}-\langle a,z\rangle^{2}=\|z\|^{2}-\frac{\langle\tilde{a},z\rangle^{2}}{\|\tilde{a}\|^{2}+\|\tilde{a}_{\perp}\|^{2}}\geq\|z\|^{2}-\frac{\langle\tilde{a},z\rangle^{2}}{\|\tilde{a}\|^{2}}

E(a):=\sum_{i=1}^{N}E_{i}(a)=\sum_{i=1}^{N}\|P_{a}y_{i}\|.

\mathcal{A}:=\Big\{\pm\frac{y_{i}}{\|y_{i}\|}:i=1,\ldots,N\Big\}

\nabla E(a)=-P_{a}C_{a}a,\qquad C_{a}:=\sum_{i=1}^{N}\frac{1}{\|P_{a}y_{i}\|}y_{i}y_{i}^{\mathrm{T}},

\|G_{a,\mathcal{K}}\|<\sum_{k\in\mathcal{K}}\|y_{k}\|,

G_{a,\mathcal{K}}:=P_{a}C_{a,\mathcal{K}}a,\qquad C_{a,\mathcal{K}}:=\sum_{i\not\in\mathcal{K}}\frac{1}{\|P_{a}y_{i}\|}y_{i}y_{i}^{\mathrm{T}}.

|E_{i}(a_{1})-E_{i}(a_{2})|\leq\|a_{1}a_{1}^{\mathrm{T}}-a_{2}a_{2}^{\mathrm{T}}\|_{F}\|y_{i}\|=\frac{1}{2}\|(a_{1}-a_{2})(a_{1}^{\mathrm{T}}+a_{2}^{\mathrm{T}})+(a_{1}+a_{2})(a_{1}^{\mathrm{T}}-a_{2}^{\mathrm{T}})\|_{F}\|y_{i}\|\leq 2(\|a\|+\varepsilon)\|y_{i}\|\|(a_{1}-a_{2})\|_{F}.

\nabla E_{i}(a)

DE_{k}(a;h)=\lim_{\alpha\downarrow 0}\frac{\|\alpha ah^{\mathrm{T}}y_{k}+\alpha ha^{\mathrm{T}}y_{k}+\alpha^{2}hh^{\mathrm{T}}y_{k}\|}{\alpha}=\|(ah^{\mathrm{T}}+ha^{\mathrm{T}})y_{k}\|=\|ha^{\mathrm{T}}y_{k}\|=\|h\|\|y_{k}\|.

DE_{i}(a;h)


Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Advanced Statistical Methods and Models · Face and Expression Recognition

Full text


On the Robust PCA and Weiszfeld’s Algorithm

Sebastian Neumayer (1), Max Nimmer (1), Simon Setzer (3), Gabriele Steidl (1, 2)

(1) Department of Mathematics, Technische Universität Kaiserslautern, Paul-Ehrlich-Str. 31, D-67663 Kaiserslautern, Germany, {nimmer,steidl}@mathematik.uni-kl.de.

(2) Fraunhofer ITWM, Fraunhofer-Platz 1, D-67663 Kaiserslautern, Germany

(3) Engineers Gate, London, United Kingdom

Abstract

The principal component analysis (PCA) is a powerful standard tool for reducing the dimensionality of data. Unfortunately, it is sensitive to outliers so that various robust PCA variants were proposed in the literature. This paper addresses the robust PCA by successively determining the directions of lines having minimal Euclidean distances from the data points. The corresponding energy functional is not differentiable at a finite number of directions which we call anchor directions. We derive a Weiszfeld-like algorithm for minimizing the energy functional which has several advantages over existing algorithms. Special attention is paid to the careful handling of the anchor directions, where we take the relation between local minima and one-sided derivatives of Lipschitz continuous functions on submanifolds of $\mathbb{R}^{d}$ into account. Using ideas for stabilizing the classical Weiszfeld algorithm at anchor points and the Kurdyka–Łojasiewicz property of the energy functional, we prove global convergence of the whole sequence of iterates generated by the algorithm to a critical point of the energy functional. Numerical examples demonstrate the very good performance of our algorithm.

1 Introduction

Principal component analysis (PCA) [41] is an important tool for dimensionality reduction of data which is often applied as a pre-processing step, e.g., for classification or segmentation. The procedure provides dimensionality reduction by projecting the data onto a linear subspace maximizing the variance of the projection or, equivalently, minimizing the squared Euclidean distance error to the subspace. More precisely, let $N\geq d$ data points $x_{1},\ldots,x_{N}\in\mathbb{R}^{d}$ be given. By $\|\cdot\|$ we denote the Euclidean norm and by $I_{d}$ the $d\times d$ identity matrix. PCA finds a $K$-dimensional affine subspace $\{\hat{A}\,t+\hat{b}:t\in\mathbb{R}^{K}\}$, $1\leq K\leq d$, having smallest squared Euclidean distance from the data:

$$(\hat{A},\hat{b})\in\operatorname*{arg\,min}_{A\in\mathbb{R}^{d,K},\,b\in\mathbb{R}^{d}}\sum_{i=1}^{N}\min_{t\in\mathbb{R}^{K}}\|At+b-x_{i}\|^{2}.\tag{1}$$

While $\hat{A}$ and $\hat{b}$ in the above minimization problem are not unique, the affine subspace itself is uniquely determined if the empirical covariance matrix only has eigenvalues of multiplicity one, and it goes through the offset (bias) $\bar{b}:=\frac{1}{N}(x_{1}+\ldots+x_{N})$. Therefore we can restrict our attention to data points $y_{i}:=x_{i}-\bar{b}$, $i=1,\ldots,N$, and subspaces through the origin minimizing the squared Euclidean distances to the $y_{i}$, $i=1,\ldots,N$. Further, setting the gradient of the inner function in (1) with respect to $t\in\mathbb{R}^{K}$ to zero and adding the constraint that $A$ lies in the Stiefel manifold $\mathbb{S}_{d,K}=\{A\in\mathbb{R}^{d,K}:\ A^{\mathrm{T}}A=I_{K}\}$ to eliminate some redundancies, the problem reduces to

$$\hat{A}\in\operatorname*{arg\,min}_{A\in\mathbb{S}_{d,K}}\sum_{i=1}^{N}\|P_{A}y_{i}\|^{2},\tag{2}$$

where $P_{A}:=I_{d}-AA^{\mathrm{T}}$ denotes the orthogonal projection onto $\mathcal{R}(A)^{\perp}=\mathcal{N}(A^{\mathrm{T}})$ .
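The classical problem (1) and (2) can be sketched numerically as follows. This is a minimal illustration, assuming NumPy and the standard fact that the optimal subspace is spanned by the leading right singular vectors of the centered data; the function names `pca_subspace` and `squared_residual` are hypothetical and do not come from the paper.

```python
import numpy as np

def pca_subspace(X, K):
    """Classical PCA: mean offset and an orthonormal basis A (d x K) of the best-fit subspace."""
    X = np.asarray(X, dtype=float)
    b_bar = X.mean(axis=0)                 # offset: the mean of the data points
    Y = X - b_bar                          # centered data, rows are y_i
    # Right singular vectors of Y are eigenvectors of the empirical covariance matrix.
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    A = Vt[:K].T                           # columns span the K-dimensional subspace
    return b_bar, A

def squared_residual(Y, A):
    """Sum of squared distances sum_i ||P_A y_i||^2 with P_A = I - A A^T."""
    P = np.eye(A.shape[0]) - A @ A.T
    return float(np.sum((Y @ P) ** 2))     # P is symmetric, so row i of Y @ P is (P_A y_i)^T
```

For centered data `Y`, the value `squared_residual(Y, A)` equals the sum of the squared trailing singular values of `Y`, which is the quantity the squared model minimizes.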

One of the most important properties of PCA is the nestedness of the PCA subspaces, i.e., for $K<\tilde{K}\leq d$ , the optimal $K$ -dimensional PCA subspace is contained in the $\tilde{K}$ -dimensional one. In particular, the directions forming the columns of $\hat{A}=(\hat{a}_{1}\;\ldots\;\hat{a}_{K})$ can be found successively by computing for $k=0,1,\ldots,K-1$ ,

[TABLE]

or equivalently

[TABLE]

where $\hat{A}_{k}:=(\hat{a}_{1}\;\hat{a}_{2}\;\ldots\;\hat{a}_{k})$, $k=1,2,\ldots,K-1$, and $P_{\hat{A}_{0}}=I_{d}$, see, e.g., [13]. The first problem (3) focuses on the minimization of the residual, while the second one (4) underlines the maximization of the variance in the PCA direction.

Unfortunately, PCA is sensitive to outliers in the data, see Fig. 1. One possibility to circumvent the problem is to remove outliers before computing the principal components. However, in some contexts, outliers are difficult to identify and other data points are incorrectly given outlier status forcing a large number of deletions before a reliable estimate can be found.

Therefore, quite different methods were proposed in the literature to make PCA robust, in particular in robust statistics, see the books [16, 33, 28]. One approach consists in assigning different weights to data points based on their estimated relevance, to get a weighted PCA, see, e.g. [21]. The RANSAC algorithm [9] repeatedly estimates the model parameters from a random subset of the data points until a satisfactory result is obtained as indicated by the number of data points within a certain error threshold. In a similar vein, least trimmed squares PCA models [45, 44] aim to exclude outliers from the squared error functional, but in a deterministic way. Another possible approach is to minimize the median of the squared errors as in [34].

The variational model of Candès et al. [5] decomposes the data matrix $Y=(y_{1}\;\ldots\;y_{N})$ into a low rank and a sparse part by solving

$$\operatorname*{arg\,min}_{L,S}\|L\|_{*}+\lambda\|S\|_{1}\quad\text{subject to}\quad Y=L+S,$$

exploiting the nuclear norm $\|L\|_{*}$ of $L$ and the sum of the absolute values of entries $\|S\|_{1}$. Then $L$ can be considered as the robust part, while $S$ addresses the outliers. Related approaches such as [35, 52] separate the low rank component from the column sparse one using similar norms.

Another group of robust PCA approaches replaces the squared $\ell_{2}$ norm in PCA by the $\ell_{1}$ norm. Then the minimization of the energy functional can be addressed by linear programming, see, e.g., Ke and Kanade [19]. Unfortunately, this norm is not rotationally invariant.

Mathematically interesting approaches follow (1) - (4), but skip the squares in the Euclidean distances and the inner products to find more robust directions. Taking pure Euclidean distances has several consequences. First of all, the energy functionals become non-differentiable at a finite number of subspaces spanned by matrices $A$, resp. directions $a$, which we collect within the so-called anchor set. Further, the offset $\hat{b}$ in

$$(\hat{A},\hat{b})\in\operatorname*{arg\,min}_{A\in\mathbb{R}^{d,K},\,b\in\mathbb{R}^{d}}\sum_{i=1}^{N}\min_{t\in\mathbb{R}^{K}}\|At+b-x_{i}\|\tag{5}$$

cannot be simply determined, see our brief discussion in Section 5. Let us assume that an offset $\hat{b}$ is given, so that we can restrict our attention to the data $y_{i}=x_{i}-\hat{b}$, $i=1,\ldots,N$, and (5) becomes

$$\hat{A}\in\operatorname*{arg\,min}_{A\in\mathbb{S}_{d,K}}\sum_{i=1}^{N}\|P_{A}y_{i}\|.\tag{6}$$

Even then we lose the nested subspace property of the classical PCA, so that in particular

[TABLE]

and

[TABLE]

have in general nothing to do with the columns of the matrix $\hat{A}$ obtained in (6). Finally, the residual minimizing point of view (7) leads to different results than the variance maximizing one in (8), see Fig. 2.

The models (6) - (8) were considered in the literature. The maximization of (8) was suggested with more general scalable functions than just the absolute value by Huber [15, p. 203] and studied in detail as PP-PCA by Li and Chen [29]. It was reinvented and tackled with a greedy algorithm in [25] and the greedy algorithm was made more robust using median computations in [14]. For other methods in this direction, see also [35, 39]. In [14] it was pointed out that the variance maximizing method in [25] lacks a certain robustness since it still involves mean computations. This was already demonstrated in Fig. 2.

Model (6) was treated by Ding et al. [7], where the authors circumvented the anchor set by smoothing the original energy functional. The paper gives no convergence analysis of the proposed algorithm. A tight convex relaxation approach for (6), called REAPER was suggested in [27]. The relaxation replaces the condition that the symmetric positive semidefinite matrix $AA^{\mathrm{T}}$ has eigenvalues in $\{0,1\}$ by the condition of eigenvalues in $[0,1]$ . This blows the problem size up. Numerically the relaxed problem can be solved via an iteratively re-weighted least squares algorithm. Usually this requires again a smoothing of the relaxed convex, but still non-differentiable functional.

In this paper, we are interested in the residual minimizing approach (7). Recently, a minimization algorithm was published by Keeling and Kunisch [20]. Local convergence of their algorithm to a local minimizer was proved if the two parameters within the algorithm are chosen appropriately, but without a concrete specification of their range. The outcome of the algorithm is very sensitive to the choice of the parameters. We propose a minimization algorithm which is completely different from that in [20]. It is based on ideas of the classical Weiszfeld algorithm [51] for computing the geometric median of points in $\mathbb{R}^{d}$ and has the advantage that no parameters have to be tuned. In non-anchor directions the algorithm can be considered as a gradient descent algorithm on the sphere, where the step length is automatically given. The treatment of anchor directions relies on one-sided directional derivatives of the energy functional. We show that such derivatives can be used to characterize local minima of locally Lipschitz continuous functions on submanifolds of $\mathbb{R}^{d}$, which is interesting on its own. We prove global convergence of our algorithm to a critical point of the energy functional, where we take special care of the anchor set.

Outline of the paper

In Section 2, we recall the Weiszfeld algorithm for computing the geometric median of given data points. Properties of the energy function, critical point conditions and the minimization algorithm are developed in Section 3. The main part of the paper is the convergence analysis of our algorithm in Section 4. Some remarks on the offset are given in Section 5. Numerical examples demonstrate the performance of our algorithm in Section 6. The paper ends with conclusions and ideas for future work in Section 7. Appendix A provides a criterion for determining local minimizers of locally Lipschitz continuous functions on embedded manifolds in $\mathbb{R}^{d}$ which is applied in Section 3.

2 Weiszfeld’s Algorithm for Geometric Median Computation

We start with a small review of Weiszfeld’s algorithm with two aims: first, the geometric median usually replaces the mean as offset in robust PCA methods. Second, having the original Weiszfeld algorithm in mind helps to understand the basic intention of our algorithm for minimizing (7).

The geometric median $\hat{x}\in\mathbb{R}^{d}$ of pairwise distinct points $x_{i}\in\mathbb{R}^{d}$ , $i=1,\ldots,N$ , which are not aligned, is uniquely determined by

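$$\hat{x}:=\operatorname*{arg\,min}_{x}\mathcal{E}(x):=\operatorname*{arg\,min}_{x}\sum_{i=1}^{N}\|x-x_{i}\|.$$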

An efficient algorithm for solving the geometric median problem is the Weiszfeld algorithm which goes back to the Hungarian mathematician A. Vazsonyi (Weiszfeld) [50, 51] and can also be seen as a special majorize-minimize algorithm, see, e.g. [6]. In [22, 23] it was recognized that the original algorithm of Weiszfeld fails if an iterate produced by the algorithm belongs to the so-called anchor set $\mathcal{A}:=\{x_{1},\ldots,x_{N}\}$ consisting of the points where $\mathcal{E}$ is not differentiable. The most natural way to bypass the anchor points is to define an appropriate descent direction of $\mathcal{E}$ at those points [40, 49]. To derive the algorithm, recall that the function $\mathcal{E}$ is convex and, by Fermat’s rule, the vector $\hat{x}\in\mathbb{R}^{d}$ is a minimizer of $\mathcal{E}$ if and only if

$$0\in\partial\mathcal{E}(\hat{x})=\left\{\begin{array}{ll}\nabla\mathcal{E}(\hat{x})=\sum\limits_{i=1}^{N}\frac{\hat{x}-x_{i}}{\|\hat{x}-x_{i}\|}&\mbox{if}\;\hat{x}\not\in\mathcal{A},\\[4pt] \sum\limits_{{i=1\atop x_{i}\not=\hat{x}}}^{N}\frac{\hat{x}-x_{i}}{\|\hat{x}-x_{i}\|}+\overline{B_{1}(0)}&\mbox{if}\;\hat{x}\in\mathcal{A},\end{array}\right.$$

where $\partial\mathcal{E}$ denotes the subdifferential of $\mathcal{E}$ and $\overline{B_{1}(0)}$ the closed Euclidean ball around zero with radius 1. Thus, a minimizer $\hat{x}\not\in\mathcal{A}$ has to fulfill the fixed point equation

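$$\hat{x}=\hat{x}-\Big(\sum_{i=1}^{N}\frac{1}{\|\hat{x}-x_{i}\|}\Big)^{-1}\sum_{i=1}^{N}\frac{\hat{x}-x_{i}}{\|\hat{x}-x_{i}\|},\tag{9}$$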

while $\hat{x}\in\mathcal{A}$ is a minimizer if and only if

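$$\Big\|\sum_{{i=1\atop x_{i}\not=\hat{x}}}^{N}\frac{\hat{x}-x_{i}}{\|\hat{x}-x_{i}\|}\Big\|\leq 1.\tag{10}$$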

The Weiszfeld algorithm is an iterative algorithm which produces a sequence $\{x^{(r)}\}_{r}$ as follows: if $x^{(r)}\not\in\mathcal{A}$ , then we apply the Picard iteration belonging to (9),

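$$x^{(r+1)}=x^{(r)}-\underbrace{\Big(\sum_{i=1}^{N}\frac{1}{\|x^{(r)}-x_{i}\|}\Big)^{-1}}_{s_{r}^{-1}}\underbrace{\sum_{i=1}^{N}\frac{x^{(r)}-x_{i}}{\|x^{(r)}-x_{i}\|}}_{\nabla\mathcal{E}(x^{(r)})}.$$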

This is a gradient descent step with special step size $s_{r}^{-1}$ . If $x^{(r)}\in\mathcal{A}$ , i.e. $x^{(r)}=x_{k}$ for some $k\in\{1,\ldots,N\}$ and fulfills the minimality condition (10), then the algorithm stops; otherwise we perform a descent step in direction of the subgradient in $\partial{\mathcal{E}}(x^{(r)})$ which is closest to zero

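$$x^{(r+1)}:=x^{(r)}-\Big(\sum\limits_{i=1\atop i\not=k}^{N}\frac{1}{\|x_{k}-x_{i}\|}\Big)^{-1}\left(1-\frac{1}{\|G_{k}\|}\right)G_{k},$$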

where $G_{k}:=\sum\limits_{{i=1\atop i\not=k}}^{N}\frac{x_{k}-x_{i}}{\|x_{k}-x_{i}\|}\in\partial{\mathcal{E}}(x_{k})$ .
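The two cases of the iteration described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses NumPy, assumes pairwise distinct data points, detects an anchor by an exact zero distance, and the name `weiszfeld` together with the parameters `max_iter` and `tol` is hypothetical.

```python
import numpy as np

def weiszfeld(X, x0=None, max_iter=1000, tol=1e-10):
    """Geometric median of the rows of X (assumed pairwise distinct) with an anchor step."""
    X = np.asarray(X, dtype=float)
    x = X.mean(axis=0) if x0 is None else np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        diff = x - X                                   # rows x - x_i
        dist = np.linalg.norm(diff, axis=1)
        at_anchor = dist == 0.0                        # iterate coincides with a data point x_k
        if at_anchor.any():
            keep = ~at_anchor
            G = (diff[keep] / dist[keep, None]).sum(axis=0)
            g_norm = np.linalg.norm(G)
            if g_norm <= 1.0:                          # minimality condition (10): stop
                return x
            # step along the subgradient of smallest norm, (1 - 1/||G_k||) G_k
            step = (1.0 - 1.0 / g_norm) * G / np.sum(1.0 / dist[keep])
        else:
            # Picard step: gradient scaled by the inverse of s_r = sum_i 1/||x - x_i||
            step = (diff / dist[:, None]).sum(axis=0) / np.sum(1.0 / dist)
        x_new = x - step
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x
```

For the four corners of a square the sketch returns the center, both when started at the mean and when started at one of the corners, where the first step is the anchor step.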

Local and asymptotic convergence rates of the Weiszfeld algorithm were given in [18] and a non-asymptotic sublinear convergence rate was proved in [3]. The very good performance of Weiszfeld’s algorithm in comparison with the parallel proximal point algorithm was shown in [47] and a projected Weiszfeld algorithm was established in [38]. Keeling and Kunisch [20] suggested another stable algorithm for finding the geometric median; it is motivated by a critique of the behavior of the original Weiszfeld algorithm at anchor points and does not take its stabilized versions into account. A good reference on past and ongoing research in this direction is [3] and the references therein.

3 Weiszfeld-like Algorithm for Robust PCA

We consider the minimization approach (7). First of all we see in the next remark that the direction $a_{k+1}$ is indeed perpendicular to the previous directions $\{a_{1},\ldots,a_{k}\}$ .

Remark 3.1.

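Let $\varphi:\mathbb{R}_{\geq 0}\rightarrow\mathbb{R}$ be a strictly increasing function. In our application we are interested in $\varphi(x)=x^{\frac{1}{2}}$. For any $z_{i}\in\mathbb{R}^{d}$, $i=1,\ldots,N$, it holds

$$\operatorname*{arg\,min}_{\|a\|=1}\sum_{i=1}^{N}\varphi\big(\|P_{a}z_{i}\|^{2}\big)\;\in\;\mathrm{span}\{z_{i}:i=1,\ldots,N\}$$

for the following reasons: Every $a\in\mathbb{R}^{d}$ with $\|a\|=1$ can be written as

$$a=\frac{\tilde{a}+\tilde{a}_{\perp}}{\|\tilde{a}+\tilde{a}_{\perp}\|},$$

where $\tilde{a}\in\mathrm{span}\{z_{i}:i=1,\ldots,N\}$ and $\tilde{a}_{\perp}$ is in the orthogonal complement of $\mathrm{span}\{z_{i}:i=1,\ldots,N\}$. Then we have for every $z\in\mathrm{span}\{z_{i}:i=1,\ldots,N\}$,

$$\|P_{a}z\|^{2}=\|z\|^{2}-\langle a,z\rangle^{2}=\|z\|^{2}-\frac{\langle\tilde{a},z\rangle^{2}}{\|\tilde{a}\|^{2}+\|\tilde{a}_{\perp}\|^{2}}\geq\|z\|^{2}-\frac{\langle\tilde{a},z\rangle^{2}}{\|\tilde{a}\|^{2}}$$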

with equality if $\|\tilde{a}_{\perp}\|=0$ . Since $\varphi$ is strictly increasing, any minimizer $\hat{a}$ must be in $\mathrm{span}\{z_{i}:i=1,\ldots,N\}$ . $\Box$

We have to deal with the function

$$E(a):=\sum_{i=1}^{N}E_{i}(a)=\sum_{i=1}^{N}\|P_{a}y_{i}\|.\tag{12}$$

This function is continuously differentiable on $\mathbb{R}^{d}$ except for $a\in\mathbb{R}^{d}$ satisfying $\|P_{a}y_{k}\|=\|(I_{d}-aa^{\mathrm{T}})y_{k}\|=0$ for some $k\in\{1,\ldots,N\}$ . This is equivalent to $y_{k}=a\langle a,y_{k}\rangle$ and for $a\in\mathbb{S}^{d-1}$ to $a\in\{\pm\frac{y_{k}}{\|y_{k}\|}\}$ . Let

$$\mathcal{A}:=\Big\{\pm\frac{y_{i}}{\|y_{i}\|}:i=1,\ldots,N\Big\}$$

denote this set of directions on $\mathbb{S}^{d-1}$, where $E$ is not differentiable. Similarly as in Weiszfeld’s algorithm, we call it the anchor set.
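The energy (12) and the anchor set are straightforward to evaluate. The following sketch assumes NumPy, stores the data points $y_{i}$ as rows of an array `Y`, and assumes that no $y_{i}$ is zero; the names `energy` and `anchor_directions` are hypothetical.

```python
import numpy as np

def energy(a, Y):
    """E(a) = sum_i ||P_a y_i|| with P_a = I - a a^T, rows of Y being the points y_i."""
    residuals = Y - np.outer(Y @ a, a)      # row i is P_a y_i = y_i - a <a, y_i>
    return float(np.linalg.norm(residuals, axis=1).sum())

def anchor_directions(Y):
    """All directions +y_i/||y_i|| and -y_i/||y_i|| (assumes every y_i is nonzero)."""
    U = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return np.vstack([U, -U])
```

At every returned anchor direction at least one summand $\|P_{a}y_{k}\|$ vanishes, which is exactly where $E$ fails to be differentiable.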

The following theorem collects important properties of $E$. The third property relies on the relation between one-sided derivatives and local minima of Lipschitz continuous functions on embedded manifolds in $\mathbb{R}^{d}$. The definition of one-sided derivatives and a theorem characterizing local minima are given in Appendix A. In our case the embedded manifold is the sphere $\mathbb{S}^{d-1}\coloneqq\{a\in\mathbb{R}^{d}:\ \|a\|=1\}$.

Theorem 3.2.

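Let $E$ be defined by (12).

1. The function $E$ is locally Lipschitz continuous on $\mathbb{R}^{d}$.

2. For $a\in\mathbb{S}^{d-1}\backslash\mathcal{A}$, it holds

$$\nabla E(a)=-P_{a}C_{a}a,\qquad C_{a}:=\sum_{i=1}^{N}\frac{1}{\|P_{a}y_{i}\|}y_{i}y_{i}^{\mathrm{T}},\tag{13}$$

and $\nabla E(a)$ is in the tangent space $T_{a}\mathbb{S}^{d-1}$ of $\mathbb{S}^{d-1}$ at $a$.

3. A direction $a\in\mathcal{A}$ is a local minimizer of $E$ if

$$\|G_{a,\mathcal{K}}\|<\sum_{k\in\mathcal{K}}\|y_{k}\|,$$

where $\mathcal{K}:=\{k\in\{1,\ldots,N\}:\|P_{a}y_{k}\|=0\}$ and

$$G_{a,\mathcal{K}}:=P_{a}C_{a,\mathcal{K}}a,\qquad C_{a,\mathcal{K}}:=\sum_{i\not\in\mathcal{K}}\frac{1}{\|P_{a}y_{i}\|}y_{i}y_{i}^{\mathrm{T}}.$$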

Proof.

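It suffices to show the property for the summands $E_{i}$. For an arbitrary fixed $a\in\mathbb{R}^{d}$, let $\|a-a_{i}\|\leq\varepsilon$, $i=1,2$. Then we obtain

$$\begin{aligned}|E_{i}(a_{1})-E_{i}(a_{2})|&\leq\|a_{1}a_{1}^{\mathrm{T}}-a_{2}a_{2}^{\mathrm{T}}\|_{F}\|y_{i}\|\\&=\frac{1}{2}\|(a_{1}-a_{2})(a_{1}^{\mathrm{T}}+a_{2}^{\mathrm{T}})+(a_{1}+a_{2})(a_{1}^{\mathrm{T}}-a_{2}^{\mathrm{T}})\|_{F}\|y_{i}\|\\&\leq 2(\|a\|+\varepsilon)\|y_{i}\|\|(a_{1}-a_{2})\|_{F}.\end{aligned}$$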

By straightforward computation we obtain at points $a\in\mathbb{R}^{d}$ , where $E$ is differentiable,

[TABLE]

The second summand vanishes for $a\in\mathbb{S}^{d-1}$ which yields (13). Since $P_{a}$ projects to the space orthogonal to $a$ the gradient $\nabla E(a)$ lies in $T_{a}\mathbb{S}^{d-1}$ .

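For $a\in\mathcal{A}$ we have $a\in\{\pm\frac{y_{k}}{\|y_{k}\|}\}$ for $k\in\mathcal{K}$. Then the one-sided directional derivative of $E_{k}$ at $a\in\mathcal{A}$ in direction $h\in T_{a}\mathbb{S}^{d-1}$ reads as

$$\begin{aligned}DE_{k}(a;h)&=\lim_{\alpha\downarrow 0}\frac{\|\alpha ah^{\mathrm{T}}y_{k}+\alpha ha^{\mathrm{T}}y_{k}+\alpha^{2}hh^{\mathrm{T}}y_{k}\|}{\alpha}\\&=\|(ah^{\mathrm{T}}+ha^{\mathrm{T}})y_{k}\|=\|ha^{\mathrm{T}}y_{k}\|=\|h\|\|y_{k}\|.\end{aligned}$$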

For $i\not\in\mathcal{K}$ we have by part 2 of the proof

[TABLE]

so that in summary

[TABLE]

Since $E$ is locally Lipschitz continuous on $\mathbb{R}^{d}$ , we conclude by Theorem A.2 that $a\in\mathcal{A}$ is a local minimizer if

[TABLE]

for all $h\in T_{a}\mathbb{S}^{d-1}$. Since $P_{a}C_{a,\mathcal{K}}a\in T_{a}\mathbb{S}^{d-1}$, this is equivalent to

[TABLE]

∎
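The gradient formula (13) can be checked numerically against central finite differences of $E$. The sketch below is an added sanity check under stated assumptions, not part of the paper: it uses NumPy, a unit non-anchor direction `a`, and the hypothetical helper names `energy`, `gradient_on_sphere` and `directional_fd`.

```python
import numpy as np

def energy(a, Y):
    """E(a) = sum_i ||(I - a a^T) y_i||, rows of Y being the points y_i."""
    return float(np.linalg.norm(Y - np.outer(Y @ a, a), axis=1).sum())

def gradient_on_sphere(a, Y):
    """-P_a C_a a for a unit direction a that is not an anchor direction."""
    r = np.linalg.norm(Y - np.outer(Y @ a, a), axis=1)   # ||P_a y_i||, assumed > 0
    Ca_a = Y.T @ ((Y @ a) / r)                           # C_a a = sum_i y_i <y_i, a> / ||P_a y_i||
    return -(Ca_a - a * (a @ Ca_a))                      # apply P_a = I - a a^T

def directional_fd(a, h, Y, eps=1e-6):
    """Central finite difference of E at a in direction h."""
    return (energy(a + eps * h, Y) - energy(a - eps * h, Y)) / (2 * eps)
```

For a unit vector `a`, the value `gradient_on_sphere(a, Y) @ h` should agree with `directional_fd(a, h, Y)` up to discretization error, and the gradient should be orthogonal to `a`, in line with part 2 of the theorem.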

To establish a Weiszfeld-like algorithm, we consider again two cases:

If $a\not\in\mathcal{A}$ , then $0=\nabla E(a)=-P_{a}C_{a}a$ can be rewritten as the fixed point equation

[TABLE]

This gives rise to the gradient descent step on $\mathbb{S}^{d-1}$ :

[TABLE]

where the factor $s_{a^{(r)}}$ cancels out when projecting on $\mathbb{S}^{d-1}$ . This also appears in the algorithm proposed by Ding et al. [7] from another point of view.

If $a\in\mathcal{A}$ and $\|G_{a,\mathcal{K}}\|>\sum_{k\in\mathcal{K}}\|y_{k}\|$ , then we suggest to use

[TABLE]

instead of the gradient as descent direction which results in the iteration

[TABLE]

with

[TABLE]

and subsequent orthogonal projection onto $\mathbb{S}^{d-1}$ . In summary, we obtain Algorithm 1.
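The non-anchor case can be sketched as the normalized update $a\mapsto C_{a}a/\|C_{a}a\|$, which is the form of the iteration function used in the convergence analysis below. This is a minimal illustration, not the authors' Algorithm 1: it assumes NumPy, a start direction that is not orthogonal to all data points, and it deliberately raises an error at anchor directions instead of implementing the anchor step. The names `weiszfeld_like_step` and `iterate` are hypothetical.

```python
import numpy as np

def energy(a, Y):
    """E(a) = sum_i ||(I - a a^T) y_i||, rows of Y being the points y_i."""
    return float(np.linalg.norm(Y - np.outer(Y @ a, a), axis=1).sum())

def weiszfeld_like_step(a, Y):
    """One non-anchor update a -> C_a a / ||C_a a|| for a unit direction a."""
    r = np.linalg.norm(Y - np.outer(Y @ a, a), axis=1)   # ||P_a y_i||
    if np.any(r == 0.0):
        raise ValueError("anchor direction: the separate anchor step is not sketched here")
    v = Y.T @ ((Y @ a) / r)                              # C_a a
    return v / np.linalg.norm(v)                         # projection onto the unit sphere

def iterate(Y, a0, max_iter=200, tol=1e-12):
    """Repeat the non-anchor step and record the energy after every iteration."""
    a = np.asarray(a0, dtype=float)
    a = a / np.linalg.norm(a)
    energies = [energy(a, Y)]
    for _ in range(max_iter):
        a_new = weiszfeld_like_step(a, Y)
        energies.append(energy(a_new, Y))
        converged = np.linalg.norm(a_new - a) <= tol
        a = a_new
        if converged:
            break
    return a, energies
```

Recording the energies makes it easy to observe the monotone decrease that the convergence analysis establishes for the iterates.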

4 Convergence Analysis

In this section, we show that the sequence generated by Algorithm 1 converges to a critical point of $E$, where we say that $a\in\mathbb{S}^{d-1}$ is a critical point of $E$ on $\mathbb{S}^{d-1}$ if one of the following conditions is fulfilled:

i) $a\not\in\mathcal{A}$ and $-\nabla E(a)=P_{a}C_{a}a=0$.

ii) $a\in\mathcal{A}$ and $\|G_{a,\mathcal{K}}\|\leq\sum_{k\in\mathcal{K}}\|y_{k}\|$.

We need four lemmata and apply a theorem of Attouch, Bolte and Svaiter [2] on the convergence of sequences for functions having the Kurdyka–Łojasiewicz property.

Lemma 4.1.

For the sequence $\{a^{(r)}\}_{r}$ produced by Algorithm 1 we have $a^{(r+1)}=a^{(r)}$ if and only if $a^{(r)}$ is a critical point of $E$ on $\mathbb{S}^{d-1}$ . If the iteration stops after finitely many steps, then it has reached a critical point.

Proof.

Let $a^{(r+1)}=a^{(r)}=a$. If $a$ is not in the anchor set, this implies $\frac{C_{a}a}{\|C_{a}a\|}=a$ and hence $P_{a}C_{a}a=\|C_{a}a\|P_{a}a=0$. If $a$ is in the anchor set, then the relation in ii) must be fulfilled by the stopping condition.

Let $a^{(r)}\in\mathbb{S}^{d-1}$ be a critical point of $E$ . If $a^{(r)}$ is not in the anchor set, then by definition $0=P_{a^{(r)}}C_{a^{(r)}}a^{(r)}=C_{a^{(r)}}a^{(r)}-a^{(r)}\,(a^{(r)})^{\mathrm{T}}C_{a^{(r)}}a^{(r)}$ so that

[TABLE]

If $a^{(r)}$ is in the anchor set, then $\|G_{{a^{(r)}},\mathcal{K}}\|\leq\sum_{k\in\mathcal{K}}\|y_{k}\|$ and the iteration stops by definition, i.e. $a^{(r+1)}=a^{(r)}$ . ∎

Lemma 4.2.

Let $\{a^{(r)}\}_{r}$ be the sequence generated by Algorithm 1. If $a^{(r+1)}\not=a^{(r)}$ , then $E(a^{(r+1)})<E(a^{(r)})$ . The sequence $\{E(a^{(r)})\}_{r}$ converges to some value $\hat{E}\geq 0$ .

Proof.

If the sequence of function values decreases, its convergence follows immediately from the fact that E is bounded from below by zero. To show the decrease property, we set $a:=a^{(r)}$ , $\bar{a}:=a^{(r+\frac{1}{2})}$ and $\tilde{a}=a^{(r+1)}$ and abbreviate

[TABLE]

where $\mathcal{K}$ is the empty set if $a^{(r)}$ is not an anchor direction.

Case 1: Let $a\notin\mathcal{A}$ be a non-anchor direction. For $u\geq 0,v>0$ it holds $u-v\leq\frac{u^{2}-v^{2}}{2v}$ so that

[TABLE]

Using $\|u-v\|^{2}-\|w-v\|^{2}=2\langle u-w,u-v\rangle-\|u-w\|^{2}$ we get

[TABLE]

which finally implies

[TABLE]

Since $a,\tilde{a}\in\mathrm{span}(Y)$ the right-hand side is strictly negative except for $\tilde{a}=\pm a$ which was excluded.
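For convenience, the two elementary facts used in Case 1 can be verified directly. The short derivation below is an added check, not taken from the paper; the symbols $p$ and $q$ are auxiliary and introduced only here.

```latex
% Scalar inequality, for u \ge 0 and v > 0:
\frac{u^{2}-v^{2}}{2v}-(u-v)=\frac{u^{2}-2uv+v^{2}}{2v}=\frac{(u-v)^{2}}{2v}\geq 0
\quad\Longrightarrow\quad u-v\leq\frac{u^{2}-v^{2}}{2v}.

% Vector identity, with p := u-v and q := u-w, so that w-v = p-q:
\|u-v\|^{2}-\|w-v\|^{2}=\|p\|^{2}-\|p-q\|^{2}=2\langle q,p\rangle-\|q\|^{2}
=2\langle u-w,u-v\rangle-\|u-w\|^{2}.
```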

Case 2: Let $a\in\mathcal{A}$ , i.e., $\|P_{a}y_{k}\|=0$ for $k\in\mathcal{K}\not=\emptyset$ and

[TABLE]

From $P_{a}y_{k}=0$ , $k\in\mathcal{K}$ , i.e., $y_{k}=a(a^{\mathrm{T}}y_{k})$ we obtain $\|y_{k}\|=|a^{\mathrm{T}}y_{k}|$ . Since $\bar{a}=a+S^{-1}\left(1-\frac{\alpha}{\|G\|}\right)G$ and $a\perp G$ we have

[TABLE]

We have to estimate

[TABLE]

First, we get for $k\in{\mathcal{K}}$ ,

[TABLE]

so that

[TABLE]

Replacing the sum over $\{1,\ldots,N\}$ in the first step of the proof by those over $\{1,\ldots,N\}\backslash\mathcal{K}$ we get instead of (24)

[TABLE]

By definition of $\bar{a}$ and (26) we can rewrite

[TABLE]

For the second sum in (30) we get

[TABLE]

Application of $a^{\mathrm{T}}CG=\|G\|^{2}$ and of the definition of $\tilde{a}$ leads to

[TABLE]

and

[TABLE]

Hence we obtain

[TABLE]

Since $C$ is symmetric positive definite, we conclude by Young’s inequality

[TABLE]

so that

[TABLE]

Combining this equation with (29), (30) and (35), and using that $\|\bar{a}\|>1$ , we obtain

[TABLE]

∎

Lemma 4.3.

Let $\{a^{(r)}\}_{r}$ be an infinite sequence generated by Algorithm 1. Then we have

[TABLE]

The set of accumulation points is compact and connected.

Proof.

Since the number of anchor directions is finite, we can choose $R$ large enough such that none of the iterates $a^{(r)}$, $r\geq R$, is an anchor direction. Since the projection $\Pi_{\mathbb{S}^{d-1}}$ onto the unit sphere is non-expansive for points not in the interior of the unit ball, we obtain

[TABLE]

We show that all accumulation points of $\{\beta_{r}\}_{r}$ with $\beta_{r}:=\|a^{(r+1)}-a^{(r)}\|$ are zero. Note that such accumulation points exist, since $\mathbb{S}^{d-1}$ is compact so that the sequence is bounded from below and above. Let $\{\beta_{r_{j}}\}_{j}$ converge to $\hat{\beta}$ which is then also true for every subsequence. Let $\{\beta_{r_{j_{i}}}\}_{i}$ be any subsequence for which $\{a^{(r_{j_{i}})}\}_{i}$ converges to an accumulation point $\hat{a}$. For simplicity of notation, we skip the second index $i$. We distinguish two cases:

1. Let $\hat{a}\notin\mathcal{A}$ be a non-anchor direction. Then the update operator $T(a)=\frac{C_{a}a}{\|C_{a}a\|}$ of the algorithm is continuous in $\hat{a}$ so that $\lim_{j\to\infty}a^{(r_{j}+1)}=\lim_{j\to\infty}T(a^{(r_{j})})=T(\hat{a})$. By Lemma 4.2 and continuity of $E$, we get

[TABLE]

so that $\hat{a}=T(\hat{a})$. This in turn yields $P_{\hat{a}}C_{\hat{a}}\hat{a}=\|C_{\hat{a}}\hat{a}\|P_{\hat{a}}\hat{a}=0$. Since the $\|P_{a^{(r_{j})}}y_{i}\|$ are bounded from above, and since $a^{(r)}\in\mathrm{span}(Y)$ for all $r$, we conclude that $s_{a^{(r_{j})}}$ is bounded from below. Taking the continuity of the involved operators in $\hat{a}$ into account, this implies

[TABLE]

2. Let $\hat{a}\in\mathcal{A}$ be an anchor direction. Then it holds

[TABLE]

while

[TABLE]

so that

[TABLE]

This proves (39). By Ostrowski’s Theorem, the set of accumulation points of the sequence of iterates is compact and connected. ∎

Lemma 4.4.

Let $\hat{a}$ be an anchor direction. Let $T$ denote the iteration function of Algorithm 1. Then

[TABLE]

Proof.

For simplicity of notation, we assume that $\mathcal{K}=\{k\}$ and without loss of generality $\hat{a}=y_{k}/\|y_{k}\|$ . We set

[TABLE]

Similarly as in the proof of Lemma 4.3, Case 1, we have that $P_{a}C_{a}a$ is bounded from above and $\lim_{a\to\hat{a}}s_{a}=\infty$ so that $\lim_{a\to\hat{a}}\|T_{a}\|=1$ . We calculate

[TABLE]

The first term can be rearranged as

[TABLE]

By Taylor approximation of $\sqrt{1+x}$ at $x=0$ we get

[TABLE]

Plugging this into (46) yields

[TABLE]

In order to calculate the limit of this expression, we first consider

[TABLE]

and since

[TABLE]

finally

[TABLE]

The remainder of the Taylor approximation converges to zero as $\|P_{a}C_{a}a\|$ is bounded from above, while $s_{a}$ goes to infinity. Together with (47) this gives the limit of the first term,

[TABLE]

For the second term in (41) we calculate

[TABLE]

Now it is straightforward to check that

[TABLE]

so that by definition of $s_{a}$ the term (50) becomes

[TABLE]

Using that $G_{a,k}=P_{a}C_{a,k}a$ , $y_{k}=\hat{a}\|y_{k}\|$ and $P_{a}$ is an orthogonal projector, we can simplify

[TABLE]

so that

[TABLE]

As $\langle\frac{P_{a}y_{k}}{\|P_{a}y_{k}\|},G_{a,k}\rangle$ is bounded, $\lim_{a\to\hat{a}}a^{\mathrm{T}}y_{k}=\|y_{k}\|$ and $\lim_{a\to\hat{a}}\|T_{a}\|=1$ we get

[TABLE]

Plugging the results into (41) yields the assertion

[TABLE]

∎

Finally, we need the Kurdyka–Łojasiewicz property of functions [1]: The function $f\colon\mathbb{R}^{d}\to\mathbb{R}\cup\{+\infty\}$ with Fréchet limiting subdifferential $\partial f$ , see [36], is said to have the Kurdyka–Łojasiewicz (KL) property at $x^{*}\in\operatorname{dom}\partial f$ if there exist $\eta\in(0,+\infty)$ , a neighborhood $U$ of $x^{*}$ and a continuous concave function $\phi\colon[0,\eta)\to\mathbb{R}_{\geq 0}$ such that

1. $\phi(0)=0$,

2. $\phi$ is $C^{1}$ on $(0,\eta)$,

3. for all $s\in(0,\eta)$ it holds $\phi^{\prime}(s)>0$,

4. for all $x\in U\cap[f(x^{*})<f<f(x^{*})+\eta]$, the Kurdyka–Łojasiewicz inequality $\phi^{\prime}(f(x)-f(x^{*})){\mathrm{d}}(0,\partial f(x))\geq 1$ holds true.

A proper, lower semi-continuous (lsc) function which satisfies the KL property at each point of $\operatorname{dom}\partial f$ is called a KL function. Typical examples of KL functions are semi-algebraic functions. Fundamental works on this subject go back to Łojasiewicz [31] and Kurdyka [24].
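As a simple illustration of the definition (an example chosen here, not taken from the paper), consider $f(x)=x^{2}$ on $\mathbb{R}$ at $x^{*}=0$ with the candidate function $\phi(s)=\sqrt{s}$, which is continuous, concave, vanishes at $0$ and is $C^{1}$ with positive derivative on $(0,\eta)$ for any $\eta>0$:

```latex
\begin{aligned}
f(x) &= x^{2}, \qquad \partial f(x)=\{2x\}, \qquad \mathrm{d}(0,\partial f(x)) = 2|x|,\\
\phi(s) &= \sqrt{s}, \qquad \phi^{\prime}(s)=\frac{1}{2\sqrt{s}} > 0 \quad (s>0),\\
\phi^{\prime}\bigl(f(x)-f(0)\bigr)\,\mathrm{d}(0,\partial f(x))
  &= \frac{1}{2|x|}\cdot 2|x| = 1 \;\geq\; 1 \qquad (x\neq 0).
\end{aligned}
```

Hence the KL inequality holds at $x^{*}=0$ with $U=\mathbb{R}$, in this case even with equality.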

The next theorem was proved by Attouch, Bolte and Svaiter [2, Theorem 2.9].

Theorem 4.5.

Let $f\colon\mathbb{R}^{d}\to\mathbb{R}\cup\{\infty\}$ be a KL function. Let $\{x^{(r)}\}_{r\in\mathbb{N}}$ be a sequence which fulfills the following conditions:

C1. There exists $K_{1}>0$ such that $f(x^{(r+1)})-f(x^{(r)})\leq-K_{1}\|x^{(r+1)}-x^{(r)}\|^{2}$ for every $r\in\mathbb{N}$.

C2. There exists $K_{2}>0$ such that for every $r\in\mathbb{N}$ there exists $w_{r+1}\in\partial f(x^{(r+1)})$ with $\|w_{r+1}\|\leq K_{2}\|x^{(r+1)}-x^{(r)}\|$, where $\partial f$ denotes the Fréchet limiting subdifferential of $f$ [36].

C3. There exists a convergent subsequence $\{x^{(r_{j})}\}_{j\in\mathbb{N}}$ with limit $\hat{x}$ and $f(x^{(r_{j})})\to f(\hat{x})$.

Then the whole sequence $\{x^{(r)}\}_{r\in\mathbb{N}}$ converges to $\hat{x}$ and $\hat{x}$ is a critical point of $f$ in the sense that $0\in\partial f(\hat{x})$. Moreover, the sequence has finite length, i.e.,

$$\sum_{r=0}^{\infty}\|x^{(r+1)}-x^{(r)}\|<\infty.$$

Clearly, if $f$ is differentiable at $x$ , then $x$ is a critical point of $f$ , if and only if $\nabla f(x)=0$ . We will only need this case.
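Conditions C1 and C2 can be checked numerically along an iteration. The following is a minimal sketch, assuming plain gradient descent with step size $1/L$ on a convex quadratic as a stand-in (it is not Algorithm 1). The function name `check_descent_conditions` and the constants $K_{1}=L/2$, $K_{2}=2L$ are choices of this example; they follow from the standard descent lemma for functions with $L$-Lipschitz gradient.

```python
import numpy as np

def check_descent_conditions(A, x0, n_iter=50, tol=1e-9):
    """Run gradient descent on f(x) = 0.5 x^T A x and test C1 and C2.

    Assumptions of this sketch: A is symmetric positive definite, the step
    size is 1/L with L the largest eigenvalue, K1 = L/2 and K2 = 2L.
    """
    L = np.linalg.eigvalsh(A).max()
    K1, K2 = L / 2.0, 2.0 * L
    f = lambda x: 0.5 * x @ A @ x
    grad = lambda x: A @ x

    x = np.asarray(x0, dtype=float)
    c1_ok, c2_ok, length = True, True, 0.0
    for _ in range(n_iter):
        x_new = x - grad(x) / L
        step = np.linalg.norm(x_new - x)
        # C1: sufficient decrease of the function value
        c1_ok = c1_ok and bool(f(x_new) - f(x) <= -K1 * step**2 + tol)
        # C2: the gradient at the new iterate is controlled by the step
        c2_ok = c2_ok and bool(np.linalg.norm(grad(x_new)) <= K2 * step + tol)
        length += step          # partial sum of the sequence length
        x = x_new
    return c1_ok, c2_ok, length, x

c1, c2, length, x_end = check_descent_conditions(np.diag([1.0, 10.0]), [1.0, 1.0])
```

In this toy setting both conditions hold at every step and the accumulated length stays bounded, which is the behaviour Theorem 4.5 turns into convergence of the whole sequence.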

Similar arguments as used in the proof of the above theorem lead to the next corollary, see [2, Corollary 2.7].

Corollary 4.6.

Let $f\colon\mathbb{R}^{d}\to\mathbb{R}\cup\{+\infty\}$ be a proper, lsc function which satisfies the KL property at $x^{*}$ . Denote by $U$ , $\eta$ and $\phi$ the objects appearing in the definition of the KL function. Let $\delta,\rho>0$ be such that $B(x^{*},\delta)\subset U$ with $\rho\in(0,\delta)$ . Consider a finite sequence $x^{(r)}$ , $r=0,\dots,n$ , which satisfies the Conditions C1 and C2 of Theorem 4.5 and additionally

C4. $f(x^{*})\leq f(x^{(0)})<f(x^{*})+\eta$,

C5. $\|x^{*}-x^{(0)}\|+2\sqrt{\frac{f(x^{(0)})-f(x^{*})}{K_{1}}}+\frac{K_{2}}{K_{1}}\phi(f(x^{(0)})-f(x^{*}))\leq\rho$.

If for all $r=0,\dots,n$ it holds

[TABLE]

then $x^{(r)}\in B(x^{*},\rho)$ for all $r=0,\dots,n+1$ .

Now we can prove our main convergence theorem.

Theorem 4.7.

The sequence $\{a^{(r)}\}_{r}$ generated by Algorithm 1 converges to a critical point of $E$ .

Proof.

If the sequence is finite, the claim follows from Lemma 4.1. Assume that the algorithm produces an infinite sequence. Since the sequence $(a^{(r)})_{r\in\mathbb{N}}$ on $\mathbb{S}^{d-1}$ is bounded, there exists a convergent subsequence $(a^{(r_{j})})_{j\in\mathbb{N}}$ with $\lim_{j\to\infty}a^{(r_{j})}=\hat{a}$ . Possibly, there exist multiple accumulation points and we distinguish two cases.

First assume that no accumulation point is in the anchor set. It is easy to verify that the function $E$ is semi-algebraic on $\mathbb{R}^{d}$ and hence fulfills the KL property. We will verify that $\{a^{(r)}\}_{r}$ fulfills the remaining conditions C1 and C2 from Theorem 4.5. From the proof of Lemma 4.2, Case 1, we get

[TABLE]

Further, it holds $\|P_{a^{(r)}}y_{i}\|\leq\|y_{i}\|\leq\max_{i=1,\dots,N}\|y_{i}\|<\infty$ and there exists $m>0$ such that

[TABLE]

Using that $a^{(r)}\in\mathrm{span}(Y)$ for all $r$ and $\lim_{r\to\infty}\|a^{(r+1)}-a^{(r)}\|=0$ by Lemma 4.3, we can find $i\in\{1,\ldots,N\}$ such that $|\langle a^{(r)},y_{i}\rangle|>\frac{m}{2}$ , $|\langle a^{(r+1)},y_{i}\rangle|>\frac{m}{2}$ and both scalar products have the same sign for $r$ large enough. Hence we can estimate

[TABLE]

where w.l.o.g. $\frac{\langle a^{(r+1)},y_{i}\rangle}{\langle a^{(r)},y_{i}\rangle}\geq 1$. Using the projection onto the sphere, we can finally estimate

[TABLE]

Next we check the second condition C2. Since $\lim_{r\to\infty}\|a^{(r+1)}-a^{(r)}\|=0$ and none of the $\pm y_{i}/\|y_{i}\|$ , $i=1,\ldots,N$ , is an accumulation point, we can find open balls $B_{i}$ around every $y_{i}$ such that for all $r$ large enough we have $\overline{a^{(r)}a^{(r+1)}}\subset\Omega\coloneqq\mathbb{R}^{d}\setminus\bigcup_{i=1}^{N}B_{i}$ . The function $E$ is smooth on an open set containing the compact set $\Omega$ and hence there exists $C>0$ such that

[TABLE]

for all $r$ large enough. Further, note that the sequence $s_{a^{(r)}}$ is bounded from above on $\Omega$ which implies

[TABLE]

Using the iteration law $a^{(r+1)}=\Pi_{\mathbb{S}^{d-1}}(a^{(r)}-\frac{\nabla E(a^{(r)})}{s_{a^{(r)}}})$ together with the fact that $\nabla E(a^{(r)})$ is in the tangential plane of $\mathbb{S}^{d-1}$ at $a^{(r)}$ , we get by the law of sines, see Fig. 3,

[TABLE]

where the right hand side converges to one since $\angle(a^{(r)}a^{(r+1)})$ gets arbitrarily small. Hence, the right hand side is larger than $\frac{1}{2}$ for $r$ large enough and we can estimate

[TABLE]

Now, by Theorem 4.5, only one accumulation point exists which is also a critical point.

It remains to examine the case that some accumulation point is an anchor point $\hat{a}$ to the vertices $y_{k}$, $k\in\mathcal{K}$. Assume that there exists another accumulation point. Then, by Lemma 4.3, there exists an accumulation point $\tilde{a}$ which is not an anchor point. We can find a ball $B(\tilde{a},R)$ around $\tilde{a}$ which has positive distance to all anchor points. Next, for all the iterates $a^{(r)}\in B(\tilde{a},\frac{R}{2})$ and $r$ large enough we can reproduce step one of the proof to show that C1 and C2 are fulfilled. By the continuity of $f$ and $\phi$, see also the proof of [2, Theorem 2.9], we can choose a ball $B(\tilde{a},\delta)\subset B(\tilde{a},\frac{R}{2})\cap U$ (where $U$ is from the definition of the KL property), $\rho\in(0,\delta)$ and a starting iterate $a^{(r_{0})}\in B(\tilde{a},\rho)$ which satisfies C4 and C5 from Corollary 4.6. Since $\lim_{r\to\infty}\|a^{(r+1)}-a^{(r)}\|=0$ and $\tilde{a}$ is an accumulation point, we can choose $r_{0}$ such that

[TABLE]

for all $r\geq r_{0}$. Either all iterates after $a^{(r_{0})}$ are in $B(\tilde{a},\rho)$ or there is a finite sequence $a^{(r_{0})},a^{(r_{0}+1)},\ldots,a^{(r_{n})}$ such that $a^{(r_{n}+1)}$ is the first element outside $B(\tilde{a},\rho)$. But then, by Corollary 4.6, also the iterate $a^{(r_{n}+1)}$ is inside $B(\tilde{a},\rho)$ and hence all iterates stay in $B(\tilde{a},\rho)$, which is a contradiction. Consequently, the whole sequence converges to the anchor point $\hat{a}$.

It remains to show that the anchor point is critical. By Lemma 4.4 we know that

[TABLE]

If $\hat{a}$ is not a critical point, i.e. $\|G_{\hat{a},\mathcal{K}}\|>{\sum_{k\in\mathcal{K}}\|y_{k}\|}$ , then the sequence cannot converge to $\hat{a}$ , which is a contradiction.

∎

At this point it should be mentioned that Algorithm 1 may converge to a local minimum, as our functional is non-convex. Running the algorithm multiple times with random initializations $a^{(0)}$ and comparing the function values of the results increases the probability of reaching a global minimizer. The number of local minimizers and how pronounced they are depend on the data. In general, with fewer data points and more extreme outliers, we tend to get more pronounced local minima. However, in most applications and also in the numerical part of this paper, this is not an issue as a high number of data points is available.
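The restart strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the energy is taken as $E(a)=\sum_{i}\|P_{a}y_{i}\|$ with $P_{a}=I-aa^{\mathrm{T}}$, the update is written as $a\mapsto C_{a}a/\|C_{a}a\|$ with $C_{a}=\sum_{i}y_{i}y_{i}^{\mathrm{T}}/\|P_{a}y_{i}\|$, and the careful treatment of anchor directions is replaced by a simple clamp `eps`. The names `energy`, `fixed_point_step` and `multi_start` are hypothetical.

```python
import numpy as np

def energy(a, Y):
    """E(a) = sum_i ||P_a y_i|| with P_a = I - a a^T (rows of Y are the y_i)."""
    residuals = Y - np.outer(Y @ a, a)
    return np.linalg.norm(residuals, axis=1).sum()

def fixed_point_step(a, Y, eps=1e-12):
    """One step a -> C_a a / ||C_a a||, C_a = sum_i y_i y_i^T / ||P_a y_i||.

    Assumption: clamping the residual norms by eps is a simplification
    that stands in for a dedicated handling of anchor directions.
    """
    residuals = Y - np.outer(Y @ a, a)
    weights = 1.0 / np.maximum(np.linalg.norm(residuals, axis=1), eps)
    c = Y.T @ (weights * (Y @ a))       # equals C_a a
    return c / np.linalg.norm(c)        # projection back onto the sphere

def multi_start(Y, n_starts=10, n_iter=200, seed=0):
    """Iterate from random unit vectors and keep the lowest-energy result."""
    rng = np.random.default_rng(seed)
    best_a, best_e = None, np.inf
    for _ in range(n_starts):
        a = rng.standard_normal(Y.shape[1])
        a /= np.linalg.norm(a)
        for _ in range(n_iter):
            a = fixed_point_step(a, Y)
        e = energy(a, Y)
        if e < best_e:
            best_a, best_e = a, e
    return best_a, best_e
```

Comparing energies across restarts is cheap relative to the iterations themselves, so the number of restarts is mainly limited by the cost of one run.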

5 Remarks on the Offset

Finally, we want to address briefly the issue of choosing a suitable offset for the robust PCA model.

As already mentioned in the introduction, in classical PCA, solving

[TABLE]

leads to the unique affine subspace

[TABLE]

where $\hat{b}\in\mathbb{R}^{d}$ can be chosen as the mean (bias) $\bar{b}:=\frac{1}{N}(x_{1}+\ldots+x_{N})$ of the data. For the robust setting, we have assumed so far that the offset $\hat{b}$ is given, e.g., as the geometric median of the data. However, the problem

[TABLE]

does not, in general, have the geometric median as its correct offset, as we will see in the following.

Lemma 5.1.

Let $x_{i}\in\mathbb{R}^{2}$ , $i=1,\ldots,N$ . Then there exists a minimizing pair

[TABLE]

such that the line $g(t):=\hat{a}t+\hat{b}$ passes through two of the points. If $N$ is odd, then the minimizing line always passes through two points.

Proof.

Assume that $g$ is an optimal line which does not go through any of the points. Let $N_{l}$ , resp. $N_{r}$ be the number of points on the left, resp., right hand side of $g$ . Then, shifting the line by $\delta>0$ into the direction of the left, resp. right nearest point changes the distance sum by $(N_{r}-N_{l})\delta$ , resp. $(N_{l}-N_{r})\delta$ . If $N_{l}\not=N_{r}$ , then one of the new distance sums becomes smaller than the original minimal one. Hence, $N_{l}=N_{r}$ , so that one point has to be on the line if $N$ is odd and there is a line with smallest distance sum going through one point if $N$ is even. W.l.o.g., let $g$ go through $x_{N}$ . Then, choosing $b=x_{N}$ we have to show that $g$ goes through a second point. Taking polar coordinates $y_{i}:=x_{i}-x_{N}=c_{i}{\,\mathrm{e}}^{i\gamma_{i}}$ , $i\in\{1,\ldots,N-1\}$ , and $a={\,\mathrm{e}}^{i\alpha}$ , the distance sum becomes

$$\varphi(\alpha):=\sum_{i=1}^{N-1}c_{i}\,|\sin(\alpha-\gamma_{i})|.$$

If $y_{i}\not\in g$ for all $i=1,\ldots,N-1$ , then $\varphi$ is smooth and

$$\varphi^{\prime\prime}(\alpha)=-\sum_{i=1}^{N-1}c_{i}\,|\sin(\alpha-\gamma_{i})|<0,$$

so that $\alpha$ cannot be a local minimizer. Consequently, at least one more $x_{i}$ must lie on $g$ . ∎
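The statement of Lemma 5.1 suggests a simple brute-force search in the plane: it suffices to test all lines through pairs of data points. The following sketch illustrates this under the assumption that the points are pairwise distinct; the helper names `distance_sum` and `best_line_through_two_points` are hypothetical.

```python
import numpy as np
from itertools import combinations

def distance_sum(points, p, direction):
    """Sum of Euclidean distances from 2-D points to the line p + t * direction."""
    d = direction / np.linalg.norm(direction)
    normal = np.array([-d[1], d[0]])            # unit normal of the line
    return np.abs((points - p) @ normal).sum()

def best_line_through_two_points(points):
    """Search all lines through pairs of data points; return (sum, i, j)."""
    best = (np.inf, -1, -1)
    for i, j in combinations(range(len(points)), 2):
        s = distance_sum(points, points[i], points[j] - points[i])
        if s < best[0]:
            best = (s, i, j)
    return best
```

The search costs $O(N^{3})$ operations, so it is only meant as a check on small planar data sets, for instance to compare against lines anchored at the geometric median.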

Using the decomposition of $A\in\mathbb{S}_{d,d-1}$ into Givens rotation matrices, the claim can be generalized to hyperplanes of dimension $d-1$ having minimal Euclidean distance from data in $\mathbb{R}^{d}$, $d\geq 2$, see [46]. However, it would be interesting to know whether in this case also $d-1$ data points can lie within the minimizing hyperplane instead of just two of them.

Based on the lemma, the following example shows that the geometric median is in general not in the solution set of (52).

Example 5.2.

Let $x_{i}\in\mathbb{R}^{2}$ , $i=1,2,3$ span a triangle with sides $s_{1}=\|x_{2}-x_{3}\|$ , $s_{2}=\|x_{1}-x_{3}\|$ , $s_{3}=\|x_{1}-x_{2}\|$ , where $s_{1}\leq s_{2}<s_{3}$ and angles smaller than $120^{\circ}$ . By Lemma 5.1, the line having minimal Euclidean distance from the three points has to go through two points. Since the height $h_{i}$ at side $s_{i}$ , $i=1,2,3$ fulfills

[TABLE]

we conclude that the line must go through $x_{1}$ and $x_{2}$ and has distance $h_{3}$ from $x_{3}$. On the other hand, it is easy to check (and known) that the geometric median of the data points is the so-called Steiner point from which the points can be seen under an angle of $120^{\circ}$. Clearly, the minimizing line does not pass through the Steiner point.
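A concrete instance may help. The coordinates below are chosen here for illustration and are not from the paper; all decimal values are rounded.

```latex
\begin{aligned}
x_{1} &= (0,0), \qquad x_{2}=(4,0), \qquad x_{3}=(2.5,\,2),\\
s_{1} &= 2.5, \qquad s_{2}=\sqrt{10.25}\approx 3.20, \qquad s_{3}=4,
  \qquad \mathrm{area} = \tfrac{1}{2}\cdot 4\cdot 2 = 4,\\
h_{i} &= \frac{2\,\mathrm{area}}{s_{i}}: \qquad h_{1}=3.2, \qquad h_{2}\approx 2.50, \qquad h_{3}=2 .
\end{aligned}
```

The angles are approximately $38.7^{\circ}$, $53.1^{\circ}$ and $88.2^{\circ}$, so all are below $120^{\circ}$. Among the three lines through two vertices, the one through $x_{1}$ and $x_{2}$ (the horizontal axis) has the smallest distance sum $h_{3}=2$. The point from which all sides are seen under $120^{\circ}$ is approximately $(2.419,\,1.116)$, which has distance about $1.116$ from that line and therefore does not lie on it.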

6 Numerical Examples

In this section, we present various numerical examples. In particular, we compare our Algorithm 1 (with the geometric median $b$ ) with standard PCA and the following methods:

i) PCA-L1: the greedy algorithm for minimizing (8) proposed by Kwak [25]. As $b$ we used the geometric median computed by Weiszfeld’s algorithm 1.

ii) TRPCA: the trimmed PCA of Podosinnikova et al. [44] with default parameters, i.e., the lower bound on the number of true observations is set to $\frac{N}{2}$ and the number of random restarts is $10$. Here, $b$ equals the mean of a certain subset of the given data determined within the algorithm.

6.1 Image Sequence with Slightly Varying Background

We consider an image sequence with slightly varying background as it was used for object detection in various papers. The water front data set, see Fig. 4, was originally considered in [30]. It was used for performance comparisons with several robust PCA methods including those of Candès et al. [5] in the context of object detection in [44], where TRPCA outperformed the other methods. The data set consists of $633$ frames of size $128\times 160$ of a scenery with water and grass as background. Beginning with frame $481$ a person walks into the scene; we consider these frames as "outlier" frames in the data set. We aim to detect the frames with the person present, and then to separate background (scenery) and foreground (person). It turns out this can be achieved simply by thresholding the Euclidean distances of the vectorized data $x_{i}\in\mathbb{R}^{20480}$ to their geometric median. The frames with the person in them can be detected from the histogram in Fig. 4(c). More precisely, all frames with distance larger than $6$ can be considered as outliers, which exactly matches the frames with the person present. The foreground in these images can then be extracted by taking the difference image to the geometric median and subsequent pixelwise thresholding. The difference image for one frame is given in Fig. 4(d).
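The thresholding step can be sketched as follows. This is a minimal illustration on generic row-wise data, assuming the classical Weiszfeld iteration started at the mean with a small clamp `eps` to avoid division by zero; the function names and the threshold passed by the caller are hypothetical, and the value $6$ above is specific to the water front data.

```python
import numpy as np

def geometric_median(X, n_iter=200, eps=1e-12):
    """Classical Weiszfeld iteration, started at the mean of the rows of X."""
    m = X.mean(axis=0)
    for _ in range(n_iter):
        dist = np.maximum(np.linalg.norm(X - m, axis=1), eps)
        w = 1.0 / dist                                   # inverse-distance weights
        m = (w[:, None] * X).sum(axis=0) / w.sum()       # weighted mean update
    return m

def flag_outliers(X, threshold):
    """Indices of rows whose distance to the geometric median exceeds threshold."""
    m = geometric_median(X)
    dist = np.linalg.norm(X - m, axis=1)
    return np.flatnonzero(dist > threshold), dist
```

For image sequences, each row of `X` would be one vectorized frame; the returned distances are the quantities one would plot as a histogram before choosing the threshold.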

In order to make the task more challenging and simulate a gradual change in lighting conditions, we alter the data as follows. Given the points $x_{i}$ , $i=1,\ldots,633$ , we created new data

[TABLE]

Here, outlier frames cannot be found by the previous method since the distance of the frames from their geometric median varies by construction, see Fig. 5 (left). But a model with line fitting, i.e. with $K=1$, is suitable by the construction of the data. Fig. 5 depicts the histogram of the distances of the frames $\tilde{x}_{i}$ from the line generated by the standard PCA and by the residual minimizing robust PCA, respectively. The outliers can be better separated by the residual minimizing robust PCA as the frames belonging to the peak between $4$ and $5$ are less likely to be mislabeled as outliers.

Finally, Fig. 6 shows the foreground–background separation in frame $i=580$ by various methods. We show the projected data (background) $x_{i,\mathrm{rec}}=a_{1}a_{1}^{\mathrm{T}}(x_{i}-b)+b$ (left) and the residual (person) $x_{i,\mathrm{res}}=x_{i}-x_{i,\mathrm{rec}}$ , $i=580$ . In the background and foreground of standard PCA, artifacts can be clearly seen at positions where the person rests for a longer time. The PCA-L1 and our approach appear to be more robust here, but the artifacts are more pronounced in the PCA-L1. TRPCA with several restarts gives the best results.
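The two formulas for $x_{i,\mathrm{rec}}$ and $x_{i,\mathrm{res}}$ translate directly into code. A minimal sketch, assuming a direction `a` and an offset `b` have already been computed by some method; the function name `split_frame` is hypothetical.

```python
import numpy as np

def split_frame(x, a, b):
    """Return (x_rec, x_res) with x_rec = a a^T (x - b) + b and x_res = x - x_rec."""
    a = a / np.linalg.norm(a)            # make sure a is a unit vector
    x_rec = a * (a @ (x - b)) + b        # projection onto the line b + t a
    return x_rec, x - x_rec
```

The norm of `x_res` is the Euclidean distance of the frame from the fitted line, i.e. the quantity shown in the histograms of Fig. 5.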

6.2 Face reconstruction

For images of faces in the same pose but with different lighting, it may be assumed that they lie in a low dimensional subspace [8]. Thus standard PCA is a suitable tool to reduce the dimensionality of such data for classification and other tasks. In practice, however, some of the face images may be occluded resulting in outliers within the data. Here, robust PCA methods appear to be more useful. We test the performance of various approaches with the cropped Extended Yale Face database B [26]. There are $58$ images of size $168\times 192$ of which $12$ were altered with a $50\times 50$ square patch of noise at a random position, see the left image in Fig. 7 for an example.

For noiseless face image data, standard PCA projection onto a subspace of dimension $K=5$ gives good results as shown on the right of Fig. 7.
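For reference, the standard PCA projection onto a $K$-dimensional affine subspace can be sketched with a singular value decomposition. This assumes the mean as offset and the images stored as rows of `X`; the function name `pca_project` is hypothetical.

```python
import numpy as np

def pca_project(X, K):
    """Project the rows of X onto the K-dimensional affine PCA subspace."""
    b = X.mean(axis=0)                                   # offset: mean of the data
    _, _, Vt = np.linalg.svd(X - b, full_matrices=False)
    A = Vt[:K].T                                         # d x K, orthonormal columns
    X_rec = (X - b) @ A @ A.T + b                        # A A^T (x_i - b) + b per row
    return X_rec, A, b
```

With $K=5$ this yields reconstructions of the kind shown on the right of Fig. 7 when applied to noiseless data.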

In Fig. 8 the projections of the noisy data onto the subspaces obtained by various approaches are shown. As expected, the noisy patches can be clearly seen in the reconstructions by standard PCA. Surprisingly, the result of PCA-L1 looks worse than that of standard PCA, as the influence of the noisy patches is even stronger. The results of TRPCA with several restarts are very similar to those of the standard PCA of the noiseless data, except for the right eye in the second image, which appears to be too dark. This suggests that the algorithm successfully excluded the outliers and calculated the principal components from a part of the noiseless data. However, it should be mentioned that TRPCA sometimes fails to detect the outliers as it depends on the initial values of the random restarts. The results of residual minimizing robust PCA demonstrate the robustness of the method to outliers, although slight artifacts are still visible.

7 Conclusions

We proposed a Weiszfeld-like algorithm to address the robust PCA problem arising from a minimal distance function of lines from points and gave a comprehensive convergence analysis of the algorithm. We will generalize this to multiple directions by considering the minimization on Stiefel manifolds, resp. Grassmannians, where we have already recognized that several methods proposed in the literature just coincide from the point of view of Grassmannians. Further extensions of our findings are possible such as the treatment of robust independent component analysis (ICA) and PCA on manifolds, see, e.g., [10, 11, 17, 42, 43, 48]. Another modification of PCA, the so-called sparse PCA, couples the data term (1) with a sparsity term for $A\in\mathbb{R}^{d,K}$, see, e.g., [12, 32], and could be considered from the robustness point of view.

Acknowledgments

Funding by the German Research Foundation (DFG) within the Research Training Group 1932, project area P3, is gratefully acknowledged.

Appendix A: One-Sided Derivatives and Minimizers on Embedded Manifolds

The one-sided directional derivative of a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ , $d\in\mathbb{N}$ , at a point $x\in\mathbb{R}^{d}$ in direction $h\in\mathbb{R}^{d}$ is defined by

$$Df(x;h):=\lim_{t\downarrow 0}\frac{f(x+th)-f(x)}{t}.$$

Restricting $f$ to a submanifold $\mathcal{M}\subseteq\mathbb{R}^{d}$ , we can restrict our considerations to $h\in T_{x}\mathcal{M}$ . Recall that $\mathcal{M}\subseteq\mathbb{R}^{d}$ is an $m$ -dimensional submanifold of $\mathbb{R}^{d}$ if for each point $x\in\mathcal{M}$ there exists an open neighborhood $U\subseteq\mathbb{R}^{d}$ as well as an open set $\Omega\subseteq\mathbb{R}^{m}$ and a so-called parametrization $\varphi\in C^{1}(\Omega,\mathbb{R}^{d})$ of $\mathcal{M}$ with the properties

i) $\varphi(\Omega)=\mathcal{M}\cap U$,

ii) $\varphi^{-1}:\mathcal{M}\cap U\rightarrow\Omega$ is surjective and continuous, and

iii) $D\varphi(x)$ has full rank $m$ for all $x\in\Omega$.

To establish the relation between one-sided directional derivatives and local minima of functions on manifolds we need the following lemma. A proof can be found in [37, Lemma B.1].

Lemma A.1.

Let $\mathcal{M}\subset\mathbb{R}^{d}$ be an $m$ -dimensional manifold of $\mathbb{R}^{d}$ . Then the tangent space $T_{x}\mathcal{M}$ and the tangent cone

[TABLE]

coincide.

The following theorem gives a general necessary and sufficient condition for local minimizers of Lipschitz continuous functions on embedded manifolds using the notion of one-sided derivatives. For the Euclidean setting $\mathcal{M}=\mathbb{R}^{d}$, the first relation of the theorem is trivially fulfilled for any function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$, while a proof of the sufficient minimality condition in the second part was given in [4]. Moreover, the authors of [4] gave an example that Lipschitz continuity in the second part cannot be weakened to just continuity.

Theorem A.2.

Let $\mathcal{M}\subset\mathbb{R}^{d}$ be an $m$ -dimensional submanifold of $\mathbb{R}^{d}$ and $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ a locally Lipschitz continuous function. Then the following holds true:

1. If $\hat{x}\in\mathcal{M}$ is a local minimizer of $f$ on $\mathcal{M}$, then $Df(\hat{x};h)\geq 0$ for all $h\in T_{\hat{x}}\mathcal{M}$.

2. If $Df(\hat{x};h)>0$ for all $h\in T_{\hat{x}}\mathcal{M}\setminus\{\mathbf{0}\}$, then $\hat{x}$ is a strict local minimizer of $f$ on $\mathcal{M}$.

A proof can be found in [37, Thm. 6.1] along with an example which demonstrates the necessity of the Lipschitz continuity of $f$ in the manifold setting in the first part of the theorem. Furthermore, note that $Df(\hat{x};h)\geq 0$ for all $h\in T_{\hat{x}}\mathcal{M}\setminus\{\mathbf{0}\}$ does not imply that $\hat{x}$ is a local minimizer of $f$ on $\mathcal{M}$ .
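Both parts of the theorem, and the remark that $Df(\hat{x};h)\geq 0$ alone is not sufficient, can be illustrated numerically on the unit circle. The example functions below are chosen here for illustration and are not from the paper; the helper `one_sided_derivative` is a plain forward difference quotient and only approximates the limit.

```python
import numpy as np

def one_sided_derivative(f, x, h, t=1e-7):
    """Forward quotient (f(x + t h) - f(x)) / t approximating Df(x; h)."""
    x, h = np.asarray(x, dtype=float), np.asarray(h, dtype=float)
    return (f(x + t * h) - f(x)) / t

p = np.array([1.0, 0.0])                  # a point on the unit circle S^1
tangent = np.array([0.0, 1.0])            # T_p S^1 is spanned by (0, 1)

f = lambda x: np.linalg.norm(x - p)       # Lipschitz, not differentiable at p
g = lambda x: -np.sum((x - p) ** 2)       # smooth, p is a strict maximizer

df_plus = one_sided_derivative(f, p, tangent)     # about ||h|| = 1 > 0
df_minus = one_sided_derivative(f, p, -tangent)   # about ||h|| = 1 > 0
dg_plus = one_sided_derivative(g, p, tangent)     # about 0, yet p is no minimizer
```

For `f`, the one-sided derivative is positive in both tangent directions, in line with $p$ being a strict local minimizer on the circle. For `g`, the one-sided derivatives vanish although $p$ maximizes `g`, which matches the closing remark above.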

Bibliography

[1] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
[2] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: Proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1-2, Ser. A):91–129, 2013.
[3] A. Beck and S. Sabach. Weiszfeld’s method: Old and new results. Journal of Optimization Theory and Applications, 164(1):1–40, 2015.
[4] A. Ben-Tal and J. Zowe. Directional derivatives in nonsmooth optimization. Journal of Optimization Theory and Applications, 47(4):483–490, 1985.
[5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):Art. 11, 2011.
[6] E. Chouzenoux, J. Idier, and S. Moussaoui. A majorize-minimize strategy for subspace optimization applied to image restoration. IEEE Transactions on Image Processing, 20(6):1517–1528, 2011.
[7] C. Ding, D. Zhou, X. He, and H. Zha. $R_{1}$-PCA: Rotational invariant $L_{1}$-norm principal component analysis for robust subspace factorization. In Proceedings of the 23rd international conference on Machine learning, pages 281–288. ACM, 2006.
[8] R. Epstein, P. Hallinan, and A. Yuille. $5\pm 2$ eigenimages suffice: An empirical investigation of low-dimensional lighting models. In IEEE Workshop on Physics-Based Vision, pages 108–116, 1995.
