Multivariate Regression with Gross Errors on Manifold-valued Data

Xiaowei Zhang; Xudong Shi; Yu Sun; Li Cheng

arXiv:1703.08772·stat.ML·September 12, 2017


TL;DR

This paper introduces a novel multivariate regression model for manifold-valued data that effectively handles gross errors by correcting responses via geodesic curves, and employs a specialized optimization algorithm with proven convergence.

Contribution

The paper proposes PALMR, a new approach for robust multivariate regression on manifolds with gross errors, extending proximal alternating linearized minimization techniques.

Findings

01. Outperforms existing models on synthetic data.

02. Effective in identifying gross errors in diffusion tensor imaging.

03. Converges to a critical point under mild conditions.

Abstract

We consider the topic of multivariate regression on manifold-valued output, that is, for a multivariate observation, its output response lies on a manifold. Moreover, we propose a new regression model to deal with the presence of grossly corrupted manifold-valued responses, a bottleneck issue commonly encountered in practical scenarios. Our model first takes a correction step on the grossly corrupted responses via geodesic curves on the manifold, and then performs multivariate linear regression on the corrected data. This results in a nonconvex and nonsmooth optimization problem on manifolds. To this end, we propose a dedicated approach named PALMR, by utilizing and extending the proximal alternating linearized minimization techniques. Theoretically, we investigate its convergence property, where it is shown to converge to a critical point under mild conditions. Empirically, we test our…


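Tables

Table 1: Median values of prediction errors on all six slices of testing data. We use two metrics, relative FA error and MSGE, to measure the prediction error. The best results in each setting are highlighted in bold.

| Slice | Metric | Method | No gross error | 20% manual gross error | 20% registration error |
|---|---|---|---|---|---|
| z = 32 | Relative FA error | FA regression | 0.9376 | 1.0414 | 0.9467 |
| z = 32 | Relative FA error | MGLM | 0.3223 | 0.4349 | 0.1654 |
| z = 32 | Relative FA error | PALMR | **0.3210** | **0.3409** | **0.1316** |
| z = 32 | MSGE | MGLM | 0.1475 | 0.3530 | 0.1949 |
| z = 32 | MSGE | PALMR | **0.1386** | **0.2196** | **0.1508** |
| x = 55 | Relative FA error | FA regression | 0.9238 | 1.0362 | 0.8688 |
| x = 55 | Relative FA error | MGLM | 0.3298 | 0.5089 | 0.2067 |
| x = 55 | Relative FA error | PALMR | **0.3279** | **0.3682** | **0.1882** |
| x = 55 | MSGE | MGLM | 0.1606 | 0.3631 | 0.3513 |
| x = 55 | MSGE | PALMR | **0.1602** | **0.2562** | **0.2915** |
| y = 64 | Relative FA error | FA regression | 0.8822 | 1.0136 | 0.9528 |
| y = 64 | Relative FA error | MGLM | **0.3162** | 0.4564 | 0.1917 |
| y = 64 | Relative FA error | PALMR | 0.3166 | **0.3665** | **0.1562** |
| y = 64 | MSGE | MGLM | 0.1687 | 0.3720 | 0.2449 |
| y = 64 | MSGE | PALMR | **0.1614** | **0.2843** | **0.1906** |
| z = 24 | Relative FA error | FA regression | 0.8478 | 1.0066 | 0.8144 |
| z = 24 | Relative FA error | MGLM | 0.3570 | 0.7342 | 0.2140 |
| z = 24 | Relative FA error | PALMR | **0.3564** | **0.5081** | **0.1581** |
| z = 24 | MSGE | MGLM | 0.1227 | 0.3466 | 0.2954 |
| z = 24 | MSGE | PALMR | **0.1160** | **0.2530** | **0.2445** |
| x = 64 | Relative FA error | FA regression | 0.9723 | 1.0526 | 0.9067 |
| x = 64 | Relative FA error | MGLM | 0.2142 | 0.4053 | 0.5023 |
| x = 64 | Relative FA error | PALMR | **0.2114** | **0.3318** | **0.4318** |
| x = 64 | MSGE | MGLM | 0.1646 | 0.3663 | 0.2436 |
| x = 64 | MSGE | PALMR | **0.1639** | **0.2779** | **0.2226** |
| y = 45 | Relative FA error | FA regression | 0.9715 | 1.0695 | 0.9379 |
| y = 45 | Relative FA error | MGLM | 0.3779 | 0.5976 | 0.1739 |
| y = 45 | Relative FA error | PALMR | **0.3767** | **0.5319** | **0.1664** |
| y = 45 | MSGE | MGLM | 0.2162 | 0.4205 | 0.2928 |
| y = 45 | MSGE | PALMR | **0.2113** | **0.3780** | **0.2593** |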

Equations

$$\text{prox}_{\lambda}^{\sigma}(\bm{p}):=\underset{\bm{z}}{\text{argmin}}\left\{\sigma(\bm{z})+\frac{\lambda}{2}\|\bm{z}-\bm{p}\|^{2}\right\}$$

$$\liminf_{\bm{q}\to\bm{p},\,\bm{q}\neq\bm{p}}\frac{\sigma(\bm{q})-\sigma(\bm{p})-\left\langle\bm{v},~\gamma^{\prime}(0)\right\rangle}{d(\bm{p},\bm{q})}\geq 0,$$

$$\partial\sigma(\bm{p})=\left\{\bm{v}\in T_{\bm{p}}\mathcal{M}:\exists\,\bm{p}^{k}\to\bm{p},~\sigma(\bm{p}^{k})\to\sigma(\bm{p}),~\exists\,\bm{v}^{k}\in\hat{\partial}\sigma(\bm{p}^{k})~\mbox{s.t.}~P_{\gamma^{k}(0)\gamma^{k}(1)}(\bm{v}^{k})\to\bm{v}\right\},$$

$$\mbox{crit}~\sigma=\{\bm{x}\in\mathcal{M}:0\in\partial\sigma(\bm{x})\}.$$

$$\|\partial\sigma(\gamma(t))-P_{\gamma(0)\gamma(t)}\partial\sigma(\bm{p})\|_{\gamma(t)}\leq L\,l(t),\quad\forall t\in[0,r],$$

$$\|\partial\sigma(\gamma(t))-P_{\gamma(0)\gamma(t)}\partial\sigma(\bm{p})\|_{\gamma(t)}\leq L\,d(\bm{p},\gamma(t)).$$

$$\mbox{dist}(\bm{x},A):=\inf\{d(\bm{x},\bm{y}):\bm{y}\in A\},$$

$$[\alpha\leq\sigma\leq\beta]:=\{\bm{x}\in\mathcal{M}:\alpha\leq\sigma(\bm{x})\leq\beta\}.$$

$$\phi^{\prime}\big(\sigma(\bm{x})-\sigma(\bar{\bm{x}})\big)\,\mbox{dist}\big(0,\partial\sigma(\bm{x})\big)\geq 1,$$

$$Y=XV^{*}+Z,\tag{1}$$

$$Y=XV^{*}+G^{*}+Z,\tag{2}$$

$$\min_{V,G}~\frac{1}{2}\|Y-XV-G\|_{F}^{2}+\lambda R_{v}(V)+\rho R_{g}(G),$$

$$\text{Exp}_{\bm{y}_{i}}(\bm{g}_{i})=\text{Exp}\Big(\text{Exp}\big(\bm{p},\sum_{j=1}^{d}x_{i}^{j}\bm{v}_{j}\big),~\bm{z}_{i}\Big),$$

$$E\left(\bm{p},\{\bm{v}_{j}\},\{\bm{g}_{i}\}\right):=\frac{1}{2}\sum_{i}d^{2}\Big(\text{Exp}_{\bm{y}_{i}}(\bm{g}_{i}),\text{Exp}_{\bm{p}}\big(\sum_{j}x_{i}^{j}\bm{v}_{j}\big)\Big)$$

$$(\tilde{\bm{p}},\{\tilde{\bm{v}}_{j}\},\{\tilde{\bm{g}}_{i}\})=\underset{(\bm{p},\{\bm{v}_{j}\})\in\mathcal{M}\times T\mathcal{M},~\bm{g}_{i}\in T_{\bm{y}_{i}}\mathcal{M}}{\text{argmin}}~E\left(\bm{p},\{\bm{v}_{j}\},\{\bm{g}_{i}\}\right)+\lambda R_{v}(\{\bm{v}_{j}\})+\rho R_{g}(\{\bm{g}_{i}\}),$$

$$\min_{(\bm{p},\{\bm{v}_{j}\})\in\mathcal{M}\times T\mathcal{M},~\bm{g}_{i}\in T_{\bm{y}_{i}}\mathcal{M}}E\left(\bm{p},\{\bm{v}_{j}\},\{\bm{g}_{i}\}\right)+\lambda\sum_{j=1}^{d}\left\lVert\bm{v}_{j}\right\rVert_{\bm{p}}+\rho\sum_{i=1}^{N}\left\lVert\bm{g}_{i}\right\rVert_{\bm{y}_{i}}.$$

$$\min_{\bm{v}_{j},\bm{g}_{i}\in\mathbb{R}^{m}}~\frac{1}{2}\sum_{i=1}^{N}\Big\|\bm{y}_{i}-\sum_{j=1}^{d}x_{i}^{j}\bm{v}_{j}+\bm{g}_{i}\Big\|^{2}+\lambda\sum_{j=1}^{d}\|\bm{v}_{j}\|+\rho\sum_{i=1}^{N}\|\bm{g}_{i}\|,$$

$$\min_{V,G}~\frac{1}{2}\|Y-XV-G\|_{F}^{2}+\lambda\|V\|_{1,2}+\rho\|G\|_{1,2},$$

$$\min_{\bm{v}_{j},\bm{g}_{i}\in\mathbb{R}^{m}}~\cdots~+\lambda\sum_{j=1}^{d}\|\bm{v}_{j}\|+\rho\sum_{i=1}^{N}\|\bm{g}_{i}\|,$$

$$\min_{V,G}~\cdots\quad\mbox{s.t.}~\cdots$$

$$\min_{(\bm{p},\{\bm{v}_{j}\})\in\mathcal{M}\times T\mathcal{M}}~\frac{1}{2}\sum_{i=1}^{N}d^{2}\bigg(\bm{y}_{i},\text{Exp}_{\bm{p}}\Big(\sum_{j=1}^{d}x_{i}^{j}\bm{v}_{j}\Big)\bigg),$$

$$\min_{x\in\mathcal{M}_{1},\,y\in\mathcal{M}_{2}}\Psi(x,y):=f(x)+g(y)+h(x,y),$$

$$x^{k+1}\in\underset{x\in\mathcal{M}_{1}}{\text{argmin}}~\cdots~+\frac{c_{k}}{2}d_{\mathcal{M}_{1}}^{2}(x^{k},x),$$

$$y^{k+1}\in\underset{y\in\mathcal{M}_{2}}{\text{argmin}}~\cdots~+\frac{d_{k}}{2}d_{\mathcal{M}_{2}}^{2}(y^{k},y),$$

$$v^{k}\in\underset{v\in T_{x^{k}}\mathcal{M}_{1}}{\text{argmin}}~(f\circ\text{Exp}_{x^{k}})(v)+\left\langle v,~\partial_{x}h(x^{k},y^{k})\right\rangle+\frac{c_{k}}{2}\|v\|^{2},$$

$$v^{k}\in\underset{v\in T_{x^{k}}\mathcal{M}_{1}}{\text{argmin}}~(f\circ\text{Exp}_{x^{k}})(v)+\frac{c_{k}}{2}\Big\|v+\frac{1}{c_{k}}\partial_{x}h(x^{k},y^{k})\Big\|^{2},$$

$$v^{k}=\text{prox}_{c_{k}}^{f\circ\text{Exp}_{x^{k}}}\Big(-\frac{1}{c_{k}}\partial_{x}h(x^{k},y^{k})\Big).$$

$$\inf_{k\in\mathbb{N}}\{L_{1}(y^{k})\}\geq\lambda_{1}^{-},\quad\inf_{k\in\mathbb{N}}\{L_{2}(x^{k})\}\geq\lambda_{2}^{-},$$


Taxonomy

Topics: Advanced Neuroimaging Techniques and Applications · Morphological variations and asymmetry · Bone and Joint Diseases

Methods: Linear Regression

Full text

Multivariate Regression with Gross Errors on Manifold-valued Data

Xiaowei Zhang, Xudong Shi, Yu Sun, Li Cheng

Xiaowei Zhang is with Bioinformatics Institute, A*STAR, Singapore. Xudong Shi is with School of Computing, National University of Singapore, Singapore. The research was carried out when he was an intern at A*STAR. Yu Sun is with the Singapore Institute for Neurotechnology, National University of Singapore, Singapore 117456. Li Cheng is with Bioinformatics Institute, A*STAR, Singapore (corresponding author).

Abstract

We consider the topic of multivariate regression on manifold-valued output, that is, for a multivariate observation, its output response lies on a manifold. Moreover, we propose a new regression model to deal with the presence of grossly corrupted manifold-valued responses, a bottleneck issue commonly encountered in practical scenarios. Our model first takes a correction step on the grossly corrupted responses via geodesic curves on the manifold, then performs multivariate linear regression on the corrected data. This results in a nonconvex and nonsmooth optimization problem on Riemannian manifolds. To this end, we propose a dedicated approach named PALMR, by utilizing and extending the proximal alternating linearized minimization techniques for optimization problems on Euclidean spaces. Theoretically, we investigate its convergence property, where it is shown to converge to a critical point under mild conditions. Empirically, we test our model on both synthetic and real diffusion tensor imaging data, and show that our model outperforms other multivariate regression models when manifold-valued responses contain gross errors, and is effective in identifying gross errors.

Index Terms:

Manifold-valued data, multivariate linear regression, gross error, nonsmooth optimization on manifolds, diffusion tensor imaging.

1 Introduction

This paper focuses on multivariate regression on manifolds [1, 2, 3, 4], where given a multivariate observation $\bm{x}\in\mathbb{R}^{d}$, the output response $\bm{y}$ lies on a Riemannian manifold $\mathcal{M}$. This line of work has many applications. For example, research evidence in diffusion tensor imaging (DTI) (e.g. [5]) indicates that the shape and orientation of diffusion tensors are profoundly affected by age, gender and handedness (i.e. left- or right-handed). In particular, we consider noisy manifold-valued output scenarios where data are subject to sporadic contamination by gross errors of large or even unbounded magnitude. Such grossly corrupted data are often encountered in practice due to unreliable data collection or data with missing values. For example, errors in DTI data can be introduced by Echo-Planar Imaging (EPI) distortion [6] or inter-subject registration [7], where practical measurement errors such as Rician noise or other sensor noise have a significant impact on the shape and orientation of tensors [8, 9]. Although the problem of learning from data with possible gross errors in Euclidean spaces has gained increasing interest [10, 11, 12, 13, 14, 15], to the best of our knowledge, there is no prior work dealing with manifold-valued responses with gross errors.

Our main idea can be summarized as follows. For each manifold-valued response $\bm{y}\in\mathcal{M}$, we explicitly model its possible gross error (in $\bm{y}$). This gives rise to a corrected manifold-valued response $\bm{y}^{c}$, obtained by removing the identified gross error component from $\bm{y}$, which is realized via geodesic curves on $\mathcal{M}$. Note that $\bm{y}^{c}$ could be the same as $\bm{y}$, corresponding to no gross error in $\bm{y}$. The corrected manifold-valued data can then be used as the responses in multivariate geodesic regression, which boils down to a known problem [2]. More details are illustrated in Figure 1 and are fully described in Section 3. Unfortunately, the induced optimization problem is rather challenging, as it contains nonconvex and nonsmooth functions on manifolds. Inspired by the recent development of proximal alternating linearized minimization (PALM) methods in Euclidean spaces, in this paper we propose to generalize this technique to Riemannian manifolds [16], and we name the resulting method PALMR.

The main contributions of this paper are three-fold. First, we address a novel problem of multivariate regression on manifolds where the manifold-valued responses are subject to possible contamination by gross errors. Second, a new algorithm named PALMR is proposed to tackle the induced nonconvex and nonsmooth optimization on manifolds, for which we also provide a convergence analysis. The algorithm and analysis are applicable to a class of nonconvex and nonsmooth optimization problems on manifolds. Empirically, our algorithm has been evaluated on both synthetic and real DTI data, where results suggest the algorithm is effective in identifying gross errors and recovering corrupted data, and it produces better predictive results than regression models that do not consider gross errors. Third, our approach makes connections to two established research areas, namely learning from grossly corrupted data and multivariate regression on manifolds: when we restrict ourselves to the special case of Euclidean space, our approach reduces to robust regression considered in e.g. [13, 14]; on the other hand, when there is no gross error, the problem boils down to that of multivariate regression on manifolds as considered in [2], where the method of [2] can be regarded as a special case of our approach. Our implementation is publicly available at the project website http://web.bii.a-star.edu.sg/~zhangxw/palmr-SPD/.

1.1 Related work

Manifold-valued data arise from a wide range of application domains including neural imaging [17], shape modelling [18, 19, 20], robotics [21], graphics [22], and symmetric positive matrices [23, 24, 25, 26]. One prominent example is DTI [24] where data lie in the Riemannian manifold of $3\times 3$ symmetric positive definite (SPD) matrices. In this work, we use $\mathcal{S}(n)$ and $\mathcal{S}_{++}(n)$ to denote the set of $n\times n$ symmetric matrices and $n\times n$ SPD matrices, respectively. Other examples include higher angular resolution diffusion imaging where data can be modelled as the square root of orientation distribution functions lying on the unit sphere [27, 2], as well as group-valued data such as $SO(3)$ and $SE(3)$ in shape analysis [19] and robotics [21]. It is well known that for such scenarios, it is in general much better to conduct statistical analysis directly on the manifold (i.e. curved space) instead of in the ambient Euclidean space (i.e. flat space), which we also verify empirically.

Unsurprisingly, there exists plenty of prior work studying statistics on manifolds [28, 19, 29, 24, 30]. This is to be distinguished from the well-known topic of manifold learning [31], where the data are assumed to be sampled from a certain manifold embedded in a usually much higher dimensional Euclidean space and one is supposed to extract intrinsic geometric properties of the manifold from observations. Here, instead, the manifold is usually known a priori, and the task is to engage appropriate statistical models in the analysis of the manifold-valued data.

In the area of regression on manifolds, Fletcher [28] proposes geodesic regression that generalizes univariate linear regression on flat spaces to manifolds by regressing a manifold-valued response from a real-valued independent data with a geodesic curve. [27] adapts the idea of geodesic regression for regressing sphere-valued data against real scalar. [20] investigates parametric polynomial regression on Riemannian manifolds, while [1] studies regression on the group of diffeomorphisms for detecting longitudinal anatomical shape changes. Banerjee et al. [32] propose a nonlinear kernel-based regression method for manifold-valued data. Hong et al. [33] propose a shooting spline-based regression technique specifically designed for the Grassmannian manifold. [34, 35] investigate a family of nonparametric regression models for data on manifolds. The closest work might be [2], which extends the idea of geodesic regression [28] to multivariate regression on manifolds, and applies it to analyze diffusion weighted imaging data. In [3], the authors investigate multivariate regression models on Riemannian symmetric spaces from a statistical perspective and develop several test statistics for evaluating linear hypotheses of the regression coefficients. In the area of learning with grossly corrupted data, there have been various methods [10, 13, 14, 15, 36] proposed for linear regression with gross errors in the Euclidean space, among which robust lasso in [13] and robust multi-task regression in [14] can be considered as special cases of our approach when restricted to Euclidean spaces.

A recent trend in manifold data analysis is kernel methods on manifolds, which aim at embedding the manifold into a reproducing kernel Hilbert space (RKHS). In [37] and [38], kernel methods are developed for sparse coding and dictionary learning on SPD and Grassmann manifolds, respectively. In [39], kernels on SPD and Grassmann manifolds are considered for classification. As it is important for such kernels on manifolds to satisfy the positive definiteness constraint, significant efforts [40, 41, 42, 43, 44] have been made in this regard. Meanwhile, as shown in [44], these kernels tend to either disregard the original Riemannian structure due to the linearization requirement, or violate the positive definiteness constraint. In particular, a geodesic Gaussian kernel is positive definite only if the underlying manifold is Euclidean. Moreover, a geodesic Laplacian kernel is positive definite if and only if conditionally negative definite conditions are satisfied, which is in general not true for curved Riemannian manifolds. These results suggest that the application of kernel methods on curved manifolds has its limitations. On the other hand, it is also of interest for the community to investigate approaches other than kernel-based methods. This motivates us to consider in this work a manifold-valued geodesic regression approach that directly uses the intrinsic Riemannian metric.

2 Background

We first briefly review some concepts in Riemannian manifolds in subsection 2.1, nonsmooth analysis and Kurdyka–Łojasiewicz property on Riemannian manifolds in subsections 2.2 and 2.3, respectively, which are necessary for the derivation of our algorithm and the proof of convergence. We then review some models regarding multivariate linear regression with gross errors in Euclidean space, whose ideas are utilized to design our new model.

2.1 Riemannian manifolds

Let $(\mathcal{M},\varrho)$ denote a smooth manifold $\mathcal{M}$ endowed with a Riemannian metric $\varrho$. Moreover, $T_{\bm{p}}\mathcal{M}$ denotes the tangent space at point $\bm{p}$ and $T\mathcal{M}:=\cup_{\bm{p}\in\mathcal{M}}T_{\bm{p}}\mathcal{M}$ denotes the tangent bundle. The notation $(\bm{p},\bm{v})\in\mathcal{M}\times T\mathcal{M}$ refers to $\bm{p}$ being a point of $\mathcal{M}$ and $\bm{v}$ being a tangent vector at $\bm{p}$. $\left\langle\bm{u},~\bm{v}\right\rangle_{\bm{p}}:=\varrho_{\bm{p}}(\bm{u},\bm{v})$ is the inner product between two vectors $\bm{u}$ and $\bm{v}$ in $T_{\bm{p}}\mathcal{M}$, with $\varrho_{\bm{p}}$ being the metric at $\bm{p}$. The induced norm thus becomes $\|\bm{u}\|_{\bm{p}}:=\left\langle\bm{u},~\bm{u}\right\rangle_{\bm{p}}^{1/2}$. Let $\gamma:[a,b]\to\mathcal{M}$ be a piecewise smooth curve such that $\gamma(a)=\bm{p}$ and $\gamma(b)=\bm{q}$, with curve length $\int_{a}^{b}\|\gamma^{\prime}(t)\|_{\gamma(t)}dt$, where $\gamma^{\prime}(t)$ denotes the derivative. The Riemannian distance $d_{\mathcal{M}}(\bm{p},\bm{q})$ between $\bm{p}$ and $\bm{q}$ is defined as the infimum of the length over all piecewise smooth curves joining these two points. Let $\nabla$ be the Levi-Civita connection associated with $(\mathcal{M},\varrho)$. (Roughly speaking, a connection acts as a generalization of the directional derivative that connects tangent spaces of nearby points and provides a consistent manner of transporting tangent vectors from one point to another along geodesic curves. A manifold may have many connections. The Levi-Civita connection, also called the Riemannian connection, is the unique connection that is symmetric and compatible with the Riemannian metric.) A curve $\gamma$ is called a geodesic if $\nabla_{\gamma^{\prime}}\gamma^{\prime}=0$. A Riemannian manifold is complete if its geodesics $\gamma(t)$ are defined for any value of $t\in\mathbb{R}$. The parallel transport along $\gamma$ from $\bm{p}=\gamma(a)$ to $\bm{q}=\gamma(b)$ is a mapping $P_{\gamma(a)\gamma(b)}:T_{\bm{p}}\mathcal{M}\to T_{\bm{q}}\mathcal{M}$ defined by $P_{\gamma(a)\gamma(b)}(\bm{v})=V(b)$, where $V$ is the unique vector field satisfying $\nabla_{\gamma^{\prime}}V=0$ and $V(a)=\bm{v}$. The exponential map at point $\bm{p}$ is a mapping $\text{Exp}_{\bm{p}}:T_{\bm{p}}\mathcal{M}\to\mathcal{M}$ defined as $\text{Exp}_{\bm{p}}(\bm{v})=\gamma(1)$, where $\gamma:[0,1]\to\mathcal{M}$ is the geodesic such that $\gamma(0)=\bm{p}$ and $\gamma^{\prime}(0)=\bm{v}$. The inverse of the exponential map, if it exists, is denoted by $\text{Exp}_{\bm{p}}^{-1}$. To simplify notation, we also use $\left\langle,~\right\rangle$, $\|\cdot\|$, $d(\cdot,\cdot)$, and $\text{Exp}(\bm{p},\bm{v})$ to denote the inner product, norm, Riemannian distance, and exponential map, respectively, when there is no confusion. We focus on a Hadamard manifold $\mathcal{M}$, which is a complete and simply connected finite dimensional Riemannian manifold with nonpositive sectional curvature. The class of Hadamard manifolds possesses many nice properties: for example, any two points in $\mathcal{M}$ can be joined by a unique geodesic. In this case, the exponential map is a global diffeomorphism and $d(\bm{p},\bm{q})=\|\text{Exp}_{\bm{p}}^{-1}\bm{q}\|_{\bm{p}}$. One example of a Hadamard manifold is the manifold of symmetric positive definite matrices. Interested readers can consult [45] for further details of manifolds and differential geometry.
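As a concrete illustration of the exponential map, its inverse, and the Riemannian distance on a Hadamard manifold, the following Python sketch implements them for SPD matrices. It assumes the affine-invariant metric $\left\langle\bm{u},~\bm{v}\right\rangle_{\bm{p}}=\text{tr}(\bm{p}^{-1}\bm{u}\bm{p}^{-1}\bm{v})$, which this section does not specify, and all function names are illustrative rather than taken from the authors' implementation.

```python
import numpy as np


def _sym_fun(a, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, q = np.linalg.eigh(a)
    return (q * fun(w)) @ q.T


def _half_powers(p):
    """Return p^{1/2} and p^{-1/2} for an SPD matrix p."""
    return _sym_fun(p, np.sqrt), _sym_fun(p, lambda w: 1.0 / np.sqrt(w))


def spd_exp(p, v):
    """Exp_p(v) under the (assumed) affine-invariant metric."""
    p_half, p_ihalf = _half_powers(p)
    return p_half @ _sym_fun(p_ihalf @ v @ p_ihalf, np.exp) @ p_half


def spd_log(p, q):
    """Exp_p^{-1}(q), the tangent vector at p pointing towards q."""
    p_half, p_ihalf = _half_powers(p)
    return p_half @ _sym_fun(p_ihalf @ q @ p_ihalf, np.log) @ p_half


def spd_norm(p, v):
    """Induced norm ||v||_p = ||p^{-1/2} v p^{-1/2}||_F."""
    _, p_ihalf = _half_powers(p)
    return np.linalg.norm(p_ihalf @ v @ p_ihalf)


def spd_dist(p, q):
    """Riemannian distance d(p, q) = ||Exp_p^{-1}(q)||_p."""
    return spd_norm(p, spd_log(p, q))
```

With these helpers, `spd_dist(p, spd_exp(p, v))` should agree with `spd_norm(p, v)` up to rounding, which mirrors the identity $d(\bm{p},\bm{q})=\|\text{Exp}_{\bm{p}}^{-1}\bm{q}\|_{\bm{p}}$ stated above.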

2.2 Nonsmooth analysis on Riemannian manifolds

Given an extended real-valued function $\sigma:\mathcal{M}\to\mathbb{R}\cup\{+\infty\}$, we define its domain by $\mbox{dom}~\sigma:=\{\bm{p}\in\mathcal{M}:\sigma(\bm{p})<+\infty\}$ and its epigraph by $\mbox{epi}~\sigma:=\{(\bm{p},\beta)\in\mathcal{M}\times\mathbb{R}:\sigma(\bm{p})\leq\beta\}$. We say that $\sigma$ is lower semicontinuous if $\mbox{epi}~\sigma$ is closed, and proper if $\mbox{dom}~\sigma\neq\emptyset$ and $\sigma(\bm{p})>-\infty$ for all $\bm{p}\in\mbox{dom}~\sigma$. Proper and lower semicontinuous (PLS) functions play important roles in optimization, since these properties guarantee the well-definedness of the proximal operator. In particular, given $\bm{p}$ and $\lambda>0$, the proximal map defined as

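$$\text{prox}_{\lambda}^{\sigma}(\bm{p}):=\underset{\bm{z}}{\text{argmin}}\left\{\sigma(\bm{z})+\frac{\lambda}{2}\|\bm{z}-\bm{p}\|^{2}\right\}$$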

is well-defined when $\sigma$ is PLS and $\text{inf}~\sigma(\bm{z})>0$. In Section 3, we will see that the objective function in our approach is PLS. Moreover, we have the following definition of the (sub)differential of PLS functions on manifolds.

Definition 1 ([46]).

Let $\sigma$ be a PLS function, then

• the Fréchet subdifferential of $\sigma$ at any $\bm{p}\in\mbox{dom}~\sigma$, denoted as $\hat{\partial}\sigma(\bm{p})$, is defined as the set of all $\bm{v}\in T_{\bm{p}}\mathcal{M}$ that satisfy

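$$\liminf_{\bm{q}\to\bm{p},\,\bm{q}\neq\bm{p}}\frac{\sigma(\bm{q})-\sigma(\bm{p})-\left\langle\bm{v},~\gamma^{\prime}(0)\right\rangle}{d(\bm{p},\bm{q})}\geq 0,$$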

for geodesic $\gamma$ joining $\gamma(0)=\bm{p}$ and $\gamma(1)=\bm{q}$ . When $\bm{p}\notin\mbox{dom}~{}\sigma$ , we set $\hat{\partial}\sigma(\bm{p})=\emptyset$ .

• the (limiting) subdifferential of $\sigma$ at any $\bm{p}\in\mathcal{M}$, denoted as $\partial\sigma(\bm{p})$, is defined as

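$$\partial\sigma(\bm{p})=\left\{\bm{v}\in T_{\bm{p}}\mathcal{M}:\exists\,\bm{p}^{k}\to\bm{p},~\sigma(\bm{p}^{k})\to\sigma(\bm{p}),~\exists\,\bm{v}^{k}\in\hat{\partial}\sigma(\bm{p}^{k})~\mbox{s.t.}~P_{\gamma^{k}(0)\gamma^{k}(1)}(\bm{v}^{k})\to\bm{v}\right\},$$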

where $\gamma^{k}$ is the geodesic joining $\bm{p}^{k}$ and $\bm{p}$ .

• $\bm{p}\in\mathcal{M}$ is a critical point of $\sigma$ if $0\in\partial\sigma(\bm{p})$. We denote the set of critical points of $\sigma$ by $\mbox{crit}~\sigma$. That is,

$$\mbox{crit}~\sigma=\{\bm{x}\in\mathcal{M}:0\in\partial\sigma(\bm{x})\}.$$

If $\bm{p}$ is a local minimizer of $\sigma$, then by Fermat's rule $0\in\partial\sigma(\bm{p})$. If $\sigma$ is differentiable, then its subdifferential reduces to a unique gradient, denoted as $\mbox{grad}~\sigma$, which is a vector field satisfying $\left\langle\mbox{grad}~\sigma(\bm{p}),~\bm{v}\right\rangle_{\bm{p}}=\bm{v}(\sigma)$ for all $\bm{v}\in T_{\bm{p}}\mathcal{M}$ and $\bm{p}\in\mathcal{M}$. Here $\bm{v}(\sigma)$ denotes the directional derivative of $\sigma$ in the direction $\bm{v}$. In this case $\partial\sigma(\bm{p})=\{\mbox{grad}~\sigma(\bm{p})\}$. Moreover, we have the following definition of Lipschitz gradients for smooth functions on manifolds:

Definition 2 ([47]).

Let $\sigma:\mathcal{M}\to\mathbb{R}$ be a continuously differentiable function and $L>0$. $\sigma$ is said to have an $L$-Lipschitz gradient if, for any $\bm{p},~\bm{q}\in\mathcal{M}$ and any geodesic segment $\gamma:[0,r]\to\mathcal{M}$ joining $\bm{p}$ and $\bm{q}$, we have

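$$\|\partial\sigma(\gamma(t))-P_{\gamma(0)\gamma(t)}\partial\sigma(\bm{p})\|_{\gamma(t)}\leq L\,l(t),\quad\forall t\in[0,r],$$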

where $\gamma(0)=\bm{p}$ , $P_{\gamma(0)\gamma(t)}$ is the parallel transport along $\gamma$ from $\bm{p}$ to $\gamma(t)$ , and $l(t)$ denotes the length of the segment between $\bm{p}$ and $\gamma(t)$ . In addition, if $\mathcal{M}$ is a Hadamard manifold, then the last inequality becomes

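$$\|\partial\sigma(\gamma(t))-P_{\gamma(0)\gamma(t)}\partial\sigma(\bm{p})\|_{\gamma(t)}\leq L\,d(\bm{p},\gamma(t)).$$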

Since $\sigma$ is continuously differentiable, $\partial\sigma(\gamma(t))$ and $\partial\sigma(\bm{p})$ are the unique tangent vectors at $\gamma(t)$ and $\bm{p}$, respectively. Parallel transport thus becomes necessary to move them onto the same tangent space. Note that in general, the right-hand sides of the two inequalities above are different. This is because, for non-Hadamard manifolds, the geodesic between two points is usually not unique. Since $d(\bm{p},\gamma(t))$ is defined as the infimum length of geodesic segments between $\bm{p}$ and $\gamma(t)$, it could be smaller than $l(t)$, which is the length of the segment between $\bm{p}$ and $\gamma(t)$ along a given geodesic $\gamma$. For Hadamard manifolds, on the other hand, there exists a unique geodesic between any two points, hence $d(\bm{p},\gamma(t))=l(t)$ always holds.
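The proximal map introduced earlier in this subsection has a closed form for simple choices of $\sigma$. The Python sketch below is a minimal illustration under several assumptions that are not taken from the paper: the flat case (a tangent space identified with $\mathbb{R}^{m}$), the usual squared-norm penalty $\frac{\lambda}{2}\|\bm{z}-\bm{p}\|^{2}$, and $\sigma=\rho\|\cdot\|$ with an illustrative weight `rho`. The function name `prox_norm` is hypothetical.

```python
import numpy as np


def prox_norm(p, lam, rho=1.0):
    """Minimizer of rho * ||z|| + (lam / 2) * ||z - p||^2 over z in R^m.

    The solution shrinks p towards the origin by the threshold rho / lam
    and returns exactly zero when ||p|| does not exceed that threshold.
    """
    p = np.asarray(p, dtype=float)
    norm_p = np.linalg.norm(p)
    tau = rho / lam  # shrinkage threshold implied by the lam/2 scaling
    if norm_p <= tau:
        return np.zeros_like(p)
    return (1.0 - tau / norm_p) * p
```

Note that with this scaling a larger `lam` means weaker shrinkage, since the quadratic term then pulls the minimizer more strongly towards `p`.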

2.3 Kurdyka–Łojasiewicz (K-L) property on Riemannian manifolds

The Kurdyka–Łojasiewicz (K-L) property plays a crucial role in nonsmooth analysis [48, 49]. In this subsection we extend the K-L property from Euclidean spaces to Riemannian manifolds. To do this, we need to introduce some basic notations. If $A$ is a subset of $\mathcal{M}$ , then the distance between $\bm{x}\in\mathcal{M}$ and $A$ is defined by

$$\mbox{dist}(\bm{x},A):=\inf\{d(\bm{x},\bm{y}):\bm{y}\in A\},$$

when $A$ is nonempty, and $\mbox{dist}(\bm{x},A)=+\infty$ for all $\bm{x}\in\mathcal{M}$ when $A$ is empty. For a fixed $\bm{x}\in\mathcal{M}$, the open ball neighborhood of $\bm{x}$ with radius $\eta$ is defined as $B(\bm{x},\eta):=\{\bm{y}\in\mathcal{M}:d(\bm{x},\bm{y})<\eta\}$.

Definition 3.

Given real scalars $\alpha$ , $\beta$ , and PLS function $\sigma$ , we define

$$[\alpha\leq\sigma\leq\beta]:=\{\bm{x}\in\mathcal{M}:\alpha\leq\sigma(\bm{x})\leq\beta\}.$$

We define similarly $[\alpha<\sigma<\beta]$ .

Now, we define the K-L property.

Definition 4 ([49]).

Let $\sigma:\mathcal{M}\to\mathbb{R}\cup\{+\infty\}$ be a PLS function. The function $\sigma$ is said to have K-L property at $\bar{\bm{x}}\in\mbox{dom}~{}\sigma$ if there exists $\eta\in(0,\infty]$ , a neighborhood $U$ of $\bar{\bm{x}}$ and a continuous concave function $\phi:[0,\eta)\to\mathbb{R}_{+}$ such that

(i) $\phi(0)=0$, $\phi$ is continuously differentiable on $(0,\eta)$ and $\phi^{\prime}(s)>0$ for all $s\in(0,\eta)$;

(ii) the following K-L inequality holds

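$$\phi^{\prime}\big(\sigma(\bm{x})-\sigma(\bar{\bm{x}})\big)\,\mbox{dist}\big(0,\partial\sigma(\bm{x})\big)\geq 1,$$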

$\forall\bm{x}\in U\cap[\sigma(\bar{\bm{x}})<\sigma<\sigma(\bar{\bm{x}})+\eta]$ .

We call $\sigma$ a K-L function if it has K-L property at each point of $\mbox{dom}~{}\sigma$ .

The K-L property basically asserts that function $\sigma$ can be made sharp by a reparameterization of its values using $\phi$ . In particular, when $\sigma$ is differentiable and $\bar{\bm{x}}$ is critical, i.e., $\partial\sigma(\bar{\bm{x}})=0$ , we can define reparameterization $f(\bm{x}):=\phi(\sigma(\bm{x})-\sigma(\bar{\bm{x}}))$ , then the K-L inequality becomes $\|\partial f(\bm{x})\|\geq 1$ , which avoids flatness around $\bar{\bm{x}}$ . This geometrical feature plays a critical role in proving that the sequence generated by our algorithm converges to a critical point. In Proposition 4 of the supplementary, we also establish K-L property in the neighborhood of non-critical points. K-L functions are ubiquitous in a wide range of applications, including for example semi-algebraic, subanalytic, semiconvex, uniformly convex, and log-exp functions [48, 49].
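The step from the K-L inequality to $\|\partial f(\bm{x})\|\geq 1$ follows from the chain rule. A short sketch, assuming $\sigma$ is differentiable near $\bar{\bm{x}}$ so that $\partial\sigma(\bm{x})=\{\mbox{grad}~\sigma(\bm{x})\}$ and $\mbox{dist}(0,\partial\sigma(\bm{x}))=\|\mbox{grad}~\sigma(\bm{x})\|$:

$$
\mbox{grad}~f(\bm{x})=\phi^{\prime}\big(\sigma(\bm{x})-\sigma(\bar{\bm{x}})\big)\,\mbox{grad}~\sigma(\bm{x})
\quad\Longrightarrow\quad
\|\mbox{grad}~f(\bm{x})\|=\phi^{\prime}\big(\sigma(\bm{x})-\sigma(\bar{\bm{x}})\big)\,\mbox{dist}\big(0,\partial\sigma(\bm{x})\big)\geq 1,
$$

where the equality of norms uses $\phi^{\prime}>0$. As an illustrative Euclidean example (not from the paper), take $\sigma(\bm{x})=\|\bm{x}\|^{2}$ on $\mathbb{R}^{n}$ with $\bar{\bm{x}}=\bm{0}$ and $\phi(s)=\sqrt{s}$. Then $f(\bm{x})=\|\bm{x}\|$ and

$$
\phi^{\prime}\big(\sigma(\bm{x})\big)\,\|\mbox{grad}~\sigma(\bm{x})\|=\frac{1}{2\|\bm{x}\|}\cdot 2\|\bm{x}\|=1\quad\text{for all }\bm{x}\neq\bm{0},
$$

so the flat quadratic bowl is reparameterized into a cone with unit slope.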

2.4 Multivariate linear regression with gross errors

Given a matrix representation of $N$ observations $X\in\mathbb{R}^{N\times d}$ , and the corresponding $m$ -dimensional response matrix $Y\in\mathbb{R}^{N\times m}$ , one of the central problems in linear regression is to accurately estimate the regression matrix $V\in\mathbb{R}^{d\times m}$ from

$$Y=XV^{*}+Z,\tag{1}$$

with $Z\in\mathbb{R}^{N\times m}$ being the stochastic noise. In most existing work on linear regression, $Z$ is assumed to be composed of entries following a normal distribution with zero mean. However, when the response $Y$ is subject to possible gross errors, the estimated regression matrix deviates significantly from the true value and becomes unreliable. To deal with this problem, several recent works [12, 13, 14] suggest considering the model

$$Y=XV^{*}+G^{*}+Z,\tag{2}$$

where $G^{*}\in\mathbb{R}^{N\times m}$ is used to explicitly characterize the gross error component. As in practice only a subset of responses are corrupted by gross errors, $G^{*}$ is a sparse matrix whose nonzero entries are unknown and whose magnitudes can be arbitrarily large. Moreover, this model can also be applied to the case where some entries of $Y$ are missing. A commonly used paradigm for estimating $(V^{*},G^{*})$ is to solve the convex optimization problem

$$\min_{V,G}~\frac{1}{2}\|Y-XV-G\|_{F}^{2}+\lambda R_{v}(V)+\rho R_{g}(G),$$

where $\lambda>0$ and $\rho>0$ are tuning parameters, and $R_{v}$ and $R_{g}$ are regularization terms of $V$ and $G$, respectively. Some frequently used regularization norms include the $\ell_{1}$ norm $\|\cdot\|_{1}$, which is the sum of the absolute values of all entries, and the $\ell_{1,2}$ norm, which is the sum of the $\ell_{2}$ norms of the rows of a matrix. For example, in [14] the authors propose to use $R_{v}(V)=\|V\|_{1,2}$ and $R_{g}(G)=\|G\|_{1}$.
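To make the two norms and the regularized least-squares objective concrete, here is a minimal Python sketch. It uses the choice attributed to [14] above ($R_{v}=\|\cdot\|_{1,2}$, $R_{g}=\|\cdot\|_{1}$); the function names and argument order are illustrative assumptions, and no solver is included.

```python
import numpy as np


def l1_norm(a):
    """Sum of the absolute values of all entries."""
    return np.abs(a).sum()


def l12_norm(a):
    """Sum of the l2 norms of the rows of a matrix."""
    return np.linalg.norm(a, axis=1).sum()


def robust_objective(x, y, v, g, lam, rho):
    """0.5 * ||Y - X V - G||_F^2 + lam * ||V||_{1,2} + rho * ||G||_1.

    Shapes: x is (N, d), y and g are (N, m), v is (d, m).
    """
    residual = y - x @ v - g
    return (0.5 * np.linalg.norm(residual, "fro") ** 2
            + lam * l12_norm(v)
            + rho * l1_norm(g))
```

The row-wise $\ell_{1,2}$ penalty on $V$ encourages entire rows (that is, entire covariates) to vanish, while the entry-wise $\ell_{1}$ penalty on $G$ encourages only a few responses to carry a gross error term.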

3 Our Approach

Consider a set of training examples $\{(\bm{x}_{i},\bm{y}_{i})\}_{i=1}^{N}$ , where $\bm{y}_{i}$ lies on a Riemannian manifold $\mathcal{M}$ and $\bm{x}_{i}\in\mathbb{R}^{d}$ is the associated independent variable. We propose a novel extension of the Euclidean model of Eq. (2) to the more general setting of curved spaces, as follows.

3.1 From Euclidean spaces to manifolds

The model of Eq. (2) can be rewritten as $Y-G^{*}=XV^{*}+Z$ . Denote $Y^{c}:=Y-G^{*}$ , which can be interpreted as the corrected response after removing the gross error. The model of Eq. (2) then becomes the standard linear regression of Eq. (1) with response $Y^{c}$ . With this in mind, we extend the idea to regression on manifolds. For each manifold-valued response $\bm{y}_{i}$ , denote by $\bm{y}_{i}^{c}$ its corrected version. Unlike the Euclidean setting, where $Y^{c}$ is obtained from $Y$ simply by a translation, we need to ensure that $\bm{y}_{i}^{c}$ remains on the manifold. This is accomplished by the exponential map $\bm{y}_{i}^{c}=\text{Exp}_{\bm{y}_{i}}(\bm{g}_{i})$ with the gross error $\bm{g}_{i}\in T_{\bm{y}_{i}}\mathcal{M}$ for each training example $i\in\{1,\cdots,N\}$ . Note that when $\mathcal{M}$ is a Euclidean space, the exponential map reduces to addition, $\text{Exp}_{\bm{y}_{i}}(\bm{g}_{i})=\bm{y}_{i}+\bm{g}_{i}$ . In other words, translation in an affine space is a special case of the exponential map in a more general curved space.
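To make the contrast concrete, the sketch below compares a Euclidean correction with a correction through the exponential map. As a stand-in for a general manifold it uses the positive reals, i.e. the one-dimensional SPD manifold $\mathcal{S}_{++}(1)$ with the affine-invariant metric, where $\text{Exp}_{y}(g)=y\exp(g/y)$ ; this toy manifold and the helper names are our assumptions for illustration only.

```python
import math

def exp_euclid(y, g):
    """Exponential map on the real line: plain addition."""
    return y + g

def exp_pos(y, g):
    """Exponential map on the positive reals S_++(1), affine-invariant metric."""
    return y * math.exp(g / y)

def log_pos(y, q):
    """Inverse exponential map at y: the tangent vector g with exp_pos(y, g) == q."""
    return y * math.log(q / y)

y, g = 0.5, -2.0
y_flat = exp_euclid(y, g)   # additive correction leaves the positive half-line
y_corr = exp_pos(y, g)      # geodesic correction stays positive
print(y_flat < 0, y_corr > 0)
print(abs(log_pos(y, y_corr) - g) < 1e-9)  # the correction can be inverted
```

The additive update produces a negative value, which is not a valid point of the toy manifold, while the geodesic update remains on it for any tangent vector.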

As illustrated in Figure 1, we first obtain the corrected manifold-valued response $\bm{y}_{i}^{c}=\text{Exp}_{\bm{y}_{i}}(\bm{g}_{i})$ . Then the relationship between $\bm{x}_{i}$ and $\bm{y}_{i}^{c}$ can be modeled as

[TABLE]

where $\bm{p}\in\mathcal{M}$ , $\{\bm{v}_{j}\}_{j=1}^{d}\subset T_{\bm{p}}\mathcal{M}$ is a set of tangent vectors at $\bm{p}$ , $x_{i}^{j}$ is the $j^{th}$ component of $\bm{x}_{i}$ , and $\bm{z}_{i}$ is a tangent vector at $\text{Exp}\left(\bm{p},\sum_{j=1}^{d}x_{i}^{j}\bm{v}_{j}\right)$ . Our model can be viewed as a generalization of the linear regression model of Eq. (1) from flat spaces to manifolds, where $\bm{p}$ is the intercept, analogous to the origin $0$ of the flat space in Eq. (1), and the exponential map corresponds to the addition operator in Eq. (1).

To measure the training loss, we use

[TABLE]

to denote the sum-of-squared Riemannian distance between the corrected data $\bm{y}_{i}^{c}=\text{Exp}_{\bm{y}_{i}}(\bm{g}_{i})$ and the prediction $\hat{\bm{y}}_{i}=\text{Exp}_{\bm{p}}\left(\sum_{j}x_{i}^{j}\bm{v}_{j}\right)$ , and let $R_{v}$ and $R_{g}$ denote two regularization terms controlling the magnitude of $\{\bm{v}_{j}\}$ and $\{\bm{g}_{i}\}$ , respectively. The problem considered in our paper can now be formulated as the following optimization problem

[TABLE]

where $\lambda\geq 0$ and $\rho\geq 0$ are regularization parameters. Without loss of generality, we consider the regularization terms $R_{v}\left(\{\bm{v}_{j}\}\right):=\sum_{j=1}^{d}\left\lVert\bm{v}_{j}\right\rVert_{\bm{p}}$ and $R_{g}\left(\{\bm{g}_{i}\}\right):=\sum_{i=1}^{N}\left\lVert\bm{g}_{i}\right\rVert_{\bm{y}_{i}}$ , with $\left\lVert\cdot\right\rVert_{\bm{p}}$ and $\left\lVert\cdot\right\rVert_{\bm{y}_{i}}$ being the norms of tangent vectors at $\bm{p}$ and $\bm{y}_{i}$ , respectively. There are two reasons for this choice of $R_{v}$ . First, it lets the problem of Eq. (7) contain multivariate linear regression with feature selection in Euclidean spaces as a special case, as shown in Example 1 and Example 2 below. Second, in many applications one may collect a large set of candidate variables $\{\bm{x}^{j}\}$ for each response and want to find a compact subset of base tangent vectors from $\{\bm{v}_{j}\}$ , together with the corresponding $\{\bm{x}^{j}\}$ , that are significant to the manifold-valued output $\bm{y}$ . The choice of $R_{g}$ is based on the assumption that gross errors are usually sparsely scattered among the data. Now, the optimization problem becomes

[TABLE]
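For intuition, the objective can be evaluated directly once the exponential map, the distance, and the tangent norm are available. The sketch below does so on the toy manifold of positive reals ( $\mathcal{S}_{++}(1)$ with the affine-invariant metric), which is our assumption for illustration. It uses the plain sum of squared distances for $E$ as described in the text, so any constant factor in front of $E$ is not reproduced here.

```python
import math

def exp_pos(p, v):
    """Exponential map on the positive reals."""
    return p * math.exp(v / p)

def dist_pos(a, b):
    """Geodesic distance between two positive reals."""
    return abs(math.log(b / a))

def norm_pos(p, v):
    """Norm of the tangent vector v at the point p."""
    return abs(v) / p

def objective(p, V, G, X, Y, lam, rho):
    """E + lam * R_v + rho * R_g for intercept p, tangent vectors V, gross errors G."""
    E = 0.0
    for x_i, y_i, g_i in zip(X, Y, G):
        y_corr = exp_pos(y_i, g_i)                               # corrected response
        y_hat = exp_pos(p, sum(x * v for x, v in zip(x_i, V)))   # prediction
        E += dist_pos(y_corr, y_hat) ** 2
    R_v = sum(norm_pos(p, v) for v in V)
    R_g = sum(norm_pos(y_i, g_i) for y_i, g_i in zip(Y, G))
    return E + lam * R_v + rho * R_g

# Noise-free data generated by the model itself: E vanishes at the true parameters.
p, V = 2.0, [0.6, -0.4]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [exp_pos(p, sum(x * v for x, v in zip(x_i, V))) for x_i in X]
G = [0.0, 0.0, 0.0]
print(objective(p, V, G, X, Y, lam=0.1, rho=0.5))
```

At the true parameters only the penalty $\lambda R_{v}$ remains, which shows how the regularizer trades data fit against the size of the tangent vectors.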

3.2 Connections to existing works

We would like to point out that the model in Eq. (11) includes as special cases a number of related works on gross errors or on manifold-valued regression. In this subsection, we provide three such examples.

Example 1.

When $\mathcal{M}=\mathbb{R}^{m}$ , we can establish a connection between the model of Eq. (11) and the robust multi-task regression studied in [14]. Specifically, instead of optimizing Eq. (11) over $\bm{p}\in\mathbb{R}^{m}$ we select $\bm{p}=0$ , resulting in

[TABLE]

which can be rewritten as

[TABLE]

where $\|\cdot\|$ becomes the usual Euclidean norm, $Y=[\bm{y}_{1},\cdots,\bm{y}_{N}]^{\top}\in\mathbb{R}^{N\times m}$ , $X=[\bm{x}_{1},\cdots,\bm{x}_{N}]^{\top}\in\mathbb{R}^{N\times d}$ , $V=[\bm{v}_{1},\cdots,\bm{v}_{d}]^{\top}\in\mathbb{R}^{d\times m}$ and $G=[\bm{g}_{1},\cdots,\bm{g}_{N}]^{\top}\in\mathbb{R}^{N\times m}$ . The resulting model of Eq. (12) is exactly the one considered in [14] except that regularization term $\left\lVert G\right\rVert_{1}$ in [14] is replaced by $\left\lVert G\right\rVert_{1,2}$ here.

Example 2.

If $\mathcal{M}=\mathbb{R}^{m}$ , we can show by Fermat’s rule that the optimal solution $\tilde{\bm{p}}$ is given by $\tilde{\bm{p}}=\frac{1}{N}\sum\limits_{i=1}^{N}(\bm{y}_{i}+\bm{g}_{i}-\sum_{j}x_{i}^{j}\bm{v}_{j})$ . By substituting $\tilde{\bm{p}}$ into the problem of Eq. (11) and assuming that $\{(\bm{x}_{i},\bm{y}_{i})\}$ has empirical mean $0$ , that is, $\sum_{i=1}^{N}\bm{x}_{i}=0$ and $\sum_{i=1}^{N}\bm{y}_{i}=0$ , the optimization problem of Eq. (11) reduces to

[TABLE]

which can be reformulated as

[TABLE]

where $\mathbbm{1}_{N}\in\mathbb{R}^{N}$ is a column vector with all entries being 1.

The difference between Example 1 and Example 2 is that the former is obtained by fixing $\bm{p}=0$ , while the latter optimizes over $\bm{p}$ , exactly as in the model of Eq. (11). The resulting models are quite similar, except that the model of Example 2 needs to center the variable $G$ .
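The centering of $G$ in Example 2 can be traced in a short derivation. We write the Euclidean data term as a sum of squared residuals; a constant factor in front of it does not affect the argument, and $\bar{\bm{g}}$ is our shorthand for the mean of the $\bm{g}_{i}$ .

```latex
\begin{aligned}
0 &= \nabla_{\bm{p}} \sum_{i=1}^{N} \Big\| \bm{y}_i + \bm{g}_i - \bm{p} - \sum_{j} x_i^j \bm{v}_j \Big\|^2
   = -2 \sum_{i=1}^{N} \Big( \bm{y}_i + \bm{g}_i - \bm{p} - \sum_{j} x_i^j \bm{v}_j \Big) \\
\Longrightarrow\quad
\tilde{\bm{p}} &= \frac{1}{N} \sum_{i=1}^{N} \Big( \bm{y}_i + \bm{g}_i - \sum_{j} x_i^j \bm{v}_j \Big)
   = \frac{1}{N} \sum_{i=1}^{N} \bm{g}_i =: \bar{\bm{g}}
   \quad \text{when } \sum_i \bm{x}_i = 0,\ \sum_i \bm{y}_i = 0, \\
\bm{y}_i + \bm{g}_i - \tilde{\bm{p}} - \sum_{j} x_i^j \bm{v}_j
   &= \bm{y}_i + (\bm{g}_i - \bar{\bm{g}}) - \sum_{j} x_i^j \bm{v}_j .
\end{aligned}
```

In matrix form the residual involves $\big(I-\tfrac{1}{N}\mathbbm{1}_{N}\mathbbm{1}_{N}^{\top}\big)G$ , i.e. the row-centered version of $G$ , which is where the vector $\mathbbm{1}_{N}$ enters.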

Example 3.

If we let $\lambda=0$ and $\rho=+\infty$ , then optimization problem of Eq. (11) reduces to

[TABLE]

which recovers exactly the model considered in [2]. In this regard, the MGLM model in [2] is a special case of our model.

3.3 PALM for optimization on Hadamard manifolds

In this subsection, we propose a new algorithm to solve the optimization problem of Eq. (11), which is a nonsmooth optimization problem on Hadamard manifolds. As explained in detail in subsection 3.4, the problem of Eq. (11) admits the form

[TABLE]

where $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ are Hadamard manifolds, $f:\mathcal{M}_{1}\rightarrow\mathbb{R}\cup\{+\infty\}$ and $g:\mathcal{M}_{2}\rightarrow\mathbb{R}\cup\{+\infty\}$ are PLS functions, and $h:\mathcal{M}_{1}\times\mathcal{M}_{2}\rightarrow\mathbb{R}$ is a smooth function.

Many existing optimization techniques are developed for Euclidean spaces and are thus not directly applicable to curved manifolds. Meanwhile, optimization on manifolds has drawn an increasing amount of attention [50]. For smooth optimization, classical techniques such as gradient, conjugate gradient, and trust-region methods have been generalized to the manifold setting [50, 51, 52, 53]; they are, however, not suitable for the nonconvex and nonsmooth manifold-based problem of Eq. (11). For nonsmooth optimization, there exist many prior works [54, 55, 56, 57]. Unfortunately, they either cannot exploit the composite structure in Eq. (14) (e.g., [54, 55, 57]) or fail to guarantee convergence (e.g., [56]).

Recently, a proximal alternating linearized minimization (PALM) algorithm has been proposed in [16] for optimization problem of Eq. (14) with $\mathcal{M}_{1}=\mathbb{R}^{n}$ and $\mathcal{M}_{2}=\mathbb{R}^{m}$ . Inspired by the success of PALM in the Euclidean setting, in what follows we propose PALMR, an inexact proximal alternating minimization algorithm for problem of Eq. (14).

We alternately solve the following two proximally linearized subproblems

[TABLE]

where $c_{k}=\mu_{1}L_{1}(\bm{y}^{k})$ and $d_{k}=\mu_{2}L_{2}(\bm{x}^{k+1})$ with $\mu_{1}>1$ , $\mu_{2}>1$ , and $L_{1}(\bm{y}^{k})$ , $L_{2}(\bm{x}^{k+1})$ are the Lipschitz constants of $\partial_{\bm{x}}h$ and $\partial_{\bm{y}}h$ , respectively, as specified in Assumption 1. In particular, since $\mathcal{M}_{1}$ is a Hadamard manifold, on which any two points are joined by a unique geodesic, there is a one-to-one mapping between $\bm{v}\in T_{\bm{x}^{k}}\mathcal{M}_{1}$ and $\bm{x}\in\mathcal{M}_{1}$ such that $\bm{x}=\mbox{Exp}_{\bm{x}^{k}}(\bm{v})$ , $\bm{v}=\mbox{Exp}_{\bm{x}^{k}}^{-1}\bm{x}$ and $d_{\mathcal{M}_{1}}(\bm{x}^{k},\bm{x})=\|\bm{v}\|$ . Thus, a simple substitution reformulates Eq. (15) as

[TABLE]

or equivalently,

[TABLE]

which is an optimization problem in the linear space $T_{\bm{x}^{k}}\mathcal{M}_{1}$ ; as a result, we have $\bm{x}^{k+1}=\mbox{Exp}_{\bm{x}^{k}}(\bm{v}^{k})$ . Since $f$ is PLS with $\inf_{\bm{x}\in\mathcal{M}}f(\bm{x})>-\infty$ and $\mbox{Exp}_{\bm{x}^{k}}$ is smooth, the composite function $f\circ\mbox{Exp}_{\bm{x}^{k}}$ is PLS and $\inf_{\bm{v}\in T_{\bm{x}^{k}}\mathcal{M}}f\circ\mbox{Exp}_{\bm{x}^{k}}(\bm{v})>-\infty$ , which, together with Theorem 1.25 of [58], implies that $\bm{v}^{k}$ is well-defined. Moreover, the above optimization problem for $\bm{v}^{k}$ is known as the proximity operator [59], denoted as

[TABLE]

Similar claims apply to the problem of Eq. (16), implying that $\bm{x}^{k+1}$ and $\bm{y}^{k+1}$ are well-defined. Solving Eqs. (15) and (16) alternately yields the algorithm PALMR outlined in Algorithm 1.
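The alternating scheme is easiest to see in the Euclidean special case, where the exponential map is addition and the scheme coincides with PALM of [16]. The sketch below is a toy instance and not Algorithm 1 itself: we assume $f(x)=\lambda|x|$ , $g(\bm{y})=\rho\|\bm{y}\|_{1}$ and $h(x,\bm{y})=\frac{1}{2}\sum_{i}(b_{i}-a_{i}x-y_{i})^{2}$ , so both proximal steps are soft-thresholding. All names and data are hypothetical.

```python
def soft(t, tau):
    """Soft-thresholding, the proximity operator of tau * |.|."""
    if t > tau:
        return t - tau
    if t < -tau:
        return t + tau
    return 0.0

def objective(x, y, a, b, lam, rho):
    """Psi(x, y) = f(x) + g(y) + h(x, y) for the toy problem."""
    h = 0.5 * sum((bi - ai * x - yi) ** 2 for ai, bi, yi in zip(a, b, y))
    return lam * abs(x) + rho * sum(abs(yi) for yi in y) + h

def palm(a, b, lam, rho, iters=200, mu1=1.1, mu2=1.1):
    x, y = 0.0, [0.0] * len(a)
    c = mu1 * sum(ai * ai for ai in a)  # mu1 * L1, with L1 = sum_i a_i^2
    d = mu2 * 1.0                       # mu2 * L2, with L2 = 1
    history = [objective(x, y, a, b, lam, rho)]
    for _ in range(iters):
        # x-step: linearize h at (x, y), then apply the prox of f.
        grad_x = -sum(ai * (bi - ai * x - yi) for ai, bi, yi in zip(a, b, y))
        x = soft(x - grad_x / c, lam / c)
        # y-step: linearize h at the updated x, then apply the prox of g.
        grad_y = [-(bi - ai * x - yi) for ai, bi, yi in zip(a, b, y)]
        y = [soft(yi - gi / d, rho / d) for yi, gi in zip(y, grad_y)]
        history.append(objective(x, y, a, b, lam, rho))
    return x, y, history

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 16.0, 8.0, 10.0]  # b = 2a, with a gross error of 10 on the third sample
x, y, history = palm(a, b, lam=0.1, rho=1.0)
print(x, y)
```

Because $c_{k}$ and $d_{k}$ exceed the block Lipschitz constants, the objective values in `history` do not increase, and the sparse variable `y` absorbs the outlier in the third sample.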

To analyze the convergence of PALMR, we need the following assumptions.

Assumption 1.

$\Psi(\bm{x},\bm{y})$ satisfies the following conditions:

(i)

$\inf f>-\infty$ , $\inf g>-\infty$ and $\inf\Psi>-\infty$ .

(ii)

For any fixed $\bm{y}$ , the function $\bm{x}\to h(\bm{x},\bm{y})$ has $L_{1}(\bm{y})$ -Lipschitz gradient. Likewise, for any fixed $\bm{x}$ , the function $\bm{y}\to h(\bm{x},\bm{y})$ has $L_{2}(\bm{x})$ -Lipschitz gradient. Moreover, there exist real scalars $\lambda_{i}^{-},\lambda_{i}^{+}>0$ for $i=1,2$ , such that

[TABLE]

(iii)

$\partial h$ is Lipschitz continuous on bounded subsets of $\mathcal{M}_{1}\times\mathcal{M}_{2}$ . More specifically, for any bounded subset $A_{1}\times A_{2}\subset\mathcal{M}_{1}\times\mathcal{M}_{2}$ , there exists a constant $L>0$ such that for all $(\bm{x}_{i},\bm{y}_{i})\in A_{1}\times A_{2}$ , $i=1,2$ , we have

[TABLE]

(iv)

$\Psi(\bm{x},\bm{y})$ has the Kurdyka–Łojasiewicz (K-L) property on Hadamard manifolds.

Assumption (i) ensures that the proximal operators in Eqs. (15) and (16) are well-defined, and hence that algorithm PALMR is well-defined. Assumption (ii) states that $h$ has block-wise Lipschitz continuous partial gradients; the bounds on the Lipschitz constants ensure a sufficient decrease of the objective function value over iterations. Assumption (iii) requires the gradient of $h$ to be Lipschitz continuous on bounded sets, which is used to derive a lower bound for the iteration gap $d(\bm{x}^{k+1},\bm{x}^{k})$ + $d(\bm{y}^{k+1},\bm{y}^{k})$ . Assumption (iv) guarantees that $\{(\bm{x}^{k},\bm{y}^{k})\}$ forms a Cauchy sequence.

Under Assumption 1 we have the following theorem, whose proof is provided in the supplementary.

Theorem 1.

Suppose Assumption 1 holds. Let $\{(\bm{x}^{k},\bm{y}^{k})\}_{k\in\mathbb{N}}$ be a sequence generated by PALMR. Then either the sequence $\{d_{\mathcal{M}_{1}\times\mathcal{M}_{2}}((\bm{x}^{0},\bm{y}^{0}),(\bm{x}^{k},\bm{y}^{k}))\}$ is unbounded or the following assertions hold:

1)

The sequence $\{(\bm{x}^{k},\bm{y}^{k})\}_{k\in\mathbb{N}}$ has finite length, i.e. $\sum\limits_{k}d_{\mathcal{M}_{1}}(\bm{x}^{k+1},\bm{x}^{k})<\infty$ and $\sum\limits_{k}d_{\mathcal{M}_{2}}(\bm{y}^{k+1},\bm{y}^{k})<\infty$ .

2)

The sequence $\{(\bm{x}^{k},\bm{y}^{k})\}_{k\in\mathbb{N}}$ converges to a critical point $(\bm{x}^{*},\bm{y}^{*})$ of $\Psi$ .

Theorem 1 shows that the sequence $\{(\bm{x}^{k},\bm{y}^{k})\}$ generated by PALMR converges to a critical point of $\Psi$ , provided that the sequence is bounded. As shown in [16], this boundedness assumption holds in many scenarios. For example, when the functions $f$ and $g$ are convex and $h(\bm{x},\bm{y})=\|A\bm{x}-B\bm{y}\|$ where $A$ and $B$ are matrices, the sequence $\{(\bm{x}^{k},\bm{y}^{k})\}$ is bounded.

In what follows, we specifically investigate the dedicated realization of PALMR to solve the optimization problem of Eq. (11). To simplify the notation, the resulting algorithm is also referred to as PALMR when there is no confusion.

3.4 Applying PALMR to optimization problem of Eq. (11)

Optimization problem of Eq. (11) in our context can be reformulated as

[TABLE]

which is of the form in Eq. (14) with $\mathcal{M}_{1}=\mathcal{M}\times T\mathcal{M}$ and $\mathcal{M}_{2}=T_{\bm{y}_{1}}\mathcal{M}\times\cdots\times T_{\bm{y}_{N}}\mathcal{M}$ . To apply PALMR to solve problem of Eq. (11), we need to evaluate the gradients of $E(\bm{p},\{\bm{v}_{j}\},\{\bm{g}_{i}\})$ . To simplify the notation, we further denote the prediction $\hat{\bm{y}}_{i}:=\mathrm{Exp}_{\bm{p}}\big{(}\sum_{j}x_{i}^{j}\bm{v}_{j}\big{)}$ , as well as the derivatives of the exponential map with respect to $\bm{p}$ and $\bm{v}$ as $d_{\bm{p}}\mathrm{Exp}_{\bm{p}}(\bm{v})$ and $d_{\bm{v}}\mathrm{Exp}_{\bm{p}}(\bm{v})$ , respectively. Now, the partial gradient of $E$ with respect to $\bm{p}$ amounts to

[TABLE]

where $(\cdot)^{{\dagger}}$ is the adjoint derivative of the exponential map [28] defined by $\big{\langle}\bm{\mu},d_{\bm{p}}\mathrm{Exp}_{\bm{p}}(\bm{v})\bm{w}\big{\rangle}_{\mathrm{Exp}_{\bm{p}}(\bm{v})}=\big{\langle}\big{(}d_{\bm{p}}\mathrm{Exp}_{\bm{p}}(\bm{v})\big{)}^{{\dagger}}\bm{\mu},\bm{w}\big{\rangle}_{\bm{p}}$ with $\bm{\mu}\in T_{\mathrm{Exp}_{\bm{p}}(\bm{v})}\mathcal{M}$ , $\bm{w}\in T_{\bm{p}}\mathcal{M}$ . The adjoint derivative operator maps $\mathrm{Exp}_{\hat{\bm{y}}_{i}}^{-1}(\bm{y}_{i}^{c})$ from the tangent space at $\hat{\bm{y}}_{i}$ to the tangent space at $\bm{p}$ . Thus $\partial_{\bm{p}}E\in T_{\bm{p}}\mathcal{M}$ . Similarly, the partial gradients of $E$ with respect to $\bm{v}_{j}$ and $\bm{g}_{i}$ are given by

[TABLE]

and

[TABLE]

respectively.

The PALMR algorithm for problem of Eq. (11) proceeds as follows: To update $(\bm{p},\{\bm{v}_{j}\})$ , we let $\partial_{\bm{p}}E^{k}:=\partial_{\bm{p}}E(\bm{p}^{k},\{\bm{v}_{j}^{k}\},\{\bm{g}_{i}^{k}\})$ and $\partial_{\bm{v}_{j}}E^{k}:=\partial_{\bm{v}_{j}}E(\bm{p}^{k},\{\bm{v}_{j}^{k}\},\{\bm{g}_{i}^{k}\})$ and solve

[TABLE]

where $P_{\bm{p}\bm{p}^{k}}$ is the parallel transport from $\bm{p}$ to $\bm{p}^{k}$ along the unique geodesic between them. Because of the constraint $\bm{v}_{j}\in T_{\bm{p}}\mathcal{M}$ , it is difficult to solve for $\bm{p}$ and $\{\bm{v}_{j}\}$ jointly. Instead, the above subproblem is solved by alternating minimization over $\bm{p}$ and $\{\bm{v}_{j}\}$ . Specifically, to update $\bm{p}$ , we solve

[TABLE]

which, by a change of variable $\bm{u}=\mathrm{Exp}_{\bm{p}^{k}}^{-1}\bm{p}$ , is equivalent to solving

[TABLE]

and $\bm{p}^{k+1}=\mathrm{Exp}_{\bm{p}^{k}}(\bm{u}^{k})$ .

To update $\{\bm{v}_{j}\}$ , we need to first obtain $\hat{\bm{v}}_{j}^{k}$ by

[TABLE]

where $\bm{s}_{j}^{k}=\bm{v}_{j}^{k}-\frac{1}{c_{k}}\partial_{\bm{v}_{j}}E^{k}$ . Notice that the above optimization problem has the closed-form solution

[TABLE]

where $(\alpha)_{+}=\alpha$ if $\alpha>0$ and $0$ otherwise. Since $\{\hat{\bm{v}}_{j}^{k}\}$ lie in the tangent space at $\bm{p}^{k}$ , we need to parallel transport them to $T_{\bm{p}^{k+1}}\mathcal{M}$ by $\bm{v}_{j}^{k+1}=P_{\bm{p}^{k}\bm{p}^{k+1}}(\hat{\bm{v}}_{j}^{k})$ along the unique geodesic between $\bm{p}^{k}$ and $\bm{p}^{k+1}$ .
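A closed-form update of this kind is the standard proximity operator of a norm, often called group soft-thresholding: the vector $\bm{s}$ is shrunk towards zero by $\lambda/c$ in norm and set to zero if its norm is below that threshold. The sketch below assumes this standard form and, for simplicity, uses the Euclidean norm as a stand-in for the tangent-space norm $\|\cdot\|_{\bm{p}^{k}}$ ; on $\mathcal{S}_{++}(n)$ the Riemannian norm at $\bm{p}^{k}$ would be used instead. The function names are ours.

```python
import math

def positive_part(alpha):
    """(alpha)_+ : alpha if alpha > 0, and 0 otherwise."""
    return alpha if alpha > 0 else 0.0

def group_soft_threshold(s, lam, c):
    """Minimizer of lam * ||v|| + (c / 2) * ||v - s||^2 over v (Euclidean norm)."""
    norm = math.sqrt(sum(a * a for a in s))
    if norm == 0.0:
        return [0.0 for _ in s]
    scale = positive_part(1.0 - lam / (c * norm))
    return [scale * a for a in s]

print(group_soft_threshold([3.0, 4.0], lam=1.0, c=1.0))  # shrunk, same direction
print(group_soft_threshold([0.3, 0.4], lam=1.0, c=1.0))  # norm below lam / c: all zeros
```

Whole vectors are zeroed out at once, which is what makes the regularizer $R_{v}$ act as a selector of the tangent vectors $\bm{v}_{j}$ .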

Similarly, update $\{\bm{g}_{i}\}$ by

[TABLE]

where $\bm{t}_{i}^{k}=\bm{g}_{i}^{k}-\frac{1}{e_{k}}\partial_{\bm{g}_{i}}E(\bm{p}^{k+1},\{\bm{v}_{j}^{k+1}\},\{\bm{g}_{i}^{k}\})$ .

Now, we are ready to present our algorithm for multivariate regression with grossly corrupted manifold-valued data, as shown in Algorithm 2. Notice that when letting $\lambda=0$ and $\rho=+\infty$ , Algorithm 2 alternately updates the values of $\bm{p}$ and $\bm{v}_{j}$ via three steps: (1) $\bm{p}^{k+1}=\mathrm{Exp}_{\bm{p}^{k}}\left(-\frac{1}{c_{k}}\partial_{\bm{p}}E^{k}\right)$ , (2) $\hat{\bm{v}}_{j}^{k}=\bm{v}_{j}^{k}-\frac{1}{c_{k}}\partial_{\bm{v}_{j}}E^{k}$ , (3) $\bm{v}_{j}^{k+1}=P_{\bm{p}^{k}\bm{p}^{k+1}}(\hat{\bm{v}}_{j}^{k})$ , which recovers the gradient descent method proposed in [2].
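The three steps can be traced on a toy example. The sketch below again uses the positive reals ( $\mathcal{S}_{++}(1)$ with the affine-invariant metric) with a single regressor, and it replaces the exact gradients by the parallel-transport surrogates discussed in subsection 3.5. The manifold, the gradient expressions, the step size and all names are assumptions made for illustration, not the authors' implementation.

```python
import math

def exp_pos(p, v):
    """Exponential map on the positive reals."""
    return p * math.exp(v / p)

def log_pos(p, q):
    """Inverse exponential map at p."""
    return p * math.log(q / p)

def transport(p, q, v):
    """Parallel transport of the tangent vector v from T_p to T_q."""
    return v * q / p

def gradient_step(p, v, xs, ys, c):
    grad_p, grad_v = 0.0, 0.0
    for x, y in zip(xs, ys):
        y_hat = exp_pos(p, x * v)
        # Residual tangent vector at the prediction, moved back to T_p.
        r = transport(y_hat, p, log_pos(y_hat, y))
        grad_p -= r
        grad_v -= x * r
    p_new = exp_pos(p, -grad_p / c)      # step (1): move the intercept
    v_hat = v - grad_v / c               # step (2): gradient step in T_p
    v_new = transport(p, p_new, v_hat)   # step (3): transport to T_{p_new}
    return p_new, v_new

# Noise-free data from p = 2, v = 1; start far from the truth.
xs = [-1.0, 0.0, 1.0]
ys = [exp_pos(2.0, x * 1.0) for x in xs]
p, v = 1.0, 0.0
for _ in range(200):
    p, v = gradient_step(p, v, xs, ys, c=4.0)
print(p, v)
```

On this toy problem the iterates approach the generating parameters; step (3) is what keeps $\bm{v}$ a valid tangent vector at the new base point.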

3.5 Implementation of Algorithm 2

During each iteration of Algorithm 2, the partial derivatives $\partial_{\bm{p}}E$ , $\partial_{\bm{v}_{j}}E$ and $\partial_{\bm{g}_{i}}E$ of Eqs. (17), (18), and (19) are evaluated. Their detailed derivations are provided in Section 3 of the supplementary file. Nevertheless, these terms could be practically intractable to compute for some manifolds, due to the presence of adjoint derivatives of the exponential map. As a remedy to this issue, we adopt the variational technique of [2, 60] for computing derivatives, which basically replaces the adjoint derivative operators by parallel transports:

[TABLE]

One advantage of such approximation is that for some special manifolds, including manifold of SPD matrices $\mathcal{S}_{++}(n)$ , parallel transports have analytical expressions and can be computed directly. For general manifolds that have no analytical expressions for parallel transports, approximation approaches such as Schild’s ladder approximation [61, 62] can be used. The method approximates parallel transport by constructing geodesic parallelograms, which requires three exponential maps and two inverse exponential maps, as shown in Figure 2.

4 Experiments

In this section, we empirically evaluate the performance of the proposed approach (i.e. PALMR) on synthetic and real DTI data sets, which lie on the manifold $\mathcal{S}_{++}(3)$ of SPD matrices. Throughout all experiments, we fix $\lambda=0.1$ and choose the optimal $\rho$ from the set $\{0.05,0.1,\cdots,0.95,1\}$ by validation on a data set with the same number of data points as the testing data. As our algorithm is iterative by nature, in practice it stops when either of two stopping criteria is met: (1) the difference between consecutive objective function values is below $10^{-5}$ , or (2) the maximum number of iterations (100) is reached.
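The candidate grid for $\rho$ and the two stopping criteria can be sketched as follows. The helper names `rho_grid` and `run_until_converged`, and the generic `step` and `objective` callables, are hypothetical placeholders for one iteration of the solver and its objective value.

```python
def rho_grid():
    """Candidate values 0.05, 0.10, ..., 0.95, 1.00 for rho."""
    return [round(0.05 * k, 2) for k in range(1, 21)]

def run_until_converged(step, objective, state, tol=1e-5, max_iter=100):
    """Iterate `step` until the objective changes by less than `tol`
    or `max_iter` iterations have been performed."""
    prev = objective(state)
    for k in range(1, max_iter + 1):
        state = step(state)
        cur = objective(state)
        if abs(prev - cur) < tol:
            return state, k
        prev = cur
    return state, max_iter

# Toy usage: halve a scalar at every step and monitor its square.
state, n_iter = run_until_converged(lambda s: s / 2.0, lambda s: s * s, 1.0)
print(n_iter, state)
```

In a validation loop, one would run the solver once per value in `rho_grid()` and keep the value with the smallest validation error.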

4.1 Synthetic DTI data

Synthetic DTI data sets are constructed with known ground truths and gross errors as follows. First, we randomly generate $\bm{p}\in\mathcal{S}_{++}(3)$ , symmetric matrices $\{\bm{v}_{j}\}_{j=1}^{d}\subseteq\mathcal{S}(3)$ and $\{\bm{x}_{i}\}_{i=1}^{N}\subseteq\mathbb{R}^{d}$ , where the entries of $\bm{x}_{i}$ are sampled from the standard normal distribution $\mathcal{N}(0,1)$ . The ground-truth DTI data are then obtained as $\bm{y}^{t}_{i}:=\text{Exp}_{\bm{p}}\left(\sum_{j=1}^{d}x_{i}^{j}\bm{v}_{j}\right)$ . Next, DTI data with stochastic noise are generated as $\bm{y}^{s}_{i}:=\text{Exp}_{\bm{y}^{t}_{i}}\left(\bm{z}_{i}\right)$ , where $\bm{z}_{i}$ is a random matrix in $\mathcal{S}(3)$ whose entries are sampled from $\mathcal{N}(0,1)$ and which satisfies $\left\lVert\bm{z}_{i}\right\rVert_{\bm{y}^{t}_{i}}\leq 0.1$ . Meanwhile, the gross errors are generated by a two-step process: (a) Randomly select an index subset $I_{g}$ from $\{1,2,\cdots,N\}$ such that $|I_{g}|=\beta N$ with $0\leq\beta\leq 1$ . (b) For $i\in I_{g}$ , the grossly corrupted response is obtained as $\bm{y}_{i}=\text{Exp}_{\bm{y}^{s}_{i}}\left(\bm{g}_{i}\right)$ , where $\bm{g}_{i}$ is a random matrix in $\mathcal{S}(3)$ satisfying $\left\lVert\bm{g}_{i}\right\rVert_{\bm{y}^{s}_{i}}=\sigma_{g}$ . The rest of the training data remain unchanged, i.e. $\bm{y}_{i}=\bm{y}^{s}_{i}$ for $i\notin I_{g}$ . Thus, among all $N$ manifold-valued data, the fraction of grossly corrupted data is $\beta$ . With the same $\bm{p}$ and $\{\bm{v}_{j}\}$ , we also generate $N_{t}$ pairs of testing data $\{(\bm{x}^{test}_{i},\bm{y}^{test}_{i})\}$ and validation data.
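The generation procedure can be mimicked in one dimension. The sketch below is a scalar analogue on the positive reals ( $\mathcal{S}_{++}(1)$ with the affine-invariant metric) with $d=1$ , not the $\mathcal{S}_{++}(3)$ generator used in the experiments. The rescaling used to enforce the noise bound and the random sign of the gross error are our assumptions, since the text does not specify how these constraints are imposed.

```python
import math
import random

def exp_pos(p, v):
    """Exponential map on the positive reals."""
    return p * math.exp(v / p)

def norm_pos(p, v):
    """Norm of the tangent vector v at the point p."""
    return abs(v) / p

def make_synthetic(N, beta, sigma_g, p=2.0, v=1.0, seed=0):
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
    y_true = [exp_pos(p, x * v) for x in xs]
    # Stochastic noise, rescaled so that its norm is at most 0.1 (assumption).
    y_noisy = []
    for yt in y_true:
        z = rng.gauss(0.0, 1.0)
        if norm_pos(yt, z) > 0.1:
            z *= 0.1 / norm_pos(yt, z)
        y_noisy.append(exp_pos(yt, z))
    # (a) pick beta * N indices; (b) corrupt them by a tangent vector of norm sigma_g.
    gross_idx = set(rng.sample(range(N), int(round(beta * N))))
    ys = []
    for i, ys_i in enumerate(y_noisy):
        if i in gross_idx:
            g = rng.choice([-1.0, 1.0]) * sigma_g * ys_i  # |g| / ys_i == sigma_g
            ys.append(exp_pos(ys_i, g))
        else:
            ys.append(ys_i)
    return xs, y_true, y_noisy, ys, gross_idx

xs, y_true, y_noisy, ys, gross_idx = make_synthetic(N=50, beta=0.4, sigma_g=5.0)
print(len(gross_idx))
```

Each corrupted response lies at geodesic distance $\sigma_{g}$ from its noisy version, while the uncorrupted responses are left untouched.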

We first conduct experiments on a data set with $d=2$ , $N=50$ , $\beta=40\%$ and $\sigma_{g}=5$ , and compare with the multivariate general linear model (MGLM) of [2], which does not account for gross errors. Training samples are displayed in Figure 3, where we also show the predictions of PALMR and MGLM on the training data. Visual results of PALMR and MGLM on 20 testing data and the training data correction by PALMR are presented in Figure 4(a) and Figure 4(b), respectively. Collectively, the results suggest that PALMR is indeed capable of correctly identifying the gross errors during training, which leads to a better-behaved model. Figure 4(b) shows that PALMR can effectively recover the original data (i.e. true data without gross error). It also produces improved regression results on testing data, as displayed in Figure 4(a).

Next we quantitatively evaluate the effect of varying the experimental parameters, namely the number of independent variables $d$ , the number of training data $N$ , the magnitude of gross error $\sigma_{g}$ , and the percentage of grossly corrupted training data $\beta$ . To see the effect of one specific parameter, synthetic DTI data are generated by varying this parameter while keeping the remaining parameters at their default values. The following default values are used: $d=2$ , $N=50$ , $\beta=20\%$ , and $\sigma_{g}=1$ . To evaluate the performance of PALMR, the following mean squared geodesic error (MSGE) metrics are considered: $\text{MSGE}_{train}:=\frac{1}{N}\sum_{i}d^{2}(\bm{y}_{i},\hat{\bm{y}}_{i})$ , $\text{MSGE}_{test}:=\frac{1}{N_{t}}\sum_{i}d^{2}(\bm{y}^{test}_{i},\hat{\bm{y}}^{test}_{i})$ , $\text{MSGE}_{\bm{p}}:=d^{2}(\bm{p},\tilde{\bm{p}})$ , $\text{MSGE}_{V}:=\frac{1}{d}\sum_{j}\|\bm{v}_{j}-P_{\tilde{\bm{p}}\bm{p}}(\tilde{\bm{v}}_{j})\|_{\bm{p}}^{2}$ , and $\text{MSGE}_{G}:=\frac{1}{N}\sum_{i}\|\mbox{Exp}_{\bm{y}_{i}}^{-1}(\bm{y}^{s}_{i})-\tilde{\bm{g}}_{i}\|_{\bm{y}_{i}}^{2}$ , where $\tilde{\bm{p}}$ , $\tilde{\bm{v}}_{j}$ and $\tilde{\bm{g}}_{i}$ are the outputs of Algorithm 2. The data correction error is measured as $\frac{1}{N}\sum_{i}d(\bm{y}^{s}_{i},\bm{y}^{c}_{i})^{2}$ . In addition, we say that the gross error $\bm{g}_{i}$ is correctly identified if $\bm{g}_{i}$ and $\tilde{\bm{g}}_{i}$ are either both zero or both nonzero, and compute the rate $Rate_{G}:=\mbox{{number of correctly identified gross errors}}/N$ . Results averaged over 10 repetitions are presented in Figure 5, where each column corresponds to the effect of one parameter and each row corresponds to the results using one metric.
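Two of these metrics are sketched below: a generic MSGE that takes the geodesic distance as an argument, and the identification rate $Rate_{G}$ computed from the norms of the true and estimated gross errors. The function names and the optional tolerance `tol` for deciding whether a norm counts as zero are our assumptions.

```python
import math

def msge(dist, ys, y_hats):
    """Mean squared geodesic error for a user-supplied distance function."""
    return sum(dist(a, b) ** 2 for a, b in zip(ys, y_hats)) / len(ys)

def rate_g(true_norms, est_norms, tol=0.0):
    """Fraction of samples whose true and estimated gross errors are
    both zero or both nonzero (a norm above tol counts as nonzero)."""
    hits = sum((a > tol) == (b > tol) for a, b in zip(true_norms, est_norms))
    return hits / len(true_norms)

# Toy usage with the geodesic distance on the positive reals.
dist_pos = lambda a, b: abs(math.log(b / a))
print(msge(dist_pos, [1.0, math.e], [math.e, math.e]))
print(rate_g([0.0, 1.0, 2.0, 0.0], [0.0, 0.5, 0.0, 0.1]))  # 0.5
```

For SPD data, `dist` would be the geodesic distance on $\mathcal{S}_{++}(3)$ and the norms would be the Riemannian norms at the corresponding responses.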

From Figure 5, we make four observations: (1) PALMR has lower MSGE for all values of $d$ , and our correction performs well on training data, cf. column Figure 5(a). (2) PALMR has a large advantage over MGLM for all values of the training size ( $N$ ) and the magnitude of gross error ( $\sigma_{g}$ ), cf. columns Figure 5(b-c). (3) PALMR can handle training data of which up to $80\%$ are grossly corrupted, and delivers better results than MGLM. On the other hand, its performance is slightly worse if more than $80\%$ of the training data are corrupted, cf. column Figure 5(d). (4) PALMR can reliably identify most of the gross errors. Still, it may not always correctly recover the true value of the error. This is evidenced in the last row of Figure 5, where the MSGE on $G$ increases as $\sigma_{g}$ or $\beta$ increases, and our correction error starts to stand out (i.e. it becomes larger than the prediction errors of both PALMR and MGLM) when more than $\beta=30\%$ of the training samples are grossly corrupted. We believe this is acceptable, as in most practical situations only a small fraction of the training examples would be contaminated by gross errors.

Finally, we compare the proposed method PALMR and MGLM with a Euclidean multivariate linear regression model with gross errors, described in equation (12) of Example 1. All experimental settings are the same as above except for three aspects: (i) Since the Euclidean model cannot deal with DTI tensors directly, for each tensor $\bm{y}$ we vectorize its upper triangular part into a 6-dimensional vector. Therefore, $X\in\mathbb{R}^{50\times 2}$ and $Y\in\mathbb{R}^{50\times 6}$ in model (12). (ii) Since predictions of the Euclidean model are not guaranteed to lie on the SPD manifold, the geodesic metrics are not applicable. As an alternative, we adopt the Frobenius norm distance $\|\bm{y}-\hat{\bm{y}}\|_{F}$ to measure the distance between the prediction $\hat{\bm{y}}$ and the ground truth $\bm{y}$ . (iii) We only investigate the effect of the magnitude of gross errors and the ratio of gross errors in the training data. Results are shown in Figure 6, where the $y$ -axis in each plot denotes the log-scale of the median error over 10 repetitions measured by the Frobenius norm. We observe that PALMR achieves the best performance and outperforms the Euclidean model by a large margin under various settings. MGLM also performs better than the Euclidean model, but when there are large gross errors in the training data, its advantage disappears, as can be seen in the left plot. These observations are in line with our expectation, since the Euclidean model does not respect the intrinsic structure of the DTI data.
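The vectorization in (i) and the Frobenius distance in (ii) can be sketched as follows. The row-major ordering of the six entries and the absence of any rescaling of the off-diagonal entries are our assumptions; the text only states that the upper triangular part is vectorized.

```python
import math

def vec_upper(S):
    """Stack the upper triangular part (including the diagonal) of a
    symmetric n x n matrix row by row; n = 3 gives a 6-vector."""
    n = len(S)
    return [S[i][j] for i in range(n) for j in range(i, n)]

def frobenius_distance(A, B):
    """Frobenius norm of A - B for two matrices of equal shape."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))

S = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 5.0],
     [3.0, 5.0, 6.0]]
print(vec_upper(S))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero = [[0.0] * 3 for _ in range(3)]
print(frobenius_distance(identity, zero))
```

Note that a vector predicted by the Euclidean model can be mapped back to a symmetric matrix, but nothing forces that matrix to be positive definite, which is why the geodesic metrics cannot be used for it.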

4.2 Real DTI data

In this section, we apply PALMR to examine the effect of age and gender on human brain white matter. We experiment with the C-MIND database (https://cmind.research.cchmc.org) released by Cincinnati Children’s Hospital Medical Center (CCHMC) with the purpose of investigating brain development in children from infants and toddlers (0 $\sim$ 3 years) through adolescence (18 years). We use the imaging data of participants who were scanned at CCHMC at year one and whose ages were between 8 and 18 (2947 to 6885 days), consisting of 27 females and 31 males. The DTI data of each subject are first manually inspected and corrected for subject movements and eddy current distortions using FSL’s eddy tool [63], then passed to FSL’s brain extraction tool to delete non-brain tissue (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/). After the pre-processing, we use FSL’s DTIFIT tool to reconstruct DTI tensors. Finally, all DTIs are registered to a population-specific template constructed using DTI-TK (http://dti-tk.sourceforge.net/pmwiki/pmwiki.php). We investigate six exemplar slices that have been identified as typical by domain experts and have been similarly used by many existing works such as [2, 27]. In particular, we are interested in the white matter region. At each voxel within the white matter region, the following multivariate regression model

[TABLE]

is adopted to describe the relation between the DTI data $\bm{y}$ and variables ‘age’ and ‘gender’.

In DTI studies, another frequently used measure of a tensor is fractional anisotropy (FA) [64, 65] defined as

[TABLE]

where $\lambda_{1}$ , $\lambda_{2}$ and $\lambda_{3}$ are the eigenvalues of the tensor. FA is an important measure of diffusion asymmetry within a voxel and reflects fiber density, axonal diameter, and myelination in white matter. In our experiments, we compare three models: two geodesic regression models, MGLM and PALMR, and the FA regression model, which uses the FA value in place of the tensor $\bm{y}$ in Eq. (23). The relative FA error metric is employed to compare the results of geodesic regression and FA regression, as follows. Since the responses of geodesic regression are tensors, the FA values of the tensors can be computed. The relative FA error is then evaluated on testing data and is defined as the mean relative error between the FA values of the predicted tensors and those of the true tensors. Besides this relative FA error, the mean squared geodesic error (MSGE) on testing data, as in subsection 4.1, is also used to compare the performance of MGLM and PALMR.
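Both quantities are straightforward to compute from the eigenvalues. The sketch below uses the standard definition of FA in terms of pairwise eigenvalue differences, which we assume matches the definition above, and takes the relative error as $|\mathrm{FA}_{pred}-\mathrm{FA}_{true}|/\mathrm{FA}_{true}$ averaged over the test samples; the exact normalization of the relative error is our assumption.

```python
import math

def fractional_anisotropy(l1, l2, l3):
    """Standard FA of a diffusion tensor with eigenvalues l1, l2, l3."""
    num = (l1 - l2) ** 2 + (l2 - l3) ** 2 + (l3 - l1) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    return math.sqrt(0.5 * num / den)

def mean_relative_fa_error(fa_pred, fa_true):
    """Mean of |FA_pred - FA_true| / FA_true over all test samples."""
    return sum(abs(p - t) / t for p, t in zip(fa_pred, fa_true)) / len(fa_true)

print(fractional_anisotropy(1.0, 1.0, 1.0))  # 0.0 (isotropic diffusion)
print(fractional_anisotropy(1.0, 0.0, 0.0))  # 1.0 (diffusion along one axis)
print(mean_relative_fa_error([0.5, 0.3], [0.4, 0.3]))
```

FA collapses each tensor to a single number in $[0,1]$ , which is why the FA regression model cannot use the orientation information that the geodesic models retain.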

4.2.1 Model significance

To examine the significance of the statistical model of Eq. (23) considered in our approach, the following hypothesis test is performed. The null hypothesis is $H_{0}:\bm{v}_{1}=0$ , which means that age has no effect on the DTI data. We randomly permute the values of age among all samples while keeping the DTI data fixed (empirical results investigating the effect of ‘gender’ are provided in Section 5 of the supplementary file), then apply the model of Eq. (23) to the permuted data and compute the mean squared geodesic error $MSGE_{perm}=\frac{1}{N}\sum_{i}\mbox{dist}(\bm{y}_{i},\hat{\bm{y}}^{p}_{i})^{2}$ , where $\hat{\bm{y}}^{p}_{i}$ is the prediction of PALMR on the permuted data. Repeating the permutation $T=1000$ times, we obtain a sequence of errors $\{MSGE_{perm}^{i}\}_{i=1}^{T}$ and calculate a $p$ -value at each voxel using $p\mbox{-value}:=\frac{|\{i\mid MSGE_{perm}^{i}<MSGE_{train}\}|}{T}$ . Figure 7 presents the maps of voxel-wise $p$ -values for the three models using six typical slices, and Figure 8 displays the distribution of $p$ -values for all six slices collectively.
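The permutation procedure at a single voxel can be sketched as follows. The function `fit_error` stands for fitting the regression model and returning its training error; here it is replaced by a hypothetical scalar least-squares fit (`ols_error`) so that the example runs on its own, and the data are synthetic.

```python
import random

def permutation_p_value(ages, ys, fit_error, T=1000, seed=0):
    """Fraction of permutations whose error is below the unpermuted training error."""
    rng = random.Random(seed)
    train_error = fit_error(ages, ys)
    count = 0
    for _ in range(T):
        shuffled = ages[:]          # permute the covariate, keep the responses fixed
        rng.shuffle(shuffled)
        if fit_error(shuffled, ys) < train_error:
            count += 1
    return count / T

def ols_error(xs, ys):
    """Stand-in model: mean squared residual of a least-squares line fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return sum((y - my - slope * (x - mx)) ** 2 for x, y in zip(xs, ys)) / n

ages = [float(a) for a in range(8, 18)]
ys = [2.0 * a + 1.0 + 0.1 * (-1) ** i for i, a in enumerate(ages)]  # strong age effect
p_value = permutation_p_value(ages, ys, ols_error, T=200)
print(p_value)
```

When the covariate genuinely explains the response, almost no permutation fits better than the original pairing, so the $p$ -value is small.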

As shown in Figure 7 and Figure 8, geodesic regression models are able to capture more white matter regions with aging effects than the FA regression model. In addition, voxels satisfying $p$ -value $\leq 0.05$ are more spatially contiguous when geodesic regression models are used, as can be seen from the zoom-in plot for each slice in Figure 7. This may be attributed to the fact that geodesic regression models preserve more geometric information of tensor images than FA regression does. We also observe that PALMR and MGLM obtain very similar results. This is to be expected, as both methods use the model of Eq. (23) and adopt geodesic regression on manifolds. The main difference is that PALMR considers gross errors while MGLM does not, and in this experiment there is no gross error in the DTI data.

4.2.2 Model predictability

We proceed to investigate the predictability of PALMR when compared with existing methods such as FA regression and MGLM. For each of the six slices, we randomly partition our data into 40 training (20 female + 20 male) and 18 testing (7 female + 11 male) data, then train all three methods on each voxel within the white matter region. To test the ability of PALMR in handling gross errors, we consider three different experimental settings: (1) No gross error, where all training data are fully preprocessed as described at the beginning of subsection 4.2; (2) 20% manual gross error, where for each voxel we randomly select $20\%$ of training instances and insert gross error with magnitude $\sigma_{g}=5$ ; (3) 20% registration error, where $20\%$ of the patients in the training data are randomly selected to undergo an incomplete registration processing. Compared with fully preprocessed data, DTI data with registration error are obtained by skipping the diffeomorphic registration step in DTI-TK. The purpose of experimenting on data with registration error is to imitate the realistic scenario that gross error can be caused by improper preprocessing of the data. We should remark that registration error is more challenging to handle than the manual gross error, since its magnitude varies dramatically for different voxels and patients. A heat map of registration error for each slice is provided in Fig. 1 of the supplementary file. In this case, instead of considering all voxels on each slice, we set a threshold value $\omega$ and consider those voxels whose minimum registration error is greater than $\omega$ . For the first four slices, we set $\omega=0.7$ and for the last two slices we set $\omega=0.5$ . The three comparison methods are examined on the three types of training data, and for each voxel the experiments are repeated 10 times.

TABLE I provides the median values of the prediction errors, measured with both relative FA error and MSGE, over all voxels of all six slices. As TABLE I clearly indicates, the geodesic regression models again outperform the FA regression model, which is to be expected. Moreover, when there is no gross error in the training data, MGLM and PALMR achieve similar results. This is consistent with the claim that MGLM is a special case of PALMR when there is no gross error. In addition, the ‘20% manual gross error’ column shows that when $20\%$ of the training data contain gross errors, PALMR outperforms MGLM by a large margin. For the challenging case of 20% registration error, the last column of TABLE I shows that PALMR is still much better than its competitors. In Figure 9, we use box plots to demonstrate the performance advantage of PALMR over its competitors. For each metric, the performance improvement is computed as $\frac{\text{error of the best competitor} - \text{error of PALMR}}{\text{error of the best competitor}} \times 100\%$. Figure 9 displays the same results as TABLE I from a different perspective and in more detail. We first compute the performance improvement of PALMR on each voxel of all six slices to get a percentage value, then pool all values under the same metric and experimental setting into one box plot. Figure 9 shows that PALMR improves the median prediction error by at least 20% in the case of manual gross error and by at least 15% in the case of registration error.
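As a small worked illustration of the performance improvement formula, the sketch below computes the per-voxel percentage and then its median. The error values are made up purely for the arithmetic and are not results from the paper. Taking the "best competitor" to be whichever of FA regression and MGLM has the lower error on that voxel is an assumption, as are the function and variable names.

```python
from statistics import median


def improvement_percent(palmr_error, competitor_errors):
    """Relative improvement of PALMR over the best (lowest-error) competitor, in percent."""
    best = min(competitor_errors)
    return (best - palmr_error) / best * 100.0


# Hypothetical per-voxel prediction errors (illustrative numbers only).
voxel_errors = [
    {"palmr": 0.5, "fa": 2.0, "mglm": 1.0},
    {"palmr": 0.75, "fa": 1.0, "mglm": 1.25},
    {"palmr": 1.5, "fa": 1.0, "mglm": 2.0},  # PALMR worse here: negative improvement
]

improvements = [
    improvement_percent(v["palmr"], [v["fa"], v["mglm"]]) for v in voxel_errors
]
median_improvement = median(improvements)

print(improvements)
print(median_improvement)
```

A positive value means PALMR has a lower error than the best competitor on that voxel, and a negative value means it has a higher one; values of this kind, pooled per metric and setting, are what a box plot such as Figure 9 summarizes.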

The distributions of prediction errors measured by the relative FA error and the MSGE on each slice are shown in Figure 10 and Figure 11, respectively. In each plot, a method whose distribution lies further to the left is better than one whose distribution lies further to the right. Figure 10 and Figure 11 lead to observations similar to those drawn from TABLE I. Moreover, Figure 11 shows that PALMR is more robust to gross errors than its competitors. In Fig. 2 of the supplementary file, we also compare the prediction errors of MGLM and PALMR on each voxel of all slices. We observe that on most of the voxels PALMR is better than MGLM when gross errors are present. More experimental results on real DTI data are available in Section 5 of the supplementary file.

5 Conclusion and Future Work

This paper focuses on the problem of multivariate regression on manifolds with gross error contamination, whose mathematical formulation leads to a challenging nonconvex and nonsmooth optimization problem on manifolds. A new algorithm, PALMR, is proposed to address this problem, and its convergence property is analyzed. Through empirical studies, PALMR is shown to be capable of dealing with the presence of gross error and of producing reliable results. For future work, there are several directions to explore. In terms of theoretical study, it remains to investigate the recoverability of the proposed model, that is, to study conditions under which our model can correctly locate gross errors and recover their magnitude. It is also of interest to analyze the asymptotic behaviour of the resulting estimators. In terms of applications, in addition to age and gender, one may also consider the influence of handedness (i.e. left- or right-handed) on DTI responses. We also plan to apply our framework to other applications, including shape analysis and robotics, where the manifolds of interest could be $SO(3)$ and $SE(3)$.

