Optimal Recovery of Precision Matrix for Mahalanobis Distance from High Dimensional Noisy Observations in Manifold Learning

Matan Gavish; Ronen Talmon; Pei-Chun Su; Hau-Tieng Wu

arXiv:1904.09204·math.ST·September 13, 2021

Open Access

TL;DR

This paper investigates the estimation of Mahalanobis distance and its precision matrix from noisy high-dimensional data, providing theoretical insights, optimal shrinkage methods, and applications to manifold learning and dynamical systems.

Contribution

It introduces an asymptotically optimal shrinker for precision matrix estimation, extending results to manifold settings and analyzing the impact of noise on Mahalanobis distance.

Findings

01. Identifies the noise threshold where Mahalanobis distance fails

02. Proposes an optimal shrinker for precision matrix estimation

03. Demonstrates improved performance over classical methods

Abstract

Motivated by establishing theoretical foundations for various manifold learning algorithms, we study the problem of estimating the Mahalanobis distance (MD), and the associated precision matrix, from high-dimensional noisy data. By relying on recent transformative results in covariance matrix estimation, we demonstrate the sensitivity of MD and the associated precision matrix to measurement noise, determining the exact asymptotic signal-to-noise ratio at which MD fails, and quantifying its performance otherwise. In addition, for an appropriate loss function, we propose an asymptotically optimal shrinker, which is shown to be beneficial over the classical implementation of the MD, both analytically and in simulations. The result is extended to the manifold setup, where the nonlinear interaction between curvature and high-dimensional noise is accounted for. The developed solution is applied…

Tables

Table 1. The normalized loss of the optimal shrinker estimator $M_{n}=\tilde{\eta}^{*}(S_{n})$ and the classical estimator $M_{n}=\eta^{\text{classical}}_{\sigma}(S_{n})$ in the manifold setup. The mean and the standard deviation over 500 realizations are reported.

$\beta$	$\sigma$	$\text{Error}(M_{n},y_{1})$, $\eta_{\sigma}^{\text{classical}}(S_{n})$	$\text{Error}(M_{n},y_{1})$, $\tilde{\eta}^{*}(S_{n})$	$\text{Error}(M_{n},y_{2})$, $\eta_{\sigma}^{\text{classical}}(S_{n})$	$\text{Error}(M_{n},y_{2})$, $\tilde{\eta}^{*}(S_{n})$
$0.1$	$1$	18.78 $\pm$ 1.16	0.78 $\pm$ 0.54	55.98 $\pm$ 2.09	1.32 $\pm$ 0.93
$0.1$	$1.5$	23.72 $\pm$ 1.19	1.41 $\pm$ 0.87	59.42 $\pm$ 2.10	2.59 $\pm$ 1.74
$0.1$	$2$	33.54 $\pm$ 1.53	2.18 $\pm$ 1.31	64.04 $\pm$ 1.69	5.19 $\pm$ 2.99
$0.5$	$1$	26.86 $\pm$ 2.70	2.41 $\pm$ 1.55	60.88 $\pm$ 4.22	4.06 $\pm$ 2.54
$0.5$	$1.5$	42.66 $\pm$ 3.43	4.78 $\pm$ 2.65	69.06 $\pm$ 3.43	10.65 $\pm$ 5.38
$0.5$	$2$	58.59 $\pm$ 3.24	9.84 $\pm$ 4.63	77.24 $\pm$ 2.81	31.52 $\pm$ 17.56
$1$	$1$	34.70 $\pm$ 4.89	4.05 $\pm$ 2.35	64.11 $\pm$ 5.57	8.39 $\pm$ 3.91
$1$	$1.5$	54.72 $\pm$ 4.46	10.62 $\pm$ 4.69	75.54 $\pm$ 3.63	23.97 $\pm$ 12.49
$1$	$2$	69.65 $\pm$ 3.97	21.35 $\pm$ 7.94	83.30 $\pm$ 2.78	62.99 $\pm$ 19.34

Equations (147)

$$\Sigma_{\mathcal{O}(x)}:=\mathbb{E}\big[(X-\mu_{\mathcal{O}(x)})(X-\mu_{\mathcal{O}(x)})^{\top}\chi_{\mathcal{O}(x)}(X)\big]\in\mathbb{R}^{p\times p}$$

$$\mu_{\mathcal{O}(x)}:=\mathbb{E}\big[X\chi_{\mathcal{O}(x)}(X)\big]=\frac{1}{|\mathcal{O}(x)\cap\iota(M)|}\int_{\mathcal{O}(x)\cap\iota(M)}zP(z)\,dz,$$

$$X_{x}:=X\chi_{\mathcal{O}(x)}(X),$$

$$Y=X+\sigma\xi,$$

$$Y_{x}:=Y\chi_{\mathcal{O}(x)}(X),$$

$$X\sim\mathcal{N}(\mu,\Sigma_{X})\in\mathbb{R}^{p},$$

$$X=\mu+\sum_{l=1}^{d}\sqrt{\lambda_{l}}\,\zeta_{l}u_{l},$$

$$Y=X+\sigma\xi,$$

$$\Sigma_{B_{\epsilon}^{\mathbb{R}^{p}}(\mu_{x})}=\frac{|S^{d-1}|P(x)}{d(d+2)}\epsilon^{d+2}\left(\begin{bmatrix}I_{d\times d}&0\\ 0&0\end{bmatrix}+O(\epsilon^{2})\right),$$

$$d_{\Sigma_{x}}^{2}(z,X_{x}):=(z-\mu_{x})^{\top}I_{d}(\Sigma_{x})(z-\mu_{x}),$$

$$I_{d}(\Sigma_{x}):=U\,\textup{diag}(1/\lambda_{1},\ldots,1/\lambda_{d},0,\ldots,0)\,U^{\top},$$

$$d_{\Sigma_{X}}^{2}(z,X)=(z-\mu)^{\top}\Sigma_{X}^{\dagger}(z-\mu),$$

$$d_{\Sigma_{X}}^{2}(z,X)=(z-\mu)^{\top}(WW^{\top})(z-\mu)=\|W^{\top}(z-\mu)\|_{\mathbb{R}^{p}}^{2},$$

$$\Sigma_{X}=c^{2}A\Sigma_{X}A^{\top}$$

$$d_{\Sigma_{X}}^{2}(\tilde{z},X)$$

$$\frac{d(d+2)}{|S^{d-1}|\epsilon^{d+2}}\tilde{d}_{\Sigma_{x}^{\dagger}}(z,X_{x})=t+O(t),$$

$$\|\theta-\bar{\theta}\|_{2}^{2}=\frac{1}{2}(x-\bar{x})^{\top}\big[C^{\dagger}+\bar{C}^{\dagger}\big](x-\bar{x})+O(\|x-\bar{x}\|_{2}^{4}),$$

$$\Sigma_{x}=\nabla\phi|_{\theta}\nabla\phi|_{\theta}^{\top}$$

$$d\theta(t)=a(\theta(t))\,dt+d\omega(t),$$

$$du_{t}=\Big(\frac{1}{2}\Delta\Phi|_{\theta_{t}}+\nabla\Phi|_{\theta_{t}}a(\theta_{t})\Big)dt+\nabla\Phi|_{\theta_{t}}\,d\omega_{t}$$

$$\textup{Cov}(du_{t})=\nabla\Phi|_{\theta_{t}}\nabla\Phi|_{\theta_{t}}^{\top}$$

$$\|\theta_{i}-\theta_{j}\|_{\mathbb{R}^{d}}^{2}=\frac{1}{2}(u_{i}-u_{j})^{\top}\big[C_{i}^{\dagger}+C_{j}^{\dagger}\big](u_{i}-u_{j})+O(\|u_{i}-u_{j}\|^{4}),$$

$$d_{M_{n}}^{2}(z,X)=(z-\mu)^{\top}M_{n}(z-\mu).$$

$$\Big|d^{2}_{\Sigma_{X}}(z,X)-d^{2}_{M_{n}}(z,X)\Big|=\Big|(z-\mu)^{\top}\big[\Sigma_{X}^{\dagger}-M_{n}\big](z-\mu)\Big|.$$

$$L_{n}(M_{n},\Sigma_{X}^{\dagger})$$

$$\eta_{\sigma}^{\textup{classical}}(\alpha)=\begin{cases}1/(\alpha-\sigma^{2})&\alpha>\sigma^{2}\\ 0&\alpha\leq\sigma^{2}\end{cases}.$$

$$\lim_{n\to\infty}L_{n}(\eta_{\sigma}^{\textup{classical}}(S_{n}),\Sigma_{X}^{\dagger})=0.$$

$$\Sigma_{X}=U\begin{bmatrix}\Sigma_{d}&0\\ 0&0_{p-d}\end{bmatrix}U^{\top}\in\mathbb{R}^{p\times p},$$

$$S_{n}=V_{n}\,\textup{diag}(\lambda_{1,n},\ldots,\lambda_{p,n})\,V_{n}^{\top}\in\mathbb{R}^{p\times p},$$

$$\frac{\sqrt{(\lambda_{+}-x)(x-\lambda_{-})}}{2\pi\beta x}\mathbf{1}_{[\lambda_{-},\lambda_{+}]}\,dx,$$

Taxonomy

Topics: Statistical Mechanics and Entropy · Sparse and Compressive Sensing Techniques · Blind Source Separation Techniques

Full text

Optimal Recovery of Precision Matrix for Mahalanobis Distance from High Dimensional Noisy Observations in Manifold Learning

Matan Gavish, The School of Computer Science and Engineering, Hebrew University of Jerusalem

Ronen Talmon, The Department of Electrical Engineering, Technion – Israel Institute of Technology

Pei-Chun Su, The Department of Mathematics, Duke University

Hau-Tieng Wu, The Department of Mathematics and Department of Statistical Science, Duke University

Abstract.

Motivated by establishing theoretical foundations for various manifold learning algorithms, we study the problem of estimating the Mahalanobis distance (MD), and the associated precision matrix, from high-dimensional noisy data. By relying on recent transformative results in covariance matrix estimation, we demonstrate the sensitivity of MD and the associated precision matrix to measurement noise, determining the exact asymptotic signal-to-noise ratio at which MD fails, and quantifying its performance otherwise. In addition, for an appropriate loss function, we propose an asymptotically optimal shrinker, which is shown to be beneficial over the classical implementation of the MD, both analytically and in simulations. The result is extended to the manifold setup, where the nonlinear interaction between curvature and high-dimensional noise is accounted for. The developed solution is applied to study a multiscale reduction problem in dynamical system analysis.

Keywords: Mahalanobis distance, large $p$ large $n$, optimal shrinkage, precision matrix

2000 Math Subject Classification: 34K30, 35K57, 35Q80, 92D25

1. INTRODUCTION

High-dimensional datasets encountered in modern science often exhibit nonlinear low-dimensional structures. One prominent approach to deal with such point clouds is to model their nonlinear structures by manifolds. In the last two decades, this direction has led to the emergence of a multitude of manifold learning methods, including the classical ISOMAP [42], locally linear embedding (LLE) [29], Hessian LLE [8], eigenmap [4], and diffusion maps [5], as well as the more recent vector diffusion maps [33], multiview diffusion maps [21], and alternating diffusion maps [18, 41], to name but a few. Typically in manifold learning, point clouds in $\mathbb{R}^{p}$ are assumed to be sampled from a $d$-dimensional smooth manifold $\mathcal{M}$ embedded in $\mathbb{R}^{p}$, usually with some additional contaminating noise. In this setting, the manifold represents the “essence” or the “signal” of the data. Consequently, the goal in manifold learning is to recover the geometric or topological structure of the manifold from the data points and, in turn, to use the recovered structure to embed the high-dimensional data in a low-dimensional space, yielding a compact and informative representation of the data. This approach has been successfully applied in a broad range of fields, e.g., dynamical systems modeling [38, 48], sleep stage assessment [45, 22], cryo-electron microscopy [34], image denoising [31], single-channel blind source separation such as fetal electrocardiogram analysis [36] and stimulation artifact removal for the intracranial electroencephalogram [1], and long-term physiological signal visualization and analysis [20, 43].

Classical manifold learning methods rely heavily on meaningful measures of pairwise discrepancy between data points. In this so-called metric design problem, the data analyst aims to find a useful metric representing the relationship between data points embedded in a high-dimensional space. In this paper, we study the Mahalanobis distance (MD), a popular and arguably the first method for metric design [23, 26]. MD was originally proposed in 1936 with the classical low-dimensional setting in mind, namely, for the case where the ambient dimension $p$ is much smaller than the dataset size $n$. Interestingly, due to its useful statistical and invariance properties, MD became the basis of several geometric data analysis techniques [49, 44, 47], aimed specifically at the high-dimensional regime $p\asymp n$.

In a recent line of work [30, 37], a variant of MD was proposed and used to reveal hidden intrinsic manifolds underlying high-dimensional, possibly multimodal, observations. The main purpose of MD in this hidden manifold setup is handling possible deformations caused by the observation or sampling process. Broadly, this is carried out by estimating a quadratic form of the Jacobian of the (unknown) observation function, which is equivalent to estimating the precision matrix locally on the manifold. It was recently shown that MD is also implicitly used in the seminal LLE algorithm [29], when the barycenter step is properly expressed [24].

As the number of dimensions $p$ in typical data analysis applications continues to grow, it becomes increasingly crucial to understand the behaviour of MD, as well as other metric design algorithms, in the high-dimensional regime $p\asymp n$. At first glance, it might seem that this regime poses little more than a computational inconvenience for metric design using MD. Indeed, it is easy to show that in the absence of measurement noise, MD cares little about the increase in the ambient dimension $p$. This paper, however, calls attention to the following key observation. In the high-dimensional regime $n\asymp p$, in the presence of ambient measurement noise, a new phenomenon emerges, which introduces various nontrivial effects on the performance of MD. Depending on the noise level, in the high-dimensional regime, MD may be adversely affected or even fail completely. Clearly, measurement noise cannot realistically be excluded, and yet, to the best of our knowledge, this phenomenon has not previously been fully studied. A first step in this direction was taken in [6], with the calculation of the distribution of MD under specific assumptions.

Let us first describe this key phenomenon informally. The computation of MD involves estimation of the inverse covariance matrix, or precision matrix, corresponding to the data at hand. Classically, the estimation relies on the sample covariance, which is inverted using the Moore-Penrose pseudo-inverse. It is well known that, in the high-dimensional setup, that is, in the regime $n\asymp p$, the sample covariance matrix is a poor estimator of the underlying covariance matrix. Indeed, advances in random matrix theory from the last decade imply that the eigenvalues and eigenvectors of the sample covariance matrix are both biased, namely, they do not converge to the corresponding eigenvalues and eigenvectors of the underlying population covariance matrix [28, 14]. Such biases in small eigenvalues, which lead to inaccurate covariance matrix estimation, become immense when applying the Moore-Penrose pseudo-inverse.

This challenge in the high-dimensional setup is amplified when the data have a nonlinear manifold structure. Inverting small (inconsistent) eigenvalues is challenging when evaluating the precision matrix, and it becomes more challenging still in the context of manifold learning, where the estimation of MD is performed locally [30, 37]. Ideally, under the manifold assumption, in infinitesimally small neighborhoods and for a sufficiently large number of noiseless samples, the rank of the local sample covariance equals the dimension of the manifold, which is typically smaller than the dimension of the ambient space, so the sample covariance is low rank with the remaining eigenvalues strictly zero. In practice, however, because the sample set is finite, the considered neighborhoods cannot be made sufficiently small. In such cases, depending on the manifold curvature, samples from the manifold depart from the tangent space to the manifold, and the rank of their sample covariance matrix undesirably grows, with the eigenvalues related to curvature being much smaller than those related to the tangent space. When noise exists, the situation is further complicated by the interaction between those curvature-related small eigenvalues and the relatively small number of points compared with the ambient space dimension.

In this paper, we study this problem and propose a remedy. By relying on existing formal results in covariance matrix estimation, we measure the sensitivity of MD to measurement noise. Under the assumption that locally the data on the manifold lie on a low-dimensional linear subspace embedded in the ambient space $\mathbb{R}^{p}$ and that the measurement noise is white Gaussian, we are able to determine the exact asymptotic signal-to-noise ratio at which MD fails, and quantify its performance otherwise. Then, we propose a better MD estimator based on the idea of shrinkage of the associated precision matrix. It has been known since the 1970s [35] that by shrinking the sample covariance eigenvalues one can significantly mitigate the noise effects and improve covariance estimation in high dimensions. We formulate the classical MD as a particular choice of shrinkage estimator for the eigenvalues of the sample covariance. Building on recent results in high-dimensional covariance estimation, including the general theory in [7] and a special case with application in random tomography [32, Section 4.4], we find an asymptotically optimal shrinker for precision matrix estimation, which is better than the classical implementation of the MD whenever MD is computed from noisy high-dimensional data. We show that under a suitable choice of loss function for the estimation of MD, our shrinker is the unique asymptotically optimal shrinker; the improvement in asymptotic loss it offers over the classical MD is calculated exactly. We then extend these results to handle the challenge of designing a better metric when applying diffusion maps. This extension is nontrivial due to the interaction of curvature and noise, and the finite sample size. Finally, we apply diffusion maps with the proposed estimate of MD to separate slow and fast dynamics when the observation is contaminated by high-dimensional noise.

While the present paper focuses on MD, we posit that the same phenomenon holds much more broadly and in fact affects several widely used manifold learning algorithms, and metric learning algorithms in particular. In this regard, the present paper seeks to highlight the fact that manifold learning and metric learning algorithms will not perform as predicted by the noiseless theory in high dimensions, and may fail completely beyond a certain noise level.

2. PROBLEM SETUP

2.1. Manifold model

When a point cloud $\mathcal{X}:=\{x_{i}\}_{i=1}^{n}\subset\mathbb{R}^{p}$ has a nontrivial nonlinear structure, or even a nontrivial topological structure, a common approach is to model this structure by a manifold. This is known as the manifold assumption, and it holds for various kinds of practical data. Examples include cryo-electron microscopy [34], phase spaces of dynamical systems [38, 48], and various biomedical signals [36, 1, 20, 43]. The main feature of the manifold assumption is that the points are distributed on a nonlinear set, so that they are nonlinearly related, which generalizes the commonly used linear model.

To model $\mathcal{X}$ by the manifold model, consider a $p$-dimensional random vector $X:\Omega\rightarrow\mathbb{R}^{p}$, which is a measurable function with respect to the probability space $(\Omega,\mathcal{F},\mathbb{P})$, where $\mathbb{P}$ is the probability measure defined on the sigma algebra $\mathcal{F}$ in the event space $\Omega$. We assume that the range of $X$ is supported on a $d$-dimensional compact, smooth Riemannian manifold $(M,g)$ isometrically embedded in $\mathbb{R}^{p}$ via $\iota:M\hookrightarrow\mathbb{R}^{p}$. In this work, we assume that $M$ is boundary-free to simplify the exposition. We note that the commonly considered linear model is a special manifold model, in which the manifold is an affine subspace of $\mathbb{R}^{p}$.

On the manifold, the associated statistical setup is as follows. Let $\tilde{\mathcal{B}}$ be the Borel sigma algebra of $\iota(M)$, and let $\tilde{\mathbb{P}}_{X}$ denote the probability measure induced from $X$. Clearly, $\tilde{\mathbb{P}}_{X}$ is defined on $\tilde{\mathcal{B}}$. Denote by $dV$ the Riemannian volume density of $M$ associated with the metric $g$. For simplicity, we assume that $\tilde{\mathbb{P}}_{X}$ is absolutely continuous with respect to the induced Riemannian measure on $\iota(M)$, denoted by $\iota_{*}dV$. By the Radon-Nikodym theorem, there exists a non-negative measurable function $P(z)$ defined on $\iota(M)$ such that $d\tilde{\mathbb{P}}_{X}(z)=P(z)\iota_{*}dV(z)$ for any $z\in\iota(M)\subset\mathbb{R}^{p}$. The probability density function (pdf) of $X$ on $M$ is defined to be $P(z)$. We further assume that $P(z)$ is smooth and bounded away from $0$. When $P(z)$ is constant, we call $X$ a uniform random sampling scheme; otherwise it is nonuniform. Now we introduce the key quantity of interest in this paper, the local covariance matrix.

Definition 1.

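Fix $x\in M$. For an open simply connected neighborhood of $x$, $\mathcal{O}(x)\subset\mathbb{R}^{p}$, define

$$\Sigma_{\mathcal{O}(x)}:=\mathbb{E}\big[(X-\mu_{\mathcal{O}(x)})(X-\mu_{\mathcal{O}(x)})^{\top}\chi_{\mathcal{O}(x)}(X)\big]\in\mathbb{R}^{p\times p}\tag{1}$$

as the local covariance matrix associated with $\mathcal{O}(x)$ centered at $\mu_{\mathcal{O}(x)}$, where

$$\mu_{\mathcal{O}(x)}:=\mathbb{E}\big[X\chi_{\mathcal{O}(x)}(X)\big]=\frac{1}{|\mathcal{O}(x)\cap\iota(M)|}\int_{\mathcal{O}(x)\cap\iota(M)}zP(z)\,dz,$$

$|\mathcal{O}(x)\cap\iota(M)|$ is the volume of $\mathcal{O}(x)\cap\iota(M)$, and $\chi_{\mathcal{O}(x)}$ is the indicator function of the set $\mathcal{O}(x)$.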

One main goal of considering this local covariance is to capture the directions of maximal variation in the dataset. As is known from principal component analysis, those directions are the eigenvectors of the local covariance matrix with the largest eigenvalues. We make a few remarks. First, by Nash’s isometric embedding theorem, $\Sigma_{\mathcal{O}(x)}$ is of rank $D$, where $D\leq d(3d+11)/2$ for any $\mathcal{O}(x)$. Second, the existing literature contains a different definition of the local covariance matrix, in which the mean $\mu_{\mathcal{O}(x)}$ is replaced with $\iota(x)$. In general, the two definitions are different, even when the set is perfectly symmetric, e.g., when $\mathcal{O}(x)$ is a ball and $P(x)$ is uniform. Indeed, since the range of $X$ is supported on $\iota(M)$, when $\iota(M)$ is not flat, $\iota(x)$ might deviate from the center of $\mathcal{O}(x)$ due to the curvature of the manifold; that is, $\frac{1}{|\mathcal{O}(x)\cap\iota(M)|}\int_{\mathcal{O}(x)\cap\iota(M)}(z-\iota(x))dz\neq 0$. While this seems to be a problem, it was shown in [24] that the difference between the center of $\mathcal{O}(x)$ and $\iota(x)$ is negligible (expressed as a higher order term in the error) when the diameter of $\mathcal{O}(x)$ is sufficiently small. Broadly, since locally the manifold can be well approximated by an affine subspace, when the diameter of $\mathcal{O}(x)$ is sufficiently small, the data located in $\mathcal{O}(x)$ can be well approximated by the tangent space to the manifold at $x$, i.e., $T_{x}M$. This point will be further addressed in Section 2.2.

The above derivation leads to the following local statistical model, which enables further study of the local structure of a manifold. For $x\in M$ and an open, simply connected neighborhood of $x$, $\mathcal{O}(x)\subset\mathbb{R}^{p}$, we define a new random vector

$$X_{x}:=X\chi_{\mathcal{O}(x)}(X),\tag{2}$$

with mean $\mu_{x}:=\mu_{\mathcal{O}(x)}$ and covariance matrix $\Sigma_{x}:=\Sigma_{\mathcal{O}(x)}$. By definition, $X_{x}$ is a bounded random vector, and hence all its moments are finite. Also, as has been widely discussed in the literature (see [24] and references therein), the first $d$ dominant eigenvectors of $\Sigma_{x}$ form an accurate estimate of the tangent space to the manifold at $x$. Specifically, if $\Sigma_{x}u_{l}=\lambda_{l}u_{l}$ for $l=1,\ldots,p$, where $\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{p}$, then the span of $\{u_{1},\ldots,u_{d}\}$ approximates $\iota_{*}T_{x}M$.

Often in applications, the manifold is not directly accessible due to additional noise, and we can only sample

$$Y=X+\sigma\xi,\tag{3}$$

where $\xi\sim\mathcal{N}(0,I_{p})$. As a consequence, for $x\in M$, in the presence of noise, the corresponding local random vector can be recast as

$$Y_{x}:=Y\chi_{\mathcal{O}(x)}(X),\tag{4}$$

with mean $\mu_{x}:=\mu_{\mathcal{O}(x)}$ and covariance matrix $\Sigma_{x}+\sigma^{2}I_{p}$. Although $Y_{x}$ is not a bounded random vector, all its moments are finite.

We remark that the local set $\mathcal{O}(x)$ can be defined in several plausible ways, depending on the problem at hand. A common choice is the ball $B_{x}(\epsilon)$ centered at $\iota(x)$ with radius $\epsilon>0$. In other applications, $\mathcal{O}(x)$ might be an ellipsoid [30, 37, 24] or a more general set, depending on the metric of interest. We will revisit this issue in the sequel.

2.2. Linear spiked model

In manifold learning, local kernels are commonly used, e.g., kernels based on radial basis functions. The use of kernels implies that only points around the center of the kernel $x\in M$ contribute to the algorithm outcome. Therefore, considering those points located inside the neighborhood of $x$, $\mathcal{O}(x)$, is sufficient for analyzing a manifold learning algorithm with a local kernel. Since the bandwidth of the kernel is typically small, the diameter of the local set $\mathcal{O}(x)$ is small as well. Consequently, by the definition of a manifold, data in $\mathcal{O}(x)$ can be well approximated by the tangent space to the manifold at $x$, which is a low-dimensional affine subspace. Note that in the special case where the manifold is an affine subspace, all points lie in a low-dimensional space.

Since locally the manifold can be viewed merely as a linear subspace, the local statistical model described in Section 2.1 can be well approximated by the classical linear spiked model (or spiked covariance model [14]), which is detailed next with slight modifications in the notation. We note that in this section, deviations of the samples from a linear space due to the manifold curvature will not be treated separately, and their effect will be considered together with the effect of the ambient noise. However, we will extend and test the ability of the proposed estimator to handle such phenomena in the simulation study in Section 6.

We now consider the spiked model. In plain English, a spiked model is a manifold model in which the manifold is an affine subspace of fixed dimension. Consider a point cloud in $\mathbb{R}^{p}$ supported on a $d$-dimensional linear subspace, where $d\leq p$. For simplicity, we assume that the points are independent and identically distributed (i.i.d.) samples from

$$X\sim\mathcal{N}(\mathbf{\mu},\Sigma_{X})\in\mathbb{R}^{p},\tag{5}$$

where $\mathbf{\mu}$ denotes the mean and $\Sigma_{X}$ denotes the population covariance matrix, whose rank is equal to $d$ . Note that we could consider a random vector with mean [math] and a finite fourth moment [7]; since this could only increase the notational burden without providing additional insights, we focus on this simplified model.

It is often convenient to note that a point cloud sampled from $X$ can be understood as i.i.d. samples of the $p$-dimensional random vector

$$X=\mathbf{\mu}+\sum_{l=1}^{d}\sqrt{\lambda_{l}}\,\zeta_{l}u_{l},\tag{6}$$

where $\lambda_{l}>0$, $\zeta_{l}\sim\mathcal{N}(0,1)$, $\mathbb{E}\zeta_{l}\zeta_{k}=\delta_{l,k}$, $\Sigma_{X}u_{l}=\lambda_{l}u_{l}$ and $\|u_{l}\|_{L^{2}}=1$ for $l=1,\ldots,d$. Thus, the $d$-dimensional affine subspace is the space spanned by $u_{1},\ldots,u_{d}$, which can be understood as spikes (hence the name of the model), shifted by $\mathbf{\mu}$. We note that this global linear model is related to the local structure of a manifold in (2) in the following way: $\mu$ here is the center point $x$ on the manifold in (2), and the $d$-dimensional affine subspace spanned by $u_{1},\ldots,u_{d}$ is the tangent space at $x$, while extra components are needed to capture the curvature of the manifold.

Similarly to the manifold model, suppose that the samples of $X$, which we refer to as the signal, are not directly observable. Instead, the observed data consist of samples from the random variable

$$Y=X+\sigma\xi,\tag{7}$$

where $0\leq\sigma<\infty$ and $\xi\sim\mathcal{N}(0,I_{p})$ is a Gaussian measurement noise independent of $X$, which we assume for simplicity to be white.
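As a rough numerical illustration of the noisy spiked model (a sketch, not taken from the paper), the following Python snippet draws a rank-$d$ Gaussian signal, adds white noise, and compares the leading sample covariance eigenvalues with and without noise. The function name `sample_noisy_spiked`, the zero mean, the random spike directions, and all parameter values are assumptions made for illustration; each direction is scaled by the square root of its eigenvalue so that the signal covariance has the prescribed eigenvalues.

```python
import numpy as np

def sample_noisy_spiked(n, p, spikes, sigma, rng):
    """Draw n samples of X (rank-d Gaussian signal, zero mean) and Y = X + sigma * xi."""
    d = len(spikes)
    # Orthonormal spike directions u_1, ..., u_d as columns of U (random, for illustration).
    U, _ = np.linalg.qr(rng.standard_normal((p, d)))
    zeta = rng.standard_normal((n, d))             # independent N(0, 1) coefficients
    X = (zeta * np.sqrt(spikes)) @ U.T             # Cov(X) = U diag(spikes) U^T
    Y = X + sigma * rng.standard_normal((n, p))    # white Gaussian measurement noise
    return X, Y

rng = np.random.default_rng(0)
n, p, sigma = 1000, 500, 1.0                       # aspect ratio p / n = 0.5
spikes = np.array([10.0, 5.0])                     # nonzero eigenvalues of Sigma_X
X, Y = sample_noisy_spiked(n, p, spikes, sigma, rng)

# Three largest sample covariance eigenvalues, using the known zero mean.
top_clean = np.linalg.eigvalsh(X.T @ X / n)[::-1][:3]
top_noisy = np.linalg.eigvalsh(Y.T @ Y / n)[::-1][:3]

print("population spikes      :", spikes)
print("noiseless sample (top3):", top_clean)
print("noisy sample     (top3):", top_noisy)
```

In the noiseless case the third eigenvalue is numerically zero, as expected for a rank-2 signal. With noise, the leading eigenvalues are inflated and the third eigenvalue is no longer small, even though the signal has no energy in that direction.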

3. PRECISION MATRIX AND MAHALANOBIS DISTANCE ESTIMATION

The focal point of this paper is the estimation of the MD under the manifold model from a statistical standpoint, which, as described in Section 2, leads to the classical linear spiked model. The estimation of the MD involves the estimation of the local precision matrix. Therefore, we start this section with details on the estimation of the precision matrix in Section 3.1, followed by a detailed description of the estimation of the MD in Section 3.2.

3.1. Precision matrix estimation and its challenge

In Section 2.1 and Section 2.2, we showed that the local and global statistical models are seemingly very similar. Indeed, at first glance, both models consist of a hidden “signal” component ((2) and (5)) and noisy accessible observations ((4) and (7)). Furthermore, under both models, the local and global population covariance matrices of the “signal” component ($\Sigma_{x}$ and $\Sigma_{X}$) are approximately low rank and precisely low rank, respectively.

However, one of the main claims of this work is that although the two models seem equivalent, subtle differences between them, particularly in the context of precision matrix estimation, become fundamental.

Let us first consider the estimation of the precision matrix in the simpler, classical linear spiked model described in (5) and (7). Without noise, the precision matrix of $X$ can be computed simply by using the Moore-Penrose pseudo-inverse. Suppose the eigendecomposition of $\Sigma_{X}$ is given by $\Sigma_{X}=U\textup{diag}(\lambda_{1},\ldots,\lambda_{d},0,\ldots,0)U^{\top}$, where $\lambda_{1}\geq\lambda_{2}\geq\ldots$ and $U\in O(p)$. Then, the pseudo-inverse is given by $\Sigma_{X}^{\dagger}=U\textup{diag}(1/\lambda_{1},\ldots,1/\lambda_{d},0,\ldots,0)U^{\top}$, namely, only the non-zero eigenvalues are inverted. The introduction of additive noise poses a significant challenge for such an estimation, since the small eigenvalues could be mixed with the noise. While the contribution of such small eigenvalues to the covariance matrix is limited, their effect on the precision matrix becomes significant.
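The asymmetry between covariance and precision can be checked on a toy example. The sketch below is an illustration under assumed values, not the paper's procedure: it builds a rank-2 covariance with one strong and one weak eigenvalue, forms the pseudo-inverse by inverting the top $d$ eigenvalues, and repeats the computation after adding $\sigma^{2}I_{p}$. The helper name `pinv_from_eig` and the chosen eigenvalues are hypothetical.

```python
import numpy as np

def pinv_from_eig(Sigma, rank):
    """Invert only the `rank` largest eigenvalues of a symmetric PSD matrix."""
    lam, U = np.linalg.eigh(Sigma)           # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]           # reorder to descending
    inv = np.zeros_like(lam)
    inv[:rank] = 1.0 / lam[:rank]            # remaining directions are mapped to zero
    return (U * inv) @ U.T

rng = np.random.default_rng(1)
p, d = 6, 2
U, _ = np.linalg.qr(rng.standard_normal((p, p)))      # random orthogonal basis
lam = np.array([4.0, 0.01] + [0.0] * (p - d))         # one strong and one weak direction
Sigma_X = (U * lam) @ U.T

sigma2 = 0.01                                          # noise variance, comparable to the weak eigenvalue
Sigma_Y = Sigma_X + sigma2 * np.eye(p)                 # population covariance of Y = X + sigma * xi

P_clean = pinv_from_eig(Sigma_X, d)
P_noisy = pinv_from_eig(Sigma_Y, d)

cov_change = np.linalg.norm(Sigma_Y - Sigma_X, 2)      # operator-norm change of the covariance
prec_change = np.linalg.norm(P_noisy - P_clean, 2)     # operator-norm change of the precision
print("covariance change:", cov_change)
print("precision change :", prec_change)
```

Here the covariance moves by only 0.01 in operator norm, while the rank-$d$ pseudo-inverse moves by about 50, because the weak eigenvalue doubles from 0.01 to 0.02 and its reciprocal halves from 100 to 50.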

In addition to this classical challenge, the estimation of the precision matrix under the manifold model poses another layer of complexity. To simplify the discussion, we assume that $\mathcal{O}(x)$ is a simply connected ball, that is, $\mathcal{O}(x)=B_{\epsilon}^{\mathbb{R}^{p}}(\mu_{x})$, where $B_{\epsilon}^{\mathbb{R}^{p}}(\mu_{x})$ is a Euclidean ball in $\mathbb{R}^{p}$ of radius $\epsilon>0$ centered at $\mu_{x}$, with $\epsilon$ sufficiently small. In this case, the geometric picture of the local covariance matrix is well captured by [46, Proposition 3.1], which is summarized as follows. Fix $x\in M$. Assume that the manifold is translated and rotated properly, so that $x=0$ and the tangent space in $\mathbb{R}^{p}$, $\iota_{*}T_{x}M$, is spanned by $\{e_{1},\ldots,e_{d}\}$, where $e_{j}$ is a unit $p$-dimensional vector with $1$ in the $j$-th entry. We have the following asymptotic expansion of (2) when $\epsilon>0$ is sufficiently small:

$$\Sigma_{B_{\epsilon}^{\mathbb{R}^{p}}(\mu_{x})}=\frac{|S^{d-1}|P(x)}{d(d+2)}\epsilon^{d+2}\left(\begin{bmatrix}I_{d\times d}&0\\ 0&0\end{bmatrix}+O(\epsilon^{2})\right),\tag{8}$$

where $S^{d-1}$ is the $(d-1)$-sphere, $|S^{d-1}|$ is the volume of $S^{d-1}$, and the implied constant in $O(\epsilon^{2})$ depends on the second fundamental form at $x$. From a statistical perspective, this expansion captures the intuition that the variability of the data located in $B_{\epsilon}^{\mathbb{R}^{p}}(\mu_{x})$ is restricted to only a few directions aligned with $T_{x}M$. Indeed, by applying perturbation theory to (8), the $d$ eigenvectors of $\Sigma_{B_{\epsilon}^{\mathbb{R}^{p}}(\mu_{x})}$ corresponding to the largest $d$ eigenvalues provide an approximate basis for the embedded tangent plane $\iota_{*}T_{x}M$. The expression of the covariance matrix in (8) implies that there is a significant spectral gap between the top $d$ eigenvalues and the remaining ones, which depend on the curvature of the manifold. Furthermore, these remaining eigenvalues could be very small, of order $\epsilon^{d+4}$ or higher. In other words, under the manifold setting, in contrast to the linear spiked model, even in the noiseless case, the covariance is not strictly low rank with only $d$ nonzero eigenvalues. If such small eigenvalues due to the curvature of the manifold are not properly handled before inversion, the resulting estimate of the precision matrix might be deformed. Moreover, such small eigenvalues add to the noise components, and together they obscure the spectral gap. Thus, it is necessary to determine the “effective rank” of the matrix, which is essential for the calculation of the precision matrix. For more details, we refer the reader to the detailed discussion in [46, 24].
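The spectral gap described here can be seen on the simplest curved example. The sketch below is an illustrative assumption, not an experiment from the paper: it takes the unit circle in $\mathbb{R}^{2}$ (so $d=1$, $p=2$), keeps the points within Euclidean distance $\epsilon$ of $x=(1,0)$, and computes their covariance. It uses the covariance normalized by the number of local points, so its overall scale differs from the unnormalized expectation in (8), but the eigen-structure is comparable. In this parametrization the tangent direction at $x$ is the second coordinate axis.

```python
import numpy as np

def local_covariance_on_circle(eps, m=200001):
    """Covariance of the unit-circle points within Euclidean distance eps of x = (1, 0)."""
    # The chord between angles 0 and t has length 2 sin(|t| / 2), so the ball keeps |t| <= t_max.
    t_max = 2.0 * np.arcsin(eps / 2.0)
    t = np.linspace(-t_max, t_max, m)              # dense uniform grid stands in for uniform sampling
    pts = np.column_stack([np.cos(t), np.sin(t)])
    centered = pts - pts.mean(axis=0)              # center at the local mean, not at x itself
    return centered.T @ centered / m

eps = 0.1
lam, vecs = np.linalg.eigh(local_covariance_on_circle(eps))   # ascending eigenvalues

print("tangent eigenvalue  :", lam[1])             # close to eps^2 / 3
print("curvature eigenvalue:", lam[0])             # smaller by a factor of order eps^2
print("ratio               :", lam[0] / lam[1])
print("top eigenvector     :", vecs[:, 1])         # close to +/- (0, 1), the tangent at x
```

The smaller eigenvalue is nonzero purely because of curvature. It is this kind of eigenvalue that, once noise is added, can no longer be distinguished from the noise floor by inspection of the spectrum alone.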

In addition to the above challenge due to the curvature, the presence of noise, particularly in the manifold setting, imposes another challenge: how to find the true neighbors? Specifically, note that the neighboring points of $x$ , denoted by $\chi_{\mathcal{O}(x)}$ , need to be identified from the noisy samples. When $\sigma=0$ , i.e., in the noiseless case, a neighbor can be easily identified if the diameter of $\mathcal{O}(x)$ is sufficiently small. However, when $\sigma>0$ , i.e., in the presence of noise, it is not clear if a neighbor determined from the noisy point cloud is truly a neighbor. Concretely, let $\mathcal{X}=\{x_{i}\}_{i=1}^{n}\subset\iota(M)\subset\mathbb{R}^{p}$ denote a set of independent and identically distributed (i.i.d.) random samples from $X$ , and $\mathcal{Y}=\{y_{i}\}$ , where $y_{i}=x_{i}+\sigma\xi_{i}$ is sampled from $Y$ . Then, in general $\|y_{i}-y_{j}\|_{\mathbb{R}^{p}}\leq\epsilon$ does not imply $\|x_{i}-x_{j}\|_{\mathbb{R}^{p}}\leq\epsilon$ . Thus, a naive approach to determine neighbors might fail. To the best of our knowledge, there are only a few existing algorithms for determining neighbors when the point cloud is noisy. One example is determining neighbors by the diffusion distance [5], which has solid theoretical support when noise exists [12, 13]. Since finding neighbors is still a challenging problem on its own, in this paper, we focus only on estimating the precision matrix, and subsequently, the MD, assuming that the true neighbors are known.
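To make the neighbor-identification issue concrete, the following minimal sketch compares $k$-nearest-neighbor sets computed from clean and from noisy points. The unit circle embedded in $\mathbb{R}^{p}$, the function names, and all parameter values are illustrative assumptions, not part of the setup above.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of each row (excluding the row itself)."""
    sq_norms = np.sum(points**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(sq_dists, np.inf)  # a point is not its own neighbor
    return np.argsort(sq_dists, axis=1)[:, :k]

def neighbor_overlap(n=200, p=500, sigma=0.2, k=10, seed=0):
    """Average fraction of clean k-NN that are also k-NN in the noisy cloud."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
    clean = np.zeros((n, p))
    clean[:, 0] = np.cos(angles)  # unit circle in the first two coordinates
    clean[:, 1] = np.sin(angles)
    noisy = clean + sigma * rng.standard_normal((n, p))  # y_i = x_i + sigma * xi_i
    nn_clean = knn_indices(clean, k)
    nn_noisy = knn_indices(noisy, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_clean, nn_noisy)]
    return float(np.mean(shared)) / k

overlap_noiseless = neighbor_overlap(sigma=0.0)
overlap_noisy = neighbor_overlap(sigma=0.2)
print(overlap_noiseless, overlap_noisy)
```

In this toy setting the noise contributes roughly $2\sigma^{2}p$ to every squared pairwise distance, with fluctuations that dominate the small clean distances between true neighbors, so the noisy neighbor sets are expected to agree poorly with the clean ones.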

3.2. Mahalanobis distance

We are now ready to define the MD: we start with the definition of the MD under the manifold model, and then for reference, we present the typical definition under the global linear spiked model.

Under the manifold model, since the covariance matrix may have no strictly zero eigenvalues even in the noiseless case due to the curvature, we consider the following definition of the MD.

Definition 2 (Mahalanobis distance under the manifold model).

Suppose $x\in M$ . The MD between $z\in\iota(M)$ and $X_{x}$ is defined by

[TABLE]

where $\mathcal{I}_{d}(\Sigma_{x})$ is the truncated pseudo-inverse of degree $d$ defined by

[TABLE]

$\Sigma_{x}=U\textup{diag}(\lambda_{1},\ldots,\lambda_{p})U^{\top}$ is the eigendecomposition of $\Sigma_{x}$ , and $\lambda_{1}\geq\lambda_{2}\geq\ldots$ .

Note that the knowledge of the manifold dimension $d$ is required for this definition; yet, it is in general not available and needs to be estimated.
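The truncated pseudo-inverse can be sketched as follows. We assume here that $\mathcal{I}_{d}(\Sigma_{x})$ inverts the $d$ largest eigenvalues and sets the remaining ones to zero, and that the MD is the corresponding quadratic form; the function names and the argument `center` are hypothetical.

```python
import numpy as np

def truncated_pinv(cov, d):
    """Invert the d largest eigenvalues of a symmetric PSD matrix, zero the rest."""
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]           # indices of the top-d eigenvalues
    top_vecs = eigvecs[:, order]
    return (top_vecs / eigvals[order]) @ top_vecs.T  # U_d diag(1/lambda) U_d^T

def mahalanobis_sq(z, center, cov, d):
    """Squared MD using the truncated pseudo-inverse of degree d."""
    diff = z - center
    return float(diff @ truncated_pinv(cov, d) @ diff)

# Two dominant directions plus one tiny curvature-like eigenvalue.
cov = np.diag([2.0, 0.5, 1e-6])
z = np.array([0.0, 0.0, 1e-3])
center = np.zeros(3)
md_truncated = mahalanobis_sq(z, center, cov, d=2)
md_full = float((z - center) @ np.linalg.pinv(cov) @ (z - center))
print(md_truncated, md_full)
```

The toy example shows why the truncation matters: a displacement of size $10^{-3}$ along the tiny-eigenvalue direction is ignored by the truncated version, while the full pseudo-inverse amplifies it to a distance of order one.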

For comparison, consider also the classical definition of the MD under the global linear spiked model.

Definition 3 (Mahalanobis distance under the linear spiked model).

The MD between an arbitrary point ${z}\in\mathbb{R}^{p}$ and the underlying signal distribution $X$ (5) is defined by

[TABLE]

where $\dagger$ denotes the Moore-Penrose pseudo-inverse.

Let us take a closer look at the latter definition. Since $\Sigma_{X}$ is positive semi-definite, by the Cholesky decomposition, we have $\Sigma_{X}^{\dagger}=WW^{\top}$ , where $W\in\mathbb{R}^{p\times d}$ . Hence

[TABLE]

which indicates that geometrically, MD evaluates the relationship between ${z}$ and (the mean of) $X$ by a proper linear transform. In Section 3.3, we relate $W^{\top}$ to the inverse of the Jacobian of arbitrary unknown observation functions and show that it gives rise to an important invariance property of the MD in the context of manifold learning. Here, we only demonstrate a primary merit of MD, which stems from its invariance to rotation and rescaling. Importantly, this invariance property holds for both definitions: under the linear spiked model as well as under the manifold model. Consider the linear spiked model and a random variable $\widetilde{X}=cAX$ , where $c\in\mathbb{R}$ models rescaling and $A\in O(p)$ models rotation. Here $O(p)$ denotes the group of $p$ -by- $p$ orthogonal matrices. The population covariance matrix of $\widetilde{X}$ is

[TABLE]

and its population mean is rotated and rescaled to ${\tilde{\mu}}=cA{\mu}$ . To demonstrate the invariance, suppose ${\tilde{z}}=cA{z}$ and observe the MD between ${\tilde{z}}$ and $\widetilde{X}$ :

[TABLE]

The same argument holds for the MD under the manifold model; for brevity, we omit the details.
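The invariance can be checked numerically. The sketch below assumes that the squared MD in Definition 3 is the quadratic form $(z-\mu)^{\top}\Sigma_{X}^{\dagger}(z-\mu)$ ; the dimensions, spike values, and the scale $c$ are arbitrary illustrative choices, and the tolerance passed to `np.linalg.pinv` is our own choice.

```python
import numpy as np

def md_sq(z, mu, cov):
    """Squared MD via the Moore-Penrose pseudo-inverse (small singular values cut)."""
    diff = z - mu
    return float(diff @ np.linalg.pinv(cov, rcond=1e-10) @ diff)

rng = np.random.default_rng(0)
p, d = 6, 2
spikes = np.array([3.0, 1.5])
basis = np.linalg.qr(rng.standard_normal((p, d)))[0]  # p x d, orthonormal columns
cov = basis @ np.diag(spikes) @ basis.T               # rank-d covariance
mu = rng.standard_normal(p)
z = rng.standard_normal(p)

# Rotate and rescale: X_tilde = c * A * X with A orthogonal.
c = 2.5
A = np.linalg.qr(rng.standard_normal((p, p)))[0]
cov_t = c**2 * A @ cov @ A.T
md_original = md_sq(z, mu, cov)
md_transformed = md_sq(c * A @ z, c * A @ mu, cov_t)

# Factor form: pinv(cov) = W W^T with W = U_d diag(spikes^{-1/2}).
W = basis / np.sqrt(spikes)
md_via_W = float(np.sum((W.T @ (z - mu)) ** 2))
print(md_original, md_transformed, md_via_W)
```

The last lines also illustrate the factorization $\Sigma_{X}^{\dagger}=WW^{\top}$ : the squared MD equals the squared Euclidean norm of $W^{\top}(z-\mu)$ .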

Now, recall that the goal in this work is to estimate the MD between a point $z\in\mathbb{R}^{p}$ and $X$ , when we only have access to noisy data sampled from $Y$ in (7). Concretely, assume that ${y}_{1},\ldots,{y}_{n}\stackrel{{\scriptstyle\text{iid}}}{{\sim}}Y$ is a sample of $n$ data points. Since $\Sigma_{X}$ is unknown, the quantity $d_{\Sigma_{X}}({z},X)$ , or $\Sigma_{X}^{\dagger}$ in (10), must be estimated from the (noisy) data. For simplicity, we assume below that ${\mu}$ and $\sigma$ are known; these assumptions can easily be removed in applications to real data. As discussed in Section 3.1, the notable challenge in estimating the MD from local samples of $Y$ is the estimation of the precision matrix via the pseudo-inverse, stemming from the interaction of the small eigenvalues of $\Sigma_{X}$ and the noise. In addition, as discussed in Section 3.1, the main difference between (5) and (2) is that in (5), the covariance matrix $\Sigma_{X}$ is global and strictly of low rank, whereas in (2), the covariance matrix $\Sigma_{x}$ depends on $x$ , and its rank is in general higher than $d$ when the manifold is of dimension $d$ , yet only its first $d$ eigenvalues are dominant. In the context of MD, this rank difference might lead to unwanted consequences if we consider Definition 3. Below we demonstrate it with an example.

Assume for simplicity that the pdf is uniform, $\mathcal{O}(x)=B_{\epsilon}^{\mathbb{R}^{p}}(\iota(x))$ and $\epsilon$ is sufficiently small. From (8), we observe a clear spectral gap between the first $d$ eigenvalues and the remaining ones. Therefore, we would expect that the behavior of $X_{x}$ in (2) is similar to that of $X$ in (5). However, we need to be careful with the associated precision matrices, and hence, with the two definitions of MD. Suppose that the main interest in estimating the MD is recovering the geodesic distance when the manifold cannot be directly accessed. It has been shown in [24, Theorem 8 (3)] that even if the manifold can be directly accessed, defining the MD between a point $z\in\iota(M)$ and $X_{x}$ by $\tilde{d}^{2}_{\Sigma_{x}}(z,X_{x})=(z-x)^{\top}\Sigma_{x}^{\dagger}(z-x)$ as in (10) might lead to a biased estimate of the geodesic distance. To be more precise, when the rank of $\Sigma_{x}$ is greater than $d$ and $\|x-z\|_{\mathbb{R}^{p}}=t$ is sufficiently small, asymptotically we have

[TABLE]

which means that the MD cannot even recover the basic geodesic distance, since the error is of the same order as the geodesic distance. Broadly, this bias is the result of the interaction of the small eigenvalues related to the curvature, particularly the second fundamental form, and the fact that the vector $z-x$ is not intrinsic to the manifold and might contain a component that is normal to the tangent space at $x$ .

3.3. Motivating example

A motivating example for the centrality of the MD under the manifold model in data analysis was presented in [30], and later elaborated in the empirical intrinsic geometry (EIG) framework presented in [37, 39] with various applications, e.g., [40, 11, 45, 10, 9, 22, 17].

Suppose the manifold $M$ is merely an image of a different, intrinsic, and inaccessible manifold of interest $N$ , that is $M=\phi(N)$ , where $\phi$ is a diffeomorphic map. We may call $N$ the latent space. In applications, $\phi$ could represent the distortion of the data introduced by some measurement equipment. This “distortion model” is often of key importance since it critically affects the intrinsic information in $N$ that we are interested in, even if noise does not exist. Under this model, it is shown in [30, Section 3] that when $M$ and $N$ are both Euclidean spaces, for $x,\bar{x}\in M$ , the following holds:

[TABLE]

where $\theta=\phi^{-1}(x),\bar{\theta}=\phi^{-1}(\bar{x})$ and $C=\nabla\phi|_{\theta}\nabla\phi|_{\theta}^{\top},\bar{C}=\nabla\phi|_{\bar{\theta}}\nabla\phi|_{\bar{\theta}}^{\top}$ . A similar statement for the case when $M$ and $N$ are both manifolds and $\phi$ is a diffeomorphism is given in [24]. It was further shown in [30] and [24] that if we have i.i.d. samples from $M$ adhering to the statistical manifold models discussed above, then

[TABLE]

for a sufficiently small neighborhood $\mathcal{O}(x)$ . Remarkably, by combining (13) and (14), a small variant of the MD can recover the distance between ${\theta}$ and $\bar{\theta}$ from the hidden intrinsic manifold $M_{\theta}$ based on samples from $M$ without explicit information on the map $\phi$ , thereby solving a completely blind inverse problem.
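The following sketch illustrates the idea on a toy distortion. We assume that (13) takes the symmetrized form $\|\theta-\bar{\theta}\|^{2}\approx\frac{1}{2}(x-\bar{x})^{\top}[C^{-1}+\bar{C}^{-1}](x-\bar{x})$ , which is our reading of [30]; the map `phi` and the chosen points are hypothetical, and the exact Jacobians are used in place of covariance estimates.

```python
import numpy as np

def phi(theta):
    """A toy nonlinear, locally invertible distortion of the plane (illustrative)."""
    t1, t2 = theta
    return np.array([t1 + t2**3, t2 - t1**3])

def jacobian(theta):
    """Exact Jacobian of phi at theta."""
    t1, t2 = theta
    return np.array([[1.0, 3.0 * t2**2], [-3.0 * t1**2, 1.0]])

def intrinsic_sq_dist(x, x_bar, C, C_bar):
    """Symmetrized MD-type estimate of the squared latent distance."""
    diff = x - x_bar
    return 0.5 * float(diff @ (np.linalg.inv(C) + np.linalg.inv(C_bar)) @ diff)

theta = np.array([0.5, 0.3])
theta_bar = theta + np.array([0.005, -0.01])
x, x_bar = phi(theta), phi(theta_bar)
J, J_bar = jacobian(theta), jacobian(theta_bar)

estimate = intrinsic_sq_dist(x, x_bar, J @ J.T, J_bar @ J_bar.T)
truth = float(np.sum((theta - theta_bar) ** 2))
euclid = float(np.sum((x - x_bar) ** 2))  # naive distance in the observed space
print(estimate, truth, euclid)
```

In this example the MD-type estimate is close to the latent squared distance, whereas the Euclidean distance between the observations is visibly distorted by $\phi$ .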

The above distortion model was studied in [37] in the context of a nonlinear dynamical system. Denote the dataset or the point cloud as $\mathcal{U}:=\{u_{j}\}_{j=1}^{n}\subset\mathbb{R}^{q}$ , where $u_{j}$ is sampled at the $j$ -th time stamp. The key assumption is that $\mathcal{U}$ comes from observing an inaccessible intrinsic dynamics $\theta(t)\in\mathbb{R}^{p}$ that satisfies the stochastic differential equation

[TABLE]

where $a$ is an unknown drift function and $\omega$ is the standard $d$ -dimensional Brownian motion. We call the subset of $\mathbb{R}^{p}$ that hosts $\theta(t)$ the phase space; for example, an open subset of $\mathbb{R}^{p}$ , or a smooth manifold embedded in $\mathbb{R}^{p}$ .

The observation is modeled by a diffeomorphic function $\Phi:\mathbb{R}^{p}\rightarrow\mathbb{R}^{q}$ so that $u_{j}=\Phi(\theta_{j})$ , where $\theta_{j}$ is sampled from the intrinsic dynamics $\theta(t)$ at the $j$ -th time stamp. We call $\Phi$ the observation transform. Based on (15), and the fact that

[TABLE]

by Itô’s formula, we obtain that

[TABLE]

since $(\frac{1}{2}\Delta\Phi|_{\theta_{t}}+\nabla\Phi|_{\theta_{t}}a(\theta_{t}))\mathrm{d}t$ is the drift. By combining the above facts, it is shown in [30, Section 3] that when $u_{i}$ and $u_{j}$ are sufficiently close and the phase space is flat, we could recover the intrinsic distance between ${\theta}_{i}$ and ${\theta}_{j}$ by

[TABLE]

where ${C}_{i}=\nabla\Phi({\theta}_{i})[\nabla\Phi({\theta}_{i})]^{\top}$ is the covariance matrix associated with the observation process (i.e., the deformed Brownian motion). Furthermore, it is shown in [30, 37] that ${C}_{i}$ can be estimated by the covariance matrix of $\{u_{k}\}_{k=i-L}^{i+L}$ , where $L\in\mathbb{N}$ is chosen by the user. The key relevant fact here is that since the covariance matrix of $\mathrm{d}u_{t}$ is $\nabla\Phi|_{\theta_{t}}\nabla\Phi|_{\theta_{t}}^{\top}$ , the related open set $O(u_{t})$ is an ellipsoid with principal semi-axes described by the non-degenerate eigenvalues and eigenvectors of $\nabla\Phi|_{\theta_{t}}\nabla\Phi|_{\theta_{t}}^{\top}$ .

To conclude, the intrinsic signal model in (15) and the observation model (16) are generic and can describe a broad range of applications. Therefore, the ability to recover the distances between samples of the intrinsic process from observations in a nonparametric and unsupervised manner using an estimate of the Mahalanobis distance is very powerful. Indeed, this model was used for system identification [40], molecular dynamics [11], sleep analysis [45, 22], model reduction [10], speech processing [9], and gene expression data [17].

3.4. Shrinkage Estimators

For any $p$ -by- $p$ matrix $M_{n}$ estimated from $y_{1},\ldots,y_{n}$ , consider the estimator for MD

[TABLE]

using Definition 3. The extension to Definition 2 is straightforward. In order to quantitatively measure the performance of any MD estimator $d_{M_{n}}(z,X)$ , it is useful to introduce a loss function. For any estimator of the form (19), the absolute value of the estimation error with respect to the true value (10) is

[TABLE]

As the test vector $z$ is arbitrary, it is natural to consider the worst case, and define the loss of $M_{n}$ at the (unknown) underlying low-dimensional covariance $\Sigma_{X}$ :

Definition 4.

The worst case loss function of an estimator of the form (19) for (10) is defined as

[TABLE]

where $\|\cdot\|_{\textup{op}}$ is the matrix operator norm.

It is also reasonable to consider the root mean squared estimation error of all possible test vectors. The discussion below follows the same line. To keep the notation light, the dependence of $L_{n}$ on $\mu$ and $\sigma$ as well as the dependence of $M_{n}$ on the sample $y_{1},\ldots,y_{n}$ are implicit.

Consider matrices of the form $M^{\eta}_{n}:=\eta(S_{n})$ , where $\eta:[0,\infty)\to[0,\infty)$ and $S_{n}$ is the sample covariance. We call $M^{\eta}_{n}$ the shrinkage estimator of $\Sigma_{X}^{\dagger}$ with $\eta$ . A typical example is the classical MD estimator, which is a shrinkage estimator with $\eta=\eta^{\textup{classical}}_{\sigma}$ , where

[TABLE]

From [23], in the traditional setup when the dimension $p$ is fixed and $n\rightarrow\infty$ , the classical MD estimator obtains zero loss asymptotically.

Theorem 1.

Let $p$ be fixed independently of $n$ . Then

[TABLE]

Proof.

Since it is well known that $(S_{n}-\sigma^{2}I_{p})\to\Sigma_{X}$ as $n\to\infty$ , substituting $M_{n}$ with $\eta^{\textup{classical}}_{\sigma}(S_{n})$ in (20) and taking the limit as $n\to\infty$ completes the proof. ∎
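A shrinkage estimator $\eta(S_{n})$ and its loss can be sketched generically as follows. We assume that $\eta$ acts on the eigenvalues of $S_{n}$ while the eigenvectors are kept, and that the worst-case loss in Definition 4 is the operator norm of the difference between the estimator and the target precision matrix; the function names are hypothetical.

```python
import numpy as np

def shrinkage_estimator(sample_cov, eta):
    """Apply the scalar function eta to each eigenvalue, keeping the eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(sample_cov)
    return (eigvecs * eta(eigvals)) @ eigvecs.T  # V diag(eta(lambda)) V^T

def operator_norm_loss(estimate, target_precision):
    """Largest singular value of the difference (assumed form of the loss)."""
    return float(np.linalg.norm(estimate - target_precision, ord=2))

rng = np.random.default_rng(0)
G = rng.standard_normal((50, 5))
S = G.T @ G / 50                                  # a sample covariance matrix

# Sanity check: the identity function reproduces S itself.
S_rebuilt = shrinkage_estimator(S, lambda lam: lam)
# Inverting every eigenvalue gives the ordinary inverse when S is full rank.
S_inverse = shrinkage_estimator(S, lambda lam: 1.0 / lam)
loss_self = operator_norm_loss(S_inverse, S_inverse)
print(np.allclose(S_rebuilt, S), loss_self)
```

Any concrete shrinker, classical or optimal, is then obtained by passing the corresponding function as `eta`.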

When $p$ grows with $n$ , such that $p=p_{n}\to\infty$ with $p_{n}/n\to\beta>0$ , the situation is quite different. It is known that in this situation the sample covariance matrix is an inconsistent estimate of the population covariance matrix [15], and Theorem 1 might not hold; that is, the classical MD estimator might not be optimal. The following questions naturally arise when $\beta>0$ :

(1)

Is there an optimal shrinkage (OS) estimator with respect to the loss $L_{n}$ ?

(2)

How does the loss of the optimal shrinkage estimator compare with the loss $L_{n}(\eta^{\textup{classical}}_{\sigma}(S_{n}),\,\Sigma_{X}^{\dagger})$ ?

In the sequel, we attempt to answer these questions.

4. OPTIMAL RECOVERY OF PRECISION MATRIX FOR MAHALANOBIS DISTANCE UNDER THE SPIKED MODEL

We start the derivation of the OS for MD under the linear spiked model, which involves the OS for the precision matrix. Its extension to the manifold model requires only an additional mild condition, which will be discussed in Section 5. Without loss of generality, we set the noise level $\sigma=1$ and will discuss the general case subsequently.

Assumption 1 (Asymptotic( $\beta$ )).

The number of variables $p=p_{n}$ grows with the number of observations $n$ , such that $p/n\to\beta$ as $n\to\infty$ , for $0<\beta\leq 1$ .

Assumption 2 (Spiked model).

Suppose $\Sigma_{X}=\Sigma_{Y}-\sigma^{2}I_{p}$ with the eigendecomposition:

[TABLE]

where $d\geq 0$ , $\Sigma_{d}=\textup{diag}(\ell_{1},\cdots\ell_{d})$ is a $d\times d$ matrix whose diagonal consists of $d$ spikes $\ell_{1}>\cdots>\ell_{d}>0$ , which are fixed and independent of $p$ and $n$ , and the off-diagonal elements are set to zero. For completeness, denote $\ell_{d+1}=\ldots=\ell_{p}=0$ . Note that we assume that all spikes are simple. When $d=0$ , it is the null case.

Denote the eigendecomposition of $S_{n}$ as

[TABLE]

where $\lambda_{1,n}\geq\ldots\geq\lambda_{p,n}\geq 0$ are the empirical eigenvalues and $V_{n}\in O(p)$ is the matrix whose columns are the empirical eigenvectors $v_{i,n}\in\mathbb{R}^{p}$ , $i=1,\ldots,p$ . Under Assumption 1 and Assumption 2, results collected from [25, 2, 3, 28] imply three important facts about the sample covariance matrix $S_{n}$ .

(1)

Eigenvalue spread. Suppose Assumption 1 holds and consider the null case where $\Sigma_{d}=0$ . As $n\to\infty$ , the spread of the empirical eigenvalues $\lambda_{i,n}$ converges to a continuous distribution called the “Marcenko-Pastur” law [25],

[TABLE]

where $\lambda_{+}=(1+\sqrt{\beta})^{2}$ and $\lambda_{-}=(1-\sqrt{\beta})^{2}$ are the limiting bulk edges.

(2)

Top eigenvalue bias. Suppose Assumption 1 and Assumption 2 hold. For $1\leq i\leq d$ , the empirical eigenvalues

[TABLE]

as $n\to\infty$ , where

[TABLE]

is defined on $\alpha\in[0,\infty)$ and $\ell_{+}:=\sqrt{\beta}$ . For $d+1\leq i\leq p$ , since $\ell_{i}=0$ the empirical eigenvalues $\lambda_{i,n}$ follow the Marcenko-Pastur law (24).

(3)

Top eigenvector inconsistency. Suppose Assumption 1 and Assumption 2 hold. Let $c_{i,n}$ and $s_{i,n}$ be the cosine and sine values of the angle between the $i$ -th population eigenvector and the $i$ -th empirical eigenvector after properly adjusting the sign of each empirical eigenvector. Note that there exists a sequence of $R_{n}\in O(p)$ so that $R_{n}V_{n}$ converges almost surely (a.s.) to $V\in O(p)$ . In the following we assume that the empirical eigenvectors have been properly rotated. It is known that when $n\to\infty$ , $c_{i,n}\xrightarrow{a.s.}c(\ell_{i})$ and $s_{i,n}\xrightarrow{a.s.}s(\ell_{i})$ , where

[TABLE]

and

[TABLE]

are defined on $\alpha\in[0,\infty)$ .
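The three facts can be observed in a small simulation with a single spike and $\sigma=1$ . The sketch assumes the standard spiked-model expressions $\lambda(\alpha)=(1+\alpha)(1+\beta/\alpha)$ and $c(\alpha)^{2}=(1-\beta/\alpha^{2})/(1+\beta/\alpha)$ for $\alpha>\sqrt{\beta}$ , which we take to be the content of (25) and (26); the sample sizes and the spike value are illustrative.

```python
import numpy as np

def biased_eigenvalue(ell, beta):
    """Assumed limit of the empirical eigenvalue for a spike ell (sigma = 1)."""
    if ell > np.sqrt(beta):
        return (1.0 + ell) * (1.0 + beta / ell)
    return (1.0 + np.sqrt(beta)) ** 2  # weak spikes stick to the bulk edge

def limiting_cosine(ell, beta):
    """Assumed limiting cosine between population and empirical eigenvectors."""
    if ell > np.sqrt(beta):
        return np.sqrt((1.0 - beta / ell**2) / (1.0 + beta / ell))
    return 0.0

rng = np.random.default_rng(1)
n, p, ell = 2000, 800, 3.0
beta = p / n

# Signal with variance ell along e_1 plus unit-variance noise in every coordinate.
Y = rng.standard_normal((n, p))
Y[:, 0] *= np.sqrt(1.0 + ell)
S = Y.T @ Y / n                               # sample covariance (known zero mean)
eigvals, eigvecs = np.linalg.eigh(S)          # ascending order

top_eigenvalue = eigvals[-1]
second_eigenvalue = eigvals[-2]               # should sit near the bulk edge
empirical_cosine = abs(eigvecs[0, -1])        # |<e_1, top empirical eigenvector>|
bulk_edge = (1.0 + np.sqrt(beta)) ** 2
print(top_eigenvalue, biased_eigenvalue(ell, beta))
print(empirical_cosine, limiting_cosine(ell, beta))
print(second_eigenvalue, bulk_edge)
```

The top empirical eigenvalue lands near $\lambda(\ell)$ rather than near $\ell+1$ , and the top empirical eigenvector is visibly tilted away from $e_{1}$ , which is the bias and inconsistency described above.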

The above three properties imply that the classical estimator $\eta_{\sigma}^{\textup{classical}}(S_{n})$ may not be the best estimator in general, and for the purpose of estimating MD in particular. Inspired by [7], we may “correct” the bias of the eigenvalues to improve the estimation.

Definition 5.

The asymptotic loss function is defined as

[TABLE]

assuming the limit exists.

To find a shrinkage estimator $\eta$ that minimizes $L_{\infty}(\eta|\ell_{1},\ldots,\ell_{d})$ , it is natural to construct the estimator by recovering the spikes $\ell_{i}$ using the biased eigenvalues $\lambda_{i}$ . From the inversion of (25), recalling that $\ell_{+}=\sqrt{\beta}$ , we can define

[TABLE]

when $\alpha>\lambda_{+}$ , and consider the shrinkage function

[TABLE]

Note that since taking the inverse of a matrix is nonlinear, it is not clear if the above naive idea will lead to the OS estimator of the precision matrix for the MD estimate. Nevertheless, it is reasonable to expect the existence of an optimal shrinkage function $\eta^{*}$ satisfying

[TABLE]

for any spikes $\ell_{1},\ldots,\ell_{d}$ . Below we show that this naive idea, $\eta^{\textup{inv}}$ , is in fact the OS estimator for our purpose.
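A minimal sketch of $\eta^{\textup{inv}}$ for $\sigma=1$ follows. It assumes that (25) has the standard form $\lambda(\alpha)=(1+\alpha)(1+\beta/\alpha)$ , so that the inversion (29) is the larger root of a quadratic, and that (30) returns $1/\ell(\alpha)$ above the bulk edge and zero otherwise; the function names are hypothetical.

```python
import numpy as np

def spike_from_eigenvalue(lam, beta):
    """Invert lam = (1 + ell)(1 + beta / ell) for ell > sqrt(beta); sigma = 1."""
    t = lam - 1.0 - beta
    return 0.5 * (t + np.sqrt(t**2 - 4.0 * beta))  # larger root of the quadratic

def eta_inv(lam, beta):
    """1 / ell(lam) above the bulk edge lambda_+, zero on and below it."""
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    bulk_edge = (1.0 + np.sqrt(beta)) ** 2
    out = np.zeros_like(lam)
    above = lam > bulk_edge
    out[above] = 1.0 / spike_from_eigenvalue(lam[above], beta)
    return out

beta = 0.5
ell = 2.0
lam = (1.0 + ell) * (1.0 + beta / ell)   # biased eigenvalue of the spike
recovered = spike_from_eigenvalue(lam, beta)
shrunk = eta_inv([lam, 1.0], beta)       # one spike eigenvalue, one bulk eigenvalue
print(recovered, shrunk)
```

The round trip recovers the spike from its biased eigenvalue, and any eigenvalue inside the bulk is mapped to zero.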

4.1. Derivation of the Optimal Shrinker when $\sigma=1$

Definition 6.

A function $\eta:[0,\infty)\to[0,\infty)$ is called a shrinker if it is continuous when $\lambda>\lambda_{+}$ , and $\eta(\lambda)=0$ when $0\leq\lambda\leq\lambda_{+}$ .

Note that this shrinker is a bulk shrinker considered in [7, Definition 3]. Based on the assumption of a shrinker $\eta$ , the associated shrinkage estimator converges almost surely, that is

[TABLE]

where the right hand side is the eigendecomposition of $M^{\eta}$ . Thus, the sequence of loss functions also almost surely converges as

[TABLE]

As a result, the limit in (28) exists when $\eta$ is a shrinker, and we have the following theorem, which in turn gives rise to the optimal shrinker. Note that while the bias of the eigenvalues can be corrected through the quadratic relationship between the biased eigenvalues and the population eigenvalues (25), the true (population) eigenvectors and the empirical eigenvectors are not collinear [7], and so far we do not have a way to correct the biased eigenvectors. The main idea behind the proof is to respect this fact: when we look for the optimal way to correct the eigenvalues, the biased eigenvectors should be taken into account. With the control of this eigenvector bias, the OS will be derived.

Theorem 2 (Characterization of the asymptotic loss).

Suppose $\sigma=1$ . Consider the spiked covariance model satisfying Assumption 1 and Assumption 2 and a shrinkage function $\eta:[0,\infty)\to[0,\infty)$ . We have a.s.

[TABLE]

where $\Delta:[0,\infty)\times[0,\infty)\to[0,\infty)$ is given by

[TABLE]

where

[TABLE]

Proof.

Based on the property of “simultaneous block-diagonalization” for $\Sigma_{X}^{\dagger}$ and $M^{\eta}_{n}$ in [7, Section 2], the properties of “orthogonal invariance” and “max-decomposability” for the operator norm in [7, Section 3], and the convergence of $c_{i,n}$ and $s_{i,n}$ in (26) and (32), we have

[TABLE]

where

[TABLE]

when $\ell_{i}\neq 0$ and $A_{i}=0_{2\times 2}$ otherwise, and

[TABLE]

When $n\to\infty$ , the loss converges a.s. to $\max_{i}\|A_{i}-B_{i}\|_{\textup{op}}$ , where

[TABLE]

Now we evaluate $\|A_{i}-B_{i}\|_{\textup{op}}$ for different $\ell_{i}$ .

When $\ell_{i}>\ell_{+}$ , denote the eigenvalues of $A_{i}-B_{i}$ as $u_{+}(\ell_{i},\eta(\lambda_{i}))$ and $u_{-}(\ell_{i},\eta(\lambda_{i}))$ . If $\eta(\lambda_{i})>1/\ell_{i}$ we have $0\leq u_{+}(\ell_{i},\eta(\lambda_{i}))\leq-u_{-}(\ell_{i},\eta(\lambda_{i}))$ , and hence $\|A_{i}-B_{i}\|_{\textup{op}}=-u_{-}(\ell_{i},\eta(\lambda_{i}))$ ; otherwise, we have $u_{+}(\ell_{i},\eta(\lambda_{i}))\geq-u_{-}(\ell_{i},\eta(\lambda_{i}))\geq 0$ , and hence $\|A_{i}-B_{i}\|_{\textup{op}}=u_{+}(\ell_{i},\eta(\lambda_{i}))$ . For $0<\ell_{i}\leq\ell_{+}$ , since $c(\ell_{i})=0$ , we have

[TABLE]

which equals $0_{2\times 2}$ since $\eta(\lambda_{i})=0$ by the definition of shrinkage function. Thus, $\|A_{i}-B_{i}\|_{\textup{op}}=1/\ell_{i}$ . Finally, for $\ell_{i}=0$ , $A_{i}$ is a $2\times 2$ zero matrix, and thus $\|A_{i}-B_{i}\|_{\textup{op}}=\eta(\lambda_{i})=0$ . This concludes the proof. ∎

Figure 1 illustrates the obtained asymptotic loss for several $\beta=p/n$ values as a function of the spike strength in a single spike model. It is clear that for each $\beta$ , there is a transition at $\ell_{+}=\sqrt{\beta}$ . An immediate consequence of Theorem 2 is that $\eta^{\textup{inv}}$ is an optimal shrinker.

Corollary 1.

Suppose $\sigma=1$ and Assumption 1 and Assumption 2 hold. Define the asymptotically optimal shrinkage function as

[TABLE]

where argmin is evaluated on the set of all possible shrinkage functions. Then, $\eta^{*}$ is unique and equals $\eta^{\textup{inv}}$ given in (30). Moreover, its associated loss is

[TABLE]

where

[TABLE]

Note that this result coincides with the findings reported in [7]. Precisely, it is shown in [7, (1.12)] that for the operator norm, $\ell(\alpha)$ (29) is the optimal shrinkage for the covariance matrix and precision matrix. In this corollary, we show that for the Mahalanobis distance, which is related to the precision matrix, the optimal estimator is also achieved by the optimal shrinkage, taking $\ell(\alpha)$ into account.

Proof.

Based on Theorem 2, the optimal shrinker $\eta^{*}$ leads to $\underset{\eta\geq 0}{\min}\underset{i=1,\ldots,d}{\max}\{\Delta(\ell_{i},\eta(\lambda_{i}))\}$ . Note that for $j=\underset{i=1,\ldots,d}{\arg\max}\{\Delta(\ell_{i},\eta(\lambda_{i}))\}$ , the optimal shrinker achieves $\underset{\eta\geq 0}{\min}\{\Delta(\ell_{j},\eta(\lambda_{j}))\}$ . Thus, by the same argument in [7], if we could solve $\underset{\eta\geq 0}{\arg\min}\{\Delta(\alpha,\eta(\lambda(\alpha)))\}$ for any $\alpha>0$ , we find the optimal shrinker. To simplify the notation, we abbreviate $\eta(\lambda(\alpha))$ by $\eta$ .

For $\alpha>\ell_{+}$ and $\eta>\frac{1}{\alpha}$ , we have $\Delta(\alpha,\eta)=-u_{-}(\alpha,\eta)$ . By a direct calculation, we get

[TABLE]

For $\alpha>\ell_{+}$ and $0\leq\eta\leq\frac{1}{\alpha}$ , we have $\Delta(\alpha,\eta)=u_{+}(\alpha,\eta)$ , and similarly by taking the derivative of (35) we have

[TABLE]

As a result, the loss function is decreasing in $\eta$ when $0\leq\eta\leq 1/\alpha$ and increasing when $\eta>1/\alpha$ ; its partial derivative has a discontinuity at $\eta=1/\alpha$ , while the loss function itself is continuous. Thus, the loss function reaches its minimum when $\eta=1/\alpha$ . These facts imply that $\eta^{*}(\lambda_{i})=1/\ell(\lambda_{i})$ when $\lambda_{i}>\lambda_{+}$ . Furthermore, by substituting $\eta$ with $\eta^{*}$ in (35) or (36), we get $\Delta(\alpha)=s(\alpha)/\alpha$ . By definition, $\eta^{*}=0$ when $0\leq\alpha\leq\ell_{+}$ . Thus, for $0<\alpha\leq\ell_{+}$ , $\Delta(\alpha)=1/\alpha$ , and for $\ell=0$ , $\Delta(\ell)=0$ . Finally, it is clear that $\eta^{*}$ is continuous when $\alpha>\lambda_{+}$ , and $\eta^{*}(\alpha)=0$ when $0\leq\alpha\leq\lambda_{+}$ . We thus conclude that $\eta^{*}$ is the optimal shrinker. ∎
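The minimization in the proof can be checked numerically for a single spike. The sketch assumes that, in the basis formed by the population eigenvector and its orthogonal complement within the relevant plane, $A_{i}=\textup{diag}(1/\ell_{i},0)$ and $B_{i}=\eta\,[c,s]^{\top}[c,s]$ with $c=c(\ell_{i})$ and $s=s(\ell_{i})$ ; this is our reading of the $2\times 2$ blocks, and the values of $\beta$ and $\ell$ are illustrative.

```python
import numpy as np

def block_loss(ell, eta, beta):
    """Operator norm of A - B for one spike ell > sqrt(beta), with sigma = 1."""
    c2 = (1.0 - beta / ell**2) / (1.0 + beta / ell)  # assumed squared cosine
    c, s = np.sqrt(c2), np.sqrt(1.0 - c2)
    A = np.array([[1.0 / ell, 0.0], [0.0, 0.0]])             # population block
    B = eta * np.array([[c * c, c * s], [c * s, s * s]])     # shrunken empirical block
    return float(np.linalg.norm(A - B, ord=2))

beta, ell = 0.5, 2.0
etas = np.linspace(0.0, 2.0, 4001)               # grid that contains 1 / ell
losses = np.array([block_loss(ell, e, beta) for e in etas])
best_eta = etas[np.argmin(losses)]
min_loss = losses.min()

sine = np.sqrt(1.0 - (1.0 - beta / ell**2) / (1.0 + beta / ell))
print(best_eta, 1.0 / ell)
print(min_loss, sine / ell)
```

On this grid the minimizer coincides with $1/\ell$ and the minimal value with $s(\ell)/\ell$ , in agreement with the corollary.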

We compare our result with another naive approach; that is, obtaining the covariance by the optimal shrinkage with respect to the operator norm, and then taking the Moore-Penrose pseudo-inverse. Let $\eta^{cov}$ denote the optimal shrinker for the covariance matrix recovery obtained from [7] with respect to the operator norm loss function. With the same notation, that is, $\Sigma_{Y}=\Sigma_{X}+\sigma^{2}I_{p}$ and $\sigma=1$ as in (2), the optimal shrinker satisfies

[TABLE]

where $\ell$ is defined in (29). Note that in [7], the authors aimed to recover $\Sigma_{Y}$ , while we recover $\Sigma_{X}^{\dagger}$ . In other words, the authors in [7, (1.12)] showed that when the operator norm is considered, the OS estimator is the same as correcting the eigenvalues according to the relationship (25). Thus, our result $\eta^{*}$ coincides with the inverse of $\eta^{cov}-1$ when $\alpha>\lambda_{+}$ .

We note the following interesting phenomenon stemming from Theorem 2 and Corollary 1. If there exists a nontrivial spike $\ell_{i}>0$ that is weak enough so that $\ell_{i}$ is sufficiently small compared with $\ell_{+}$ , then $L_{\infty}(\eta|\ell_{1},\ldots,\ell_{d})$ is dominated by $1/\ell_{i}$ . Consequently, in this large $p$ large $n$ regime, we cannot “rescue” this spike, and the associated signal is lost in the noise, as can be seen in Corollary 1.

Figure 2 illustrates the obtained optimal shrinker with the classical shrinker overlay, for $\beta=p/n=1$ and $\sigma=1$ . Clearly, compared with the classical shrinker, the obtained optimal shrinker truncates the eigenvalues more aggressively.

4.2. Derivation of the Optimal Shrinker when $\sigma\neq 1$

To handle the general case when $\sigma\neq 1$ , we first rescale the data and model by setting $\ell^{\prime}_{i}:=\ell_{i}/\sigma^{2}$ and $\lambda^{\prime}_{i,n}:=\lambda_{i,n}/\sigma^{2}$ , and consider the following shrinker defined on $[0,\infty)$ :

[TABLE]

Note that since $\eta$ plays the role of estimating the precision matrix, we re-normalize it by dividing $\eta(\alpha/\sigma^{2})$ by $\sigma^{2}$ . The shrinkage estimator for $\Sigma_{X}^{\dagger}$ becomes $M^{\eta_{\sigma}}_{n}:=\eta_{\sigma}(S_{n})$ , the general optimal shrinker becomes

[TABLE]

and the associated loss is

[TABLE]
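The rescaling can be sketched as follows: eigenvalues are divided by $\sigma^{2}$ , the unit-noise shrinker is applied, and the result is divided by $\sigma^{2}$ again. The unit-noise shrinker below assumes the standard inversion of $\lambda=(1+\ell)(1+\beta/\ell)$ ; the function names and the numerical values are hypothetical.

```python
import numpy as np

def eta_unit(lam, beta):
    """Assumed optimal shrinker for sigma = 1: 1 / ell(lam) above the bulk edge."""
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    out = np.zeros_like(lam)
    above = lam > (1.0 + np.sqrt(beta)) ** 2
    t = lam[above] - 1.0 - beta
    out[above] = 2.0 / (t + np.sqrt(t**2 - 4.0 * beta))
    return out

def eta_general(lam, beta, sigma):
    """Shrinker for a general noise level: eta(lam / sigma^2) / sigma^2."""
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    return eta_unit(lam / sigma**2, beta) / sigma**2

beta, sigma = 0.5, 3.0
# A rescaled eigenvalue of 3.75 corresponds to a rescaled spike of 2,
# i.e. to a spike of sigma^2 * 2 = 18 in the original units.
lam_spike = sigma**2 * 3.75
lam_bulk = sigma**2 * 1.0
values = eta_general([lam_spike, lam_bulk], beta, sigma)
print(values)
```

The spike eigenvalue is mapped to the reciprocal of the spike in the original units, and eigenvalues below $\sigma^{2}\lambda_{+}$ are set to zero.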

5. EXTENSION TO THE MANIFOLD MODEL

We now come back to the manifold learning problem with the manifold setup described in Section 2.1. We argue that despite the challenge mentioned in Section 3.1, the theorem developed in Section 4.1 can be extended to study the manifold model in the large $p$ large $n$ setup with proper modifications. Note that for each $n$ , there exists an orthonormal basis $\{u_{n,l}\}_{l=1}^{p}$ of $\mathbb{R}^{p}$ so that the $d$ -dimensional compact smooth manifold $M$ is isometrically embedded in the subspace spanned by $\{u_{n,l}\}_{l=1}^{K}$ for a fixed $K\in\mathbb{N}$ . In other words, while the rank of $\Sigma_{x}$ associated with $\mathcal{O}(x)$ depends on $x$ , it is bounded uniformly from above by $K$ . Note that $K$ in general can be much larger than $d$ , yet it is fixed due to the well-known Nash isometric embedding theorem [27], which guarantees the existence of $K$ that is independent of $p$ and satisfies $K\leq d(3d+11)/2$ . We make the following assumption.

Assumption 3.

Assume $\mathcal{O}(x)=\phi(B_{\epsilon}(0))$ , where $B_{\epsilon}(0)$ is a Euclidean ball centered at $0$ with radius $\epsilon>0$ , and $\phi:\mathbb{R}^{d}\to M$ is diffeomorphic on $B_{\epsilon}(0)$ . We call $\mathcal{O}(x)$ an ellipsoid.

Note that since $\phi$ is diffeomorphic, in general $\mathcal{O}(x)$ is not really an ellipsoid unless $\phi$ is linear. But to simplify the terminology, we abuse the notation and still call it an ellipsoid. Also note that the elliptic radii of $\mathcal{O}(x)$ are of order $\epsilon>0$ with the implied constant depending on the Jacobian of $\phi$ when $\epsilon$ is sufficiently small.

Based on the theorem developed in Sections 4.1 and 4.2, we state the following theorem that secures the recovery of MD under the manifold model in Definition 2. The basic idea behind this theorem is twofold. First, it relies on the result from [24, Lemma 3 and Lemma 6] stating that when $\epsilon>0$ is sufficiently small, there is a sufficiently large gap between the first $d$ eigenvalues of $\Sigma_{x}$ and the remaining small eigenvalues. Second, as discussed in Section 4.1, any nontrivial eigenvalue $\ell_{i}$ that satisfies $\ell_{i}\ll\ell_{+}$ is ignored by the optimal shrinkage.

Theorem 3.

Assume Assumptions 1-3 hold. Fix $x\in M$ . Suppose $\sigma=\sigma(\epsilon)$ so that $\sigma^{2}\sqrt{\beta}\epsilon^{-d-2}\to 0$ and $\sigma^{2}\sqrt{\beta}\epsilon^{-d-4}\to\infty$ when $\epsilon\to 0$ . Assume the maximal elliptic radius of $\mathcal{O}(x)$ is $m\epsilon$ , where $m=m(\epsilon)\asymp 1$ , and the ratio of the maximal and minimal elliptic radii is fixed for all $\epsilon>0$ . When $\epsilon$ is sufficiently small, all nontrivial eigenvalues of $\Sigma_{x}$ except the top $d$ eigenvalues are set to zero by the optimal shrinkage $\tilde{\eta}^{*}$ .

Proof.

By [24, Lemma 6], the local covariance matrix has the following asymptotic expansion when $\epsilon$ is sufficiently small:

[TABLE]

where $|S^{d-1}|$ is the volume of $S^{d-1}$ . A derivation, similar to the derivation in [24, Lemma 3], yields that when $\epsilon$ is sufficiently small, the top $d$ eigenvalues of $\Sigma_{x}$ are of order $\epsilon^{d+2}$ , and the remaining eigenvalues are of order equal to or higher than $\epsilon^{d+4}$ . In other words, the “signal strength” is of order $\epsilon^{d+2}$ while the noise strength is $\sigma>0$ . Note that there are at most $K$ non-zero eigenvalues.

By (42), all eigenvalues smaller than $\sigma^{2}\ell_{+}=\sigma^{2}\sqrt{\beta}$ are eliminated by the optimal shrinkage. Combining the above, since $\beta>0$ , $P(x)$ and $m$ are all fixed, when $\epsilon$ is sufficiently small so that $\sigma^{2}\sqrt{\beta}\epsilon^{-d-2}$ is sufficiently small and $\sigma^{2}\sqrt{\beta}\epsilon^{-d-4}$ is sufficiently large, all nontrivial eigenvalues of $\Sigma_{x}$ except the top $d$ eigenvalues are set to zero by the optimal shrinkage $\tilde{\eta}^{*}$ . Indeed, by the assumption that $\sigma^{2}\sqrt{\beta}\epsilon^{-d-2}\to 0$ and $\sigma^{2}\sqrt{\beta}\epsilon^{-d-4}\to\infty$ , we know that $c_{1}\epsilon^{d+2}\geq\sigma^{2}\sqrt{\beta}$ and $c_{2}\epsilon^{d+4}\leq\sigma^{2}\sqrt{\beta}$ when $\epsilon$ is sufficiently small for some universal constants $c_{1}$ and $c_{2}$ . Therefore, $\lambda_{l}\geq\sigma^{2}\sqrt{\beta}$ for $l=1,\ldots,d$ and $\lambda_{l}\leq\sigma^{2}\sqrt{\beta}$ for $l=d+1,\ldots,K$ , and hence we conclude the proof. ∎
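As a worked example of a noise level satisfying both conditions of the theorem, one admissible choice (ours, for illustration only) is to let $\sigma^{2}$ scale like $\epsilon^{d+3}$ , which lies strictly between the signal order $\epsilon^{d+2}$ and the curvature order $\epsilon^{d+4}$ :

```latex
\sigma^{2} = \frac{\epsilon^{d+3}}{\sqrt{\beta}}
\quad\Longrightarrow\quad
\sigma^{2}\sqrt{\beta}\,\epsilon^{-d-2} = \epsilon \to 0
\quad\text{and}\quad
\sigma^{2}\sqrt{\beta}\,\epsilon^{-d-4} = \epsilon^{-1} \to \infty
\quad\text{as } \epsilon \to 0 .
```

With this choice, the threshold $\sigma^{2}\sqrt{\beta}=\epsilon^{d+3}$ eventually falls below the top $d$ eigenvalues, which are of order $\epsilon^{d+2}$ , and above the remaining ones, which are of order $\epsilon^{d+4}$ or higher.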

We conclude this section with several remarks on this theorem. First, the condition $\sigma^{2}\sqrt{\beta}\epsilon^{-d-2}\to 0$ seems restrictive, since the noise level goes to zero when $\epsilon$ goes to zero. This condition is needed if we want to properly estimate the precision matrix locally within an ellipsoid over a manifold with elliptic radii of order $\epsilon$ . In practice, $\epsilon$ plays the role of a “bandwidth”, which reflects the “resolution”, that is, how accurately we can estimate the quantity of interest. For example, in the motivating EIG example discussed in Section 3.3, we need an accurate precision matrix estimate so that the geodesic distance in the latent space can be accurately estimated. Such a precision matrix via (17) is less affected by the curvature if $\epsilon$ is small. However, if the noise level $\sigma$ is fixed, then $\epsilon$ is bounded from below, so that we have limited accuracy when we recover the geodesic distance on the latent space, and hence the latent space itself by DM.

Second, this theorem also suggests that some noise is always needed to attain a reasonable estimate of MD; that is, $\sigma^{2}\sqrt{\beta}\epsilon^{-d-4}\to\infty$ . This statement is certainly counterintuitive, since in the linear spiked model the absence of noise is desired and beneficial. To clarify this point, note that in general the dimension of the manifold is unknown, but it is needed in order to define a sensible MD on the manifold in Definition 2. See the discussion in Section 3.2. Thus, a dimension estimate is needed. This theorem mainly says that when $\sigma>0$ is sufficiently large so that the condition in Theorem 3 holds, we can obtain the MD in Definition 2 even if we do not know the dimension. If the noise is not sufficiently large and we do not know the dimension, then we cannot obtain the desired MD. If the dimension is known, or can be robustly estimated in the presence of noise, then this lower bound condition can be removed. The above facts stem from the complicated nonlinear interaction between the nonlinear manifold structure and the high-dimensional noise. Finally, we mention that this result extends the EIG study in [24] under the manifold model when noise exists.

6. SIMULATION STUDY

To numerically compare the optimal shrinker and the classical shrinker, we set $\beta=0.2,0.4,\ldots,1$ and fix the number of samples at $n=300$ , so that $p=\beta n$ . For simplicity, we set $\ell_{i}=i$ for $i=1,\ldots,d$ . We consider $d\in\{1,4\}$ . Suppose $x_{i}$ , $i=1,\ldots,n$ , are sampled i.i.d. from the random vector $\sum_{\ell=1}^{d}\zeta_{\ell}e_{\ell}$ , where $e_{\ell}\in\mathbb{R}^{p}$ is the unit vector with $\ell$ -th entry $1$ , $\zeta_{\ell}\sim N(0,1)$ for $\ell=1,\ldots,d$ , and $\zeta_{\ell}$ is independent of $\zeta_{k}$ when $\ell\neq k$ . The noisy data are simulated by $y_{i}=Ax_{i}+\sigma^{2}\xi$ , where $\xi\sim\mathcal{N}\left(0,I_{p}\right)$ is the noise, $\xi$ is independent of $\zeta_{\ell}$ , and $A\in O(p)$ is sampled randomly from $O(p)$ . In the simulation, we take $\sigma=0.225,0.45,\ldots,1.8$ . For each $\sigma$ , we repeat the experiment $200$ times and report the mean and variance of the loss $L_{n}$ .
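The data-generation step described above can be sketched as follows. This is a minimal illustration and not the code used for the reported experiments. The function names, the QR-based sampling of $A$, and the explicit `noise_scale` argument are assumptions: the text writes the noise term with the multiplier $\sigma^{2}$, and the sketch leaves the multiplier to the caller rather than committing to a convention. How the spike strengths $\ell_{i}$ enter is not spelled out in this passage, so they are left out.

```python
import numpy as np

def haar_orthogonal(p, rng):
    """Draw a p x p orthogonal matrix from the Haar measure on O(p) via QR."""
    g = rng.standard_normal((p, p))
    q, r = np.linalg.qr(g)
    # Fixing the signs of the columns makes the QR-based draw Haar distributed.
    return q * np.sign(np.diag(r))

def simulate_spiked(n, beta, d, noise_scale, rng):
    """Return (y, a): n noisy samples in R^p with p = beta * n, and the rotation a."""
    p = int(round(beta * n))
    x = np.zeros((n, p))
    x[:, :d] = rng.standard_normal((n, d))   # zeta_1, ..., zeta_d, independent N(0, 1)
    a = haar_orthogonal(p, rng)              # random element of O(p)
    xi = rng.standard_normal((n, p))         # one noise vector per sample (assumption)
    y = x @ a.T + noise_scale * xi           # row i is A x_i plus scaled noise
    return y, a

rng = np.random.default_rng(0)
y, a = simulate_spiked(n=300, beta=0.2, d=4, noise_scale=0.45, rng=rng)
print(y.shape)
```

Each row of `y` is one observation; the sample covariance can then be formed from `y` and passed to whichever shrinker is being compared.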

Figure 3 shows the loss of the optimal and classical shrinkers when $d=1$ and $d=4$ in a logarithmic scale. We observe that the loss using the classical shrinker is significantly larger. This stems from the fact that in the large $p$ and large $n$ regime, there are eigenvalues greater than $\sigma^{2}$ that are not associated with the signal. When applying the classical shrinker (21) (the Moore-Penrose pseudo-inverse), these irrelevant eigenvalues contribute significantly, leading to high loss. Conversely, the optimal shrinker is much more ‘selective’ (as illustrated in Figure 2), associating larger eigenvalues with the noise, thereby increasing the robustness of the estimator.
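The claim that noise alone produces sample eigenvalues above $\sigma^{2}$ can be checked numerically. The sketch below draws pure noise with variance $\sigma^{2}$ per coordinate, forms the sample covariance with a $1/n$ normalization (an assumption), and compares the largest eigenvalue with the Marchenko-Pastur bulk edge $\sigma^{2}(1+\sqrt{\beta})^{2}$. The function name and parameter values are illustrative only.

```python
import numpy as np

def noise_only_spectrum(n, beta, sigma, rng):
    """Eigenvalues of the sample covariance of n pure-noise samples in R^p, p = beta * n."""
    p = int(round(beta * n))
    xi = sigma * rng.standard_normal((n, p))
    sample_cov = xi.T @ xi / n               # 1/n normalization (assumption)
    return np.linalg.eigvalsh(sample_cov)

rng = np.random.default_rng(0)
n, beta, sigma = 300, 0.5, 1.0
eigs = noise_only_spectrum(n, beta, sigma, rng)

bulk_edge = sigma**2 * (1 + np.sqrt(beta))**2   # Marchenko-Pastur upper edge
num_above = int(np.sum(eigs > sigma**2))        # noise eigenvalues exceeding sigma^2

print("largest noise eigenvalue:", eigs.max())
print("bulk edge:", bulk_edge)
print("eigenvalues above sigma^2:", num_above, "of", eigs.size)
```

None of these eigenvalues carries signal, yet a pseudo-inverse treats each of them as informative, which is consistent with the large loss of the classical shrinker reported above.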

Recall that our main motivation for considering MD in the high-dimensional regime $p\asymp n$ comes from manifold learning. Next, we test the performance of the proposed OS algorithm on high-dimensional data lying on a lower-dimensional manifold. Consider the following model

[TABLE]

where $X$ is sampled from a curved manifold with one chart $\mathcal{M}$ embedded in $\mathbb{R}^{p}$ that is parametrized by

[TABLE]

$s,t\in[-5,5]$ , $\xi\sim\mathcal{N}\left(0,I_{p}\right)\in\mathbb{R}^{p}$ , and $\sigma>0$ is the noise level. Note that since the precision matrix estimation and MD estimation are both local, we consider this one-chart manifold without loss of generality. Suppose the ambient dimension is fixed at $p=100$ . For various values of $\beta>0$ , we sample $n=\lceil p/\beta\rceil$ pairs $(s,t)$ uniformly from $[-5,5]\times[-5,5]$ and generate $n$ nonuniform points on $\mathcal{M}$ by (45). Then, these points are corrupted by additive noise with various levels $\sigma$ according to (44). The normalized loss of MD is computed by

[TABLE]

where $y\in\mathcal{M}\subset\mathbb{R}^{p}$ is an arbitrary point on the manifold and $M_{n}$ is the estimated covariance matrix. We examine two reference points: $y_{1}=[0,\ldots,0]\in\mathcal{M}$ and $y_{2}=[2,2,4,0,\ldots,0]\in\mathcal{M}$ . For each case, the simulation was repeated $500$ times, and the mean and standard deviation of the errors are reported. In Table 1, we compare the performance of the OS estimator, $\tilde{\eta}^{*}(S_{n})$ , where $S_{n}$ is the sample covariance, to the performance of the classical estimator $\eta^{\text{classical}}_{\sigma}(S_{n})$ . We observe that the OS outperforms the classical estimator in this well-controlled manifold setup, and that the performance of both degrades as the noise level increases. Moreover, the higher the dimension, that is, the larger $\beta$ is, the worse the performance.

7. APPLICATION TO DYNAMICAL SYSTEM ANALYSIS

Next, we apply our OS estimator to the multiscale reduction problem studied in [10]. We consider the following two-dimensional stochastic differential equation (SDE)

[TABLE]

where $W_{1}$ and $W_{2}$ are independent standard Brownian motions and $\epsilon>0$ is a small constant quantifying the scale of $x_{2}$ . This SDE defines a dynamical system with two time scales, where $x_{1}$ is a slow variable and $x_{2}$ is a fast variable. Suppose the state of the system $(x_{1}(t),\,x_{2}(t))$ is hidden and we have access to it through an embedding in a high-dimensional space via the map

[TABLE]

In addition, suppose the high-dimensional observation $\mathbf{y}(t)$ is contaminated by noise:

[TABLE]

where

[TABLE]

$\Omega_{i}(t)$ are independent standard Brownian motions, and $\sigma$ is the noise level.

In [10], diffusion maps [5] with a Gaussian kernel based on the MD [30] were applied to effectively reduce the dimensionality of the system by attenuating the fast variable and recovering the slow variable. Specifically, the following kernel matrix $\mathbf{W}\in\mathbb{R}^{N\times N}$ was used:

[TABLE]

where $\|\cdot\|_{M}$ is defined by

[TABLE]

where $\mathbf{C}(\mathbf{y}(t))$ is the local population covariance of the observed stochastic process at the point $\mathbf{y}(t)$ , and $\sigma_{kernel}$ is the selected kernel scale.

We generate $N$ samples of the noisy observation $\mathbf{z}(t_{i})$ , where $t_{i}:=idt$ and $i=1,\ldots,N$ . At each sampling point $t_{i}$ , we generate a trajectory consisting of $q$ samples of noisy observations with sampling time interval $\delta t$ starting at $\mathbf{z}(t_{i})$ , such that we have $q$ samples: $\mathbf{z}(t^{\prime}_{1}),\mathbf{z}(t^{\prime}_{2}),...,\mathbf{z}(t^{\prime}_{q})$ drawn approximately from the local distribution at $\mathbf{z}(t_{i})$ . Here, we set $\delta t=o(\epsilon)$ , where $\epsilon>0$ is from (46). Let $\mathbf{\bar{z}}$ denote the local sample mean of $\mathbf{z}(t^{\prime}_{1}),\ldots,\mathbf{z}(t^{\prime}_{q})$ . The local sample covariance is given by

[TABLE]
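A minimal sketch of the local sample covariance computed from one burst of $q$ noisy samples is given below. The function name is hypothetical, and the normalization is an assumption exposed through `ddof` (`ddof=0` divides by $q$, `ddof=1` by $q-1$), since the displayed formula is not reproduced here.

```python
import numpy as np

def local_sample_cov(burst, ddof=0):
    """Sample covariance of a burst of shape (q, p): q samples of dimension p."""
    burst = np.asarray(burst, dtype=float)
    centered = burst - burst.mean(axis=0)    # subtract the local sample mean z-bar
    return centered.T @ centered / (burst.shape[0] - ddof)

rng = np.random.default_rng(0)
burst = rng.standard_normal((50, 50))        # q = 50 samples in dimension p = 50
c_hat = local_sample_cov(burst)
print(np.linalg.matrix_rank(c_hat))
```

Because the mean is subtracted, the rank of this matrix is at most $q-1$; with $q=p=50$, as in the cases considered below, it is therefore singular and cannot be inverted directly.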

For the estimation of the local precision matrix at $\mathbf{y}(t_{i})$ , that is, $\mathbf{C}^{\dagger}(\mathbf{y}(t_{i}))$ , from the noisy samples $\mathbf{z}(t)$ , we test two methods. The first method is based simply on the pseudo-inverse of $\hat{\mathbf{C}}(\mathbf{y}(t_{i}))$ . The second method applies the OS estimator $\eta_{\sigma}$ to (52) to estimate the local precision matrix. Once the local precision matrices are estimated, we construct the following distance matrix $\mathbf{D}$ , whose elements are given by

[TABLE]
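Both precision estimators compared here act only on the eigenvalues of the local sample covariance and keep its eigenvectors. The sketch below shows that common structure with a pluggable scalar shrinker. All function names are hypothetical. The hard-threshold rule is only a placeholder standing in for the OS estimator, whose closed form is not reproduced in this section; it should not be read as the optimal shrinker.

```python
import numpy as np

def shrunk_precision(cov, shrinker):
    """Apply a scalar shrinker to each eigenvalue of a symmetric matrix."""
    evals, evecs = np.linalg.eigh(cov)
    new_vals = np.array([shrinker(v) for v in evals])
    return (evecs * new_vals) @ evecs.T      # V diag(new_vals) V^T

def pinv_shrinker(tol=1e-10):
    """Moore-Penrose rule: invert every eigenvalue that is numerically nonzero."""
    return lambda lam: 1.0 / lam if lam > tol else 0.0

def hard_threshold_shrinker(cutoff):
    """Placeholder (NOT the paper's OS): invert only eigenvalues above a cutoff."""
    return lambda lam: 1.0 / lam if lam > cutoff else 0.0

def mahalanobis_sq(a, b, precision):
    """Squared Mahalanobis distance between a and b for a given precision matrix."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(diff @ precision @ diff)

cov = np.diag([4.0, 1.0, 0.0])
prec_pinv = shrunk_precision(cov, pinv_shrinker())
prec_hard = shrunk_precision(cov, hard_threshold_shrinker(cutoff=2.0))
print(mahalanobis_sq([2.0, 0.0, 0.0], [0.0, 0.0, 0.0], prec_pinv))
```

Swapping the shrinker is the only change between the two methods; the rest of the pipeline that builds $\mathbf{D}$ stays the same.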

We examine three cases of observation functions $f_{1}$ and $f_{2}$ :

Case I.

Let

[TABLE]

and set $\epsilon=10^{-3}$ , $dt=10^{-4}$ , $\delta t=10^{-7}$ , $N=3000$ , $p=50$ , $q=50$ , $\sigma=0.1$ , and $\sigma_{kernel}$ to the $20$ percent quantile of the distance matrix $\mathbf{D}$ acquired from (53). These parameters are taken from [10].

Case II.

Let

[TABLE]

and set the parameters as in Case I.

Case III.

Let

[TABLE]

We set the parameters as in Cases I and II, except for $\delta t$ , which is set to a finer value $10^{-10}$ , and $\sigma_{kernel}$ , which is set to the $5$ percent quantile of the distance matrix $\mathbf{D}$ . These observation functions create a “half-moon” shape, and were considered in [10].

In Fig. 4, we compare the results obtained for Case I by diffusion maps based on three distance matrices. The first distance matrix is a variant of (53), where the noisy signal $\mathbf{z}(t)$ is replaced by the clean signal $\mathbf{y}(t)$ . These results are plotted in the leftmost column. The second distance matrix is (53), where the estimates of the precision matrix are based on the pseudo-inverse. These results are plotted in the middle column. The third distance matrix is (53), where the estimates of the precision matrix are based on the proposed OS. These results are plotted in the rightmost column. In the first row, we show a scatter plot of the slow variable $x_{1}$ against its most correlated eigenvector of the transition matrix associated with the diffusion map. In the second row, we show the scatter plot of the fast variable $x_{2}$ against its most correlated eigenvector of the transition matrix associated with the diffusion map. The correlation between the respective variables and eigenvectors is shown in red above each subplot. In the third row, we present the eigenvalues of the transition matrix associated with the diffusion map, where $d_{k}$ is the $k$ -th largest eigenvalue. The eigenvalues corresponding to the eigenvectors presented in the first and second rows (corresponding to the slow and fast variables) are marked by red and blue circles, respectively. Fig. 5 and Fig. 6 are the same as Fig. 4, but for Case II and Case III.
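The steps from a distance matrix to the quantities plotted in these figures can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the kernel is assumed to have the form $\exp(-D_{ij}^{2}/\sigma_{kernel}^{2})$ with `dist` holding unsquared distances, and the transition matrix is assumed to be the row-normalized kernel.

```python
import numpy as np

def diffusion_map_eigs(dist, quantile=0.2, k=5):
    """Top-k eigenpairs of the row-stochastic transition matrix built from dist."""
    # Kernel scale chosen as a quantile of the pairwise distances.
    scale = np.quantile(dist[np.triu_indices_from(dist, k=1)], quantile)
    w = np.exp(-(dist / scale) ** 2)          # assumed Gaussian kernel form
    deg = w.sum(axis=1)
    # Symmetric matrix similar to the transition matrix diag(deg)^{-1} w.
    sym = w / np.sqrt(np.outer(deg, deg))
    evals, evecs = np.linalg.eigh(sym)
    order = np.argsort(evals)[::-1][:k]
    # Rescale to obtain right eigenvectors of the transition matrix.
    return evals[order], evecs[:, order] / np.sqrt(deg)[:, None]

def most_correlated(evecs, variable):
    """Index and |correlation| of the nontrivial eigenvector best matching variable."""
    corrs = [abs(np.corrcoef(evecs[:, j], variable)[0, 1])
             for j in range(1, evecs.shape[1])]   # skip the constant eigenvector
    best = int(np.argmax(corrs))
    return best + 1, corrs[best]

# Toy check: points on a line, so one latent variable should be recovered.
x = np.linspace(0.0, 1.0, 200)
dist = np.abs(x[:, None] - x[None, :])
evals, evecs = diffusion_map_eigs(dist)
idx, corr = most_correlated(evecs, x)
print(idx, corr)
```

In the experiments, `dist` would be the matrix $\mathbf{D}$ built from the estimated local precision matrices, and `variable` would be the slow variable $x_{1}$ or the fast variable $x_{2}$.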

We observe that all three figures present consistent results and trends. Focusing first on the leftmost column, we see that the use of MD in diffusion maps based on the clean signal accurately recovers the slow variable $x_{1}$ and attenuates the fast variable $x_{2}$ , conforming to the results presented in [10] for 2-dimensional observations only. In the middle column, we see that the addition of noise hinders both the recovery of the slow variable and the attenuation of the fast variable when the pseudo-inverse is used for the estimation of the MD. In contrast, in the rightmost column, we see that when the proposed OS is used, the results obtained from the noisy signal are comparable to those obtained from the clean signal in the leftmost column.

These results demonstrate, in the context of manifold learning, that the estimation of the MD based on the pseudo-inverse fails in the high-dimensional regime in the presence of noise. In addition, they show that using the proposed optimal shrinker indeed offers a remedy and gives rise to accurate manifold learning.

8. CONCLUSIONS

We proposed a new estimator for MD based on precision matrix shrinkage. For an appropriate loss function, we showed that the proposed estimator is asymptotically optimal and outperforms the classical implementation of MD using the Moore-Penrose pseudo-inverse of the sample covariance. Importantly, the proposed estimator is particularly beneficial when the data are noisy and high-dimensional, a case in which the classical MD estimator might completely fail. Consequently, we believe that the new estimator may be useful in modern data analysis applications involving, for example, local principal component analysis, metric design, and manifold learning.

In this work, we focused on the case in which the intrinsic dimensionality of the data (the rank of the covariance matrix) $d$ is unknown, and therefore, it was not explicitly used in the estimation. Yet, in many scenarios, this dimension is known. In this case, it could be beneficial to consider a direct truncation and use only the top $d$ eigen-pairs for the estimation of the precision matrix. While the benefit from such a truncation has been shown empirically in several applications [38, 45, 48, 22], it still requires a systematic investigation. For example, identifying the rank of the signal, or estimating the dimension of a manifold, are by themselves highly challenging tasks [16, 19]. Note that in the particular manifold setup, knowing the manifold dimension is essentially different from the rank-aware shrinker discussed in [7]; as we showed here, under the manifold setup, the rank of the covariance matrix associated with points residing inside a small neighborhood of any point on the manifold could be much larger than $d$ . Since the focus of this paper is MD recovery, the associated loss function for the OS of the precision matrix is the operator norm. As discussed in [7], there are other loss functions that we can choose, such as the Frobenius norm and the nuclear norm. It would be interesting to explore the OS for the precision matrix under those norms when they are needed. We shall also mention that the manifold model discussed in this paper is a simplified model for more complicated datasets with more nonlinear structure. While it is possible to apply the principal component analysis approach to denoise the data when the smooth compact manifold assumption holds, thanks to the fixed $K$ mentioned in Section 5, in general this approach might not be optimal, particularly when the geometric structure is more complicated. The current work explores this situation with a simplified manifold model and paves a way toward more complicated setups, and these cases will be explored in our future work.

Acknowledgements

MG has been supported by H-CSRC Security Research Center and Israeli Science Foundation grant no. 1523/16. MG and RT were supported by a grant from the Tel-Aviv University ICRC Research Center. We thank the anonymous reviewers for their constructive comments and critiques, which greatly improved the original manuscript.

Bibliography

[1] Sankaraleengam Alagapan, Hae Won Shin, Flavio Fröhlich, and Hau-Tieng Wu. Diffusion geometry approach to efficiently remove electrical stimulation artifacts in intracranial electroencephalography. Journal of Neural Engineering, 2018.
[2] J. Baik, G. Ben Arous, and S. Peche. Phase transition of the largest eigenvalue for non-null complex sample covariance matrices. Ann. Probab., 33(5):1643–1697, 2005.
[3] J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006.
[4] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[5] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.
[6] D. Dai and T. Holgersson. High-dimensional CLTs for individual Mahalanobis distances. In Müjgan Tez and Dietrich von Rosen, editors, Trends and Perspectives in Linear Statistical Inference, pages 57–68, 2018.
[7] D. L. Donoho, M. Gavish, and I. M. Johnstone. Optimal Shrinkage of Eigenvalues in the Spiked Covariance Model. The Annals of Statistics, 46(4):1742–1778, November 2018.
[8] D. L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. P. Natl. Acad. Sci. USA, 100(10):5591–5596, 2003.
