Principal Component Analysis for Multivariate Extremes

Holger Drees; Anne Sabourin (LTCI)

arXiv:1906.11043·math.ST·June 27, 2019

TL;DR

This paper introduces a PCA-based method for identifying lower-dimensional subspaces in high-dimensional heavy-tailed data, improving the analysis of extreme events by reducing dimensionality and providing theoretical guarantees.

Contribution

It applies PCA to radially thresholded data in a heavy-tailed setting and proves convergence of the estimated subspace to the optimal one with finite sample guarantees.

Findings

1. Empirical risk converges uniformly to the true risk for all projection subspaces.
2. The estimated subspace converges in probability to the optimal subspace.
3. Finite sample guarantees are provided for the reconstruction error.

Equations

$$\mu_{t}(B):=(b(t))^{-1}\,\mathbb{P}\{X\in tB\}\;\longrightarrow\;\mu(B)<\infty\quad\text{as }t\to\infty$$

$$\mu\Big\{x\in\mathbb{R}^{d}\;:\;\|x\|>r,\ \frac{x}{\|x\|}\in A\Big\}=c\,r^{-\alpha}H(A)$$

$$\mathbb{P}\Big\{\max_{i\in I}|X^{i}|>t,\ \max_{i\not\in I}|X^{i}|>t\Big\}=o\Big(\mathbb{P}\Big\{\max_{1\leq i\leq d}|X^{i}|>t\Big\}\Big)$$

$$\Theta_{i}:=\omega(X_{i})X_{i},\qquad 1\leq i\leq n$$

$$\theta(x):=\omega(x)x,\qquad\theta_{t}(x):=\theta(x)\bm{1}\{\|x\|>t\},\qquad\Theta:=\theta(X),\qquad\Theta_{t}:=\theta_{t}(X)$$

$$\exists\,\beta\in\Big(1-\frac{\alpha}{2},1\Big]\;\forall\,\lambda>0,\ x\in\mathbb{R}^{d}:\quad\omega(\lambda x)=\lambda^{-\beta}\omega(x)\quad\text{and}\quad c_{\omega}:=\sup_{x\in\mathbb{S}^{d-1}}\omega(x)<\infty$$

$$R_{\infty}(V):=P_{\infty}\|\mathbf{\Pi}_{V}\theta-\theta\|^{2}=P_{\infty}\|\mathbf{\Pi}_{V}^{\perp}\theta\|^{2}$$

$$R_{t}(V):=P_{t}\big(\|\mathbf{\Pi}_{V}^{\perp}\theta\|^{2}\big)=\operatorname{\mathbb{E}}_{t}\big(\|\mathbf{\Pi}_{V}^{\perp}\Theta\|^{2}\big)$$

$$\hat{R}_{t}(V):=\frac{1}{N_{t}}\sum_{i=1}^{n}\|\mathbf{\Pi}_{V}^{\perp}\Theta_{i}\|^{2}\,\bm{1}\{\|X_{i}\|>t\}\quad\text{with}\quad N_{t}:=\sum_{i=1}^{n}\bm{1}\{\|X_{i}\|>t\}$$

$$R_{n,k}(V)=\frac{1}{k}\sum_{i=1}^{n}\|\mathbf{\Pi}_{V}^{\perp}\Theta_{i,\hat{t}_{n,k}}\|^{2}$$

$$\begin{aligned}\operatorname{\mathbb{E}}\|Y_{t}\|^{\tau}&=\frac{t^{-\tau}}{\mathbb{P}\{\|X\|>t\}}\int_{t}^{\infty}u^{\tau}\,P^{\|X\|}(du)\\&=\frac{t^{-\tau}}{\mathbb{P}\{\|X\|>t\}}\,\tau\int_{t}^{\infty}u^{\tau-1}\,\mathbb{P}\{\|X\|>u\}\,du\\&\leq 2\,\frac{t^{-\tau}}{\mathbb{P}\{\|X\|>t\}}\,\tau\,\frac{t^{\tau}\,\mathbb{P}\{\|X\|>t\}}{\alpha-\tau}\\&=2\,\frac{\tau}{\alpha-\tau}\end{aligned}$$

$$\lim_{t\to\infty}t^{2(\beta-1)}R_{t}(V)=R_{\infty}(V)$$

$$t^{2(\beta-1)}R_{t}(V)=P_{t}\big(\|\mathbf{\Pi}_{V}^{\perp}t^{\beta-1}\theta\|^{2}\big)=\int f(x/t)\,P_{t}(dx)$$

$$\begin{aligned}\|x^{*}-y^{*}\|^{2}&\leq(\rho(V,W))^{2}+(1-\|\mathbf{\Pi}_{W}x^{*}\|)^{2}\\&\leq(\rho(V,W))^{2}+\Big(1-\sqrt{1-(\rho(V,W))^{2}}\Big)^{2}\\&=2\Big(1-\sqrt{1-(\rho(V,W))^{2}}\Big)\end{aligned}$$

$$\lim_{t\to\infty}\rho(V_{t}^{*},V_{\infty}^{*})=0$$

$$\begin{aligned}\big|\tilde{R}_{t}(V)-\tilde{R}_{t}(W)\big|&\leq t^{2(\beta-1)}P_{t}\Big(\big|\|\mathbf{\Pi}_{V}^{\perp}\theta\|-\|\mathbf{\Pi}_{W}^{\perp}\theta\|\big|\cdot\big(\|\mathbf{\Pi}_{V}^{\perp}\theta\|+\|\mathbf{\Pi}_{W}^{\perp}\theta\|\big)\Big)\\&\leq 2\,t^{2(\beta-1)}P_{t}\|\theta\|^{2}\,\rho(V,W)\\&\leq 2\,c_{\omega}^{2}\operatorname{\mathbb{E}}\|Y_{t}\|^{2(1-\beta)}\,\rho(V,W)\\&\leq 2\,c_{\omega}^{2}\,\frac{4(1-\beta)}{\alpha-2(1-\beta)}\,\rho(V,W)\end{aligned}$$

$$\rho(V_{n_{\ell}},V^{0})=\sup_{x\in\mathbb{S}^{d-1}}\big\|(U_{n_{\ell}}U_{n_{\ell}}^{\top}-U_{0}U_{0}^{\top})x\big\|\to 0$$

$$R_{\infty}(V_{\infty})\leq\tilde{R}_{t_{n}}(V_{\infty})+\frac{\varepsilon}{4}\leq\tilde{R}_{t_{n}}(V_{t_{n}}^{*})+\frac{\varepsilon}{2}\leq\tilde{R}_{t_{n}}(V_{\infty}^{*})+\frac{\varepsilon}{2}\leq R_{\infty}(V_{\infty}^{*})+\frac{3\varepsilon}{4}<R_{\infty}(V_{\infty})$$

$$R_{\infty}(V)=0\iff V_{0}\subset V$$

$$t_{n,k}:=F_{\|X\|}^{\leftarrow}(1-k/n)$$

$$\begin{aligned}&t_{n,k}^{2(\beta-1)}\Big|R_{n,k}(V)-\frac{1}{k}\sum_{i=1}^{n}\|\mathbf{\Pi}_{V}^{\perp}\Theta_{i,t_{n,k}}\|^{2}\Big|\\&\quad\leq\frac{1}{k}\,t_{n,k}^{2(\beta-1)}\sum_{i=1}^{n}\|\mathbf{\Pi}_{V}^{\perp}\Theta_{i}\|^{2}\,\big|\bm{1}\{\|X_{i}\|>\hat{t}_{n,k}\}-\bm{1}\{\|X_{i}\|>t_{n,k}\}\big|\\&\quad\leq\Big[\frac{1}{k}\sum_{i=1}^{n}t_{n,k}^{(2+\eta)(\beta-1)}\|\Theta_{i}\|^{2+\eta}\,\bm{1}\{\|X_{i}\|>t_{n,k}\wedge\hat{t}_{n,k}\}\Big]^{2/(2+\eta)}\\&\qquad\cdot\Big[\frac{1}{k}\sum_{i=1}^{n}\big|\bm{1}\{\|X_{i}\|>\hat{t}_{n,k}\}-\bm{1}\{\|X_{i}\|>t_{n,k}\}\big|^{(2+\eta)/\eta}\Big]^{\eta/(2+\eta)}\end{aligned}$$

Full text

Principal Component Analysis for Multivariate Extremes

Holger Drees

University of Hamburg, Department of Mathematics, Germany

Anne Sabourin

LTCI, Télécom Paris, Institut polytechnique de Paris, France

Abstract

The first order behavior of multivariate heavy-tailed random vectors above large radial thresholds is ruled by a limit measure in a regular variation framework. For a high dimensional vector, a reasonable assumption is that the support of this measure is concentrated on a lower dimensional subspace, meaning that certain linear combinations of the components are much likelier to be large than others. Identifying this subspace and thus reducing the dimension will facilitate a refined statistical analysis. In this work we apply Principal Component Analysis (PCA) to a re-scaled version of radially thresholded observations.

Within the statistical learning framework of empirical risk minimization, our main focus is to analyze the squared reconstruction error for the exceedances over large radial thresholds. We prove that the empirical risk converges to the true risk, uniformly over all projection subspaces. As a consequence, the best projection subspace is shown to converge in probability to the optimal one, in terms of the Hausdorff distance between their intersections with the unit sphere. In addition, if the exceedances are re-scaled to the unit ball, we obtain finite sample uniform guarantees to the reconstruction error pertaining to the estimated projection subspace. Numerical experiments illustrate the relevance of the proposed framework for practical purposes.

Key words: Principal Component Analysis, Multivariate extreme value analysis, dimensionality reduction, Empirical Risk Minimization.

MSC primary 62G32; secondary 62H25.

1 Introduction

If one wants to analyze the tail behavior of an $\mathbb{R}^{d}$-valued random vector $X=(X^{1},\ldots,X^{d})$, one usually assumes that $X$ is regularly varying (if necessary after a standardization of the marginal distributions), i.e. there exist a normalizing function $b$ and a non-zero measure $\mu$ on $\mathbb{R}^{d}\setminus\{0\}$ such that

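$$\mu_{t}(B):=(b(t))^{-1}\,\mathbb{P}\{X\in tB\}\;\longrightarrow\;\mu(B)<\infty\quad\text{as }t\to\infty\tag{1.1}$$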

for all $\mu$ -continuous Borel sets $B$ that are bounded away from the origin. Equation (1.1) may be understood as a generalization to arbitrary dimension of a heavy-tail assumption regarding a real-valued random variable. This mathematical framework is particularly useful in situations where the focus is on ‘tail events’ of the kind $\{X\in B\}$ where the distance to the origin $u=\inf\{\|x\|:x\in B\}$ is large, for some norm $\|\cdot\|$ . In a risk management context, the probability of such tail events is of crucial importance. If the distance $u$ is so large that few or no data are available in the considered region, all attempts to resort to empirical estimation are in vain. One common idea behind statistical methods based on Extreme Value Theory (EVT) is to use a small proportion of the available data (those with a comparatively large norm) to learn an estimate for $\mu$ , to be used for quantifying the probability of tail events.

1.1 Regular Variation

A substantial reference concerning the probabilistic aspects of regular variation in the setting of EVT is Resnick, (2013), see also Resnick, (2007) for application-oriented examples. Regular variation for Borel measures on Polish spaces has since been revisited in Hult and Lindskog, (2006) and Lindskog et al., (2014). It is well known that if Equation (1.1) holds true, then the limit measure $\mu$ is homogeneous of order $-\alpha$ for some $\alpha>0$ . Moreover, the normalizing function $b$ and the norm $\|X\|$ are regularly varying, too: $b(tx)/b(t)\to x^{-\alpha}$ and $\mathbb{P}\{\|X\|>tx\}/\mathbb{P}\{\|X\|>t\}\to x^{-\alpha}$ as $t\to\infty$ for all $x>0$ . Here $\|\cdot\|$ may be any norm on $\mathbb{R}^{d}$ , but in what follows we only consider the Euclidean norm.

Because the limit measure is homogeneous, after a polar transformation, it can be decomposed into a so-called spectral (or angular) probability measure $H$ and an independent radial component, that is

$$\mu\Big\{x\in\mathbb{R}^{d}\;:\;\|x\|>r,\ \frac{x}{\|x\|}\in A\Big\}=c\,r^{-\alpha}H(A),$$

for all $r>0$ and all Borel subsets $A$ of the unit sphere, with $c:=\mu\{x\;:\;\|x\|>1\}$ . Whereas the literature is plentiful concerning the design and the asymptotics of flexible multivariate parametric or non-parametric models for $\mu$ or integrated versions of it (see e.g. Segers, (2012); Fougères et al., (2015); Einmahl et al., (2001); Genest and Segers, (2009); Rootzén et al., (2018), or Beirlant et al., (2006) and the references therein), the issue of how to escape the curse of dimensionality has only recently been raised (see below). One reason for this may be that a major application of EVT is environmental, spatial extremes such as heavy rainfalls, heat waves, droughts or floods. In this context, max-stable or generalized Pareto spatial models are widely used (Padoan et al., (2010); Ferreira and de Haan, (2014); Schlather, (2002)) which have built in a priori information about the spatial dependence structure, thus reducing the effective dimension.

1.2 Dimensionality reduction for extreme values, a brief overview

For applications such as anomaly detection or network monitoring, where no particular structure is known a priori, dimensionality reduction is a natural preliminary step before implementing any kind of learning procedure, and the subject has recently received increasing attention. If $d$ is moderate or large, the measure $\mu$ (and hence $H$) will often exhibit some ‘sparse’ structure. For example, if some of the components of $X$ are asymptotically independent, i.e. for some index set $I\subset\{1,\ldots,d\}$ of size $|I|\in\{1,\ldots,d-1\}$

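$$\mathbb{P}\Big\{\max_{i\in I}|X^{i}|>t,\ \max_{i\not\in I}|X^{i}|>t\Big\}=o\Big(\mathbb{P}\Big\{\max_{1\leq i\leq d}|X^{i}|>t\Big\}\Big),$$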

then $\mu$ is concentrated on $\{x\in\mathbb{R}^{d}\mid\max_{i\in I}|x_{i}|=0\text{ or }\max_{i\not\in I}|x_{i}|=0\}$ . More generally, one may consider the case where only a small number of subsets of components $\{I_{k}\subset\{1,\ldots,d\},k=1,\ldots,K\}$ are likely to be large simultaneously, while the other components remain small. Here, ‘small number’ is understood relatively to the $2^{d}-1$ non empty possible subsets of components. This setting applies e.g. to heavy rainfalls in a spatial setting (storms are usually localized, so that neighboring sites are more likely to be concomitantly impacted) or of shocks over different assets of a financial portfolio. Chautru, (2015) proposes a clustering approach combined with spherical data analysis to detect structures of this type. Goix et al., (2016, 2017) propose an algorithm with moderate computational cost (linear in the dimension and the sample size) and finite sample uniform guarantees. Their error bounds are linear in $d$ and scale as $1/\sqrt{k}$ , where $k$ is the number of order statistics of each component which are considered extreme during the training step. A refinement of the latter framework is proposed in the yet unpublished work of Simpson et al., (2018). Chiapino and Sabourin, (2016) and Chiapino et al., (2019) aim at identifying subgroups of components for which the probability of a joint excess over a large quantile is not negligible compared to that of an excess by a single component. Engelke and Hitz, (2018) use graphical models to reduce the complexity of the extremal dependence structure. In a regression context, Gardes, (2018) sets up a mathematical framework for tail dimension reduction suited to the case where the distribution of the target variable above high thresholds only depends on the projection of the covariates on a lower dimensional subspace. 
Consistency of K-means clustering applied to the most extreme observations of a data set has recently been proven in the unpublished work of Janßen and Wan, (2019).

1.3 Principal component analysis (PCA) and support identification

Here we focus on finding a linear subspace on which $\mu$ is (nearly) concentrated. In a classical setting, when $\|X\|$ has finite second moments, PCA (Anderson, (1963)) is the method of choice to determine such supporting linear subspaces if $\mathit{i.i.d.}$ random vectors $X_{i}$ , $1\leq i\leq n$ , with the same distribution as $X$ are observed. Theoretical guarantees obtained so far concern the reconstruction error (Koltchinskii and Giné, (2000); Shawe-Taylor et al., (2005); Blanchard et al., (2007); Koltchinskii and Lounici, (2017)) or the approximation error for the eigenspaces of the covariance matrix (Zwald and Blanchard, (2006)), under the assumption that the sample space (or the feature space for Kernel-PCA) has finite diameter or that sufficiently high order moments exist.

For motivation of our version of PCA, it is useful to keep the following working hypothesis in mind, although it is not required for most results to hold:

Hypothesis 1.

The vector space $V_{0}=\operatorname{span}(\operatorname{supp}\mu)$ generated by the support of $\mu$ has dimension $p<d$ .

Note that then the points $(X_{i}/t)\bm{1}\{\|X_{i}\|>t\}$ are more and more concentrated on a neighborhood of $V_{0}$ as $t$ increases, but that usually they will not lie on $V_{0}$. If the dimension $p$ of $V_{0}$ is known, then it is natural to approximate $V_{0}$ by the subspace of dimension $p$ which is ‘closest’ in expectation to these points.

In PCA one measures the closeness by the squared Euclidean distance, which greatly simplifies the optimization problem as one may work with orthogonal projections in the Hilbert space $L_{2}$. However, this approach assumes finite second moments, which cannot be taken for granted in the above setting. Indeed, if $\alpha<2$ then $\operatorname{\mathbb{E}}(\|X_{i}\|^{2})=\infty$. Hence, we will instead consider re-scaled vectors

$$\Theta_{i}:=\omega(X_{i})X_{i},\qquad 1\leq i\leq n,$$

where $\omega:\mathbb{R}^{d}\to(0,\infty)$ is a suitable scaling function. The most common choice is $\omega(x)=1/\|x\|$ , leading to $\Theta_{i}$ on the unit sphere, and we will focus on this re-scaling when we derive finite sample bounds on the reconstruction error (see Section 3). However, consistency results will be proved for considerably more general scaling functions; cf. Section 2.
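The re-scaling step can be sketched in a few lines of NumPy. This is only an illustration, assuming the scaling $\omega(x)=\|x\|^{-\beta}$ with $\beta=1$ by default; the function name `rescale` and the Student-t toy sample are hypothetical choices made here and do not come from the paper.

```python
import numpy as np

def rescale(X, beta=1.0):
    """Return Theta_i = omega(X_i) * X_i with the assumed omega(x) = ||x||^(-beta)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # Euclidean norm of each row
    return X * norms ** (-beta)

# Toy heavy-tailed sample (assumption): Student-t with 1.5 degrees of freedom,
# so second moments of X are infinite while Theta stays bounded.
rng = np.random.default_rng(0)
X = rng.standard_t(df=1.5, size=(1000, 3))
Theta = rescale(X)  # beta = 1 puts every Theta_i on the unit sphere
```

With $\beta=1$ the second moments of $\Theta$ are trivially finite, which is what makes a PCA of the re-scaled vectors meaningful even when $\alpha<2$.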

To the best of our knowledge, the only existing work considering PCA properly speaking for high dimensional extremes is the unpublished paper of Cooley and Thibaud, (2016). The authors discuss a transformation mapping negative observations to small positive ones and apply PCA in this transformed space. They also use a preliminary re-scaling involving the norm of the transformed vector. They illustrate their approach with simulations and real data examples, without deriving theoretical statistical guarantees.

1.4 Notation and risk minimization setting

To give a formal description of our method, we first introduce some notation. All random variables are defined on some probability space $(\mathcal{X},\mathcal{A},\mathbb{P})$ ; the expectation with respect to $\mathbb{P}$ is denoted by $\operatorname{\mathbb{E}}$ . For $x\in\mathbb{R}^{d}$ and $t>0$ , let

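$$\theta(x):=\omega(x)x,\qquad\theta_{t}(x):=\theta(x)\bm{1}\{\|x\|>t\},\qquad\Theta:=\theta(X),\qquad\Theta_{t}:=\theta_{t}(X).\tag{1.4}$$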

By $P$ we denote the distribution of $X$ and by $P_{t}$ its conditional distribution given that $\|X\|>t$, i.e. $P_{t}(\cdot)=\mathbb{P}(X\in\cdot\mid\|X\|>t)$. Then $P_{\infty}:=\mu/\mu((B_{1}(0))^{c})$ is the weak limit of $P_{t}(t\,\cdot\,)$ (with $B_{1}(0)$ denoting the closed unit ball); cf. (1.1).

For any probability measure $Q$ and any $Q$ -integrable function $f$ , we denote the expectation of $f$ with respect to $Q$ by $Qf$ or $Q(f)$ . By $\operatorname{\mathbb{E}}_{t}$ we denote the conditional expectation (with respect to $\mathbb{P}$ ) given $\|X\|>t$ so that $\operatorname{\mathbb{E}}_{t}(f(X))=P_{t}(f)$ , provided the expectations exist.

For any linear subspace $V\subset\mathbb{R}^{d}$ , let $\mathbf{\Pi}_{V}$ be the orthogonal projection onto $V$ (or the associated projection matrix), and let $\mathbf{\Pi}_{V}^{\perp}$ be the orthogonal projection onto the orthogonal complement $V^{\perp}$ of $V$ .

To apply PCA to the re-scaled vectors, we have to assume that the scaling function $\omega$ is chosen such that $\operatorname{\mathbb{E}}(\|\Theta\|^{2})=P(\|\theta\|^{2})<\infty$ and $P_{\infty}(\|\theta\|^{2})<\infty$ . Note that this condition is always fulfilled if there exist $\beta>1-\alpha/2$ and $c>0$ such that $\omega(x)\leq c\|x\|^{-\beta}$ for all $x\in\mathbb{R}^{d}$ . For simplicity’s sake, in what follows we will impose the following stronger homogeneity condition:

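$$\exists\,\beta\in\Big(1-\frac{\alpha}{2},1\Big]\;\forall\,\lambda>0,\ x\in\mathbb{R}^{d}:\quad\omega(\lambda x)=\lambda^{-\beta}\omega(x)\quad\text{and}\quad c_{\omega}:=\sup_{x\in\mathbb{S}^{d-1}}\omega(x)<\infty,\tag{1.5}$$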

where $\mathbb{S}^{d-1}:=\{x\in\mathbb{R}^{d}\,:\,\|x\|=1\}$ denotes the unit sphere. Note that then $\|\theta(x)\|\leq c_{\omega}\|x\|^{1-\beta}$ . The choice $\omega(x)=\|x\|^{-\beta}$ seems natural, but different choices allow for focusing on particular aspects of the extreme value behavior. For instance, if one is only interested in the positive components of $X$ , one may choose $\omega(x)=\|x\|^{-\beta}\bm{1}_{[0,\infty)^{d}}(x)$ .
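The role of the lower bound $\beta>1-\alpha/2$ can be made explicit with a short calculation. The following is a sketch for the case $\beta\le 1$ only, using the abbreviation $\tau:=2(1-\beta)$ (introduced here for convenience) and the fact that $\|X\|$ has a regularly varying tail with index $-\alpha$, so that its moments of order less than $\alpha$ are finite.

```latex
% Assumption for this sketch: 1 - \alpha/2 < \beta \le 1, hence 0 \le \tau := 2(1-\beta) < \alpha.
\|\theta(x)\|^{2} = \omega(x)^{2}\,\|x\|^{2} \le c_{\omega}^{2}\,\|x\|^{2(1-\beta)} = c_{\omega}^{2}\,\|x\|^{\tau},
\qquad\text{so}\qquad
\mathbb{E}\|\Theta\|^{2} \le c_{\omega}^{2}\,\mathbb{E}\|X\|^{\tau} < \infty .

% Under P_\infty the radius is Pareto distributed: P_\infty\{\|x\| > r\} = r^{-\alpha} for r \ge 1. Hence
P_{\infty}\|\theta\|^{2} \le c_{\omega}^{2}\int_{1}^{\infty} r^{\tau}\,\alpha\,r^{-\alpha-1}\,dr
 = c_{\omega}^{2}\,\frac{\alpha}{\alpha-\tau} < \infty .
```

Both bounds break down once $\tau\ge\alpha$, i.e. once $\beta\le 1-\alpha/2$, which is why the admissible range of $\beta$ depends on the tail index.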

Hypothesis 1 is equivalent to the statement that $\inf_{V:\dim(V)=p}R_{\infty}(V)=0$ and $\inf_{V:\dim(V)=p^{\prime}}R_{\infty}(V)>0$ for all $p^{\prime}<p$ where

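$$R_{\infty}(V):=P_{\infty}\|\mathbf{\Pi}_{V}\theta-\theta\|^{2}=P_{\infty}\|\mathbf{\Pi}_{V}^{\perp}\theta\|^{2}$$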

and the infima are taken over all linear subspaces of the specified dimension. The risk $R_{\infty}$ may be interpreted as the expected reconstruction error in the limit model if the re-scaled observation $\Theta$ is replaced with its lower dimensional approximation $\mathbf{\Pi}_{V}\Theta$ . Since $P_{t}(t\cdot)\to P_{\infty}(\cdot)$ weakly, one may approximate $V_{0}$ by a subspace $V_{t}^{*}=V_{t}^{p*}$ of dimension $p$ which minimizes the conditional risk

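$$R_{t}(V):=P_{t}\big(\|\mathbf{\Pi}_{V}^{\perp}\theta\|^{2}\big)=\operatorname{\mathbb{E}}_{t}\big(\|\mathbf{\Pi}_{V}^{\perp}\Theta\|^{2}\big)$$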

given that $\|X\|$ exceeds a high threshold $t>0$ . Note that $V_{t}^{*}$ may be of interest even if Hypothesis 1 only holds approximately, in the sense that $P_{\infty}$ concentrates most of its mass on a small neighborhood of a $p$ -dimensional subspace.

It is natural to ‘estimate’ $V_{t}^{*}$ (and thus $V_{0}$ ) by a minimizer of the corresponding empirical risk

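$$\hat{R}_{t}(V):=\frac{1}{N_{t}}\sum_{i=1}^{n}\|\mathbf{\Pi}_{V}^{\perp}\Theta_{i}\|^{2}\,\bm{1}\{\|X_{i}\|>t\}\quad\text{with}\quad N_{t}:=\sum_{i=1}^{n}\bm{1}\{\|X_{i}\|>t\}.$$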

Here the threshold $t$ must be chosen suitably, depending on the sample size. To this end, often order statistics of the norms of the observed vectors are used, and we follow this approach. Let $X_{(j)}=X_{\sigma(j)}$ where $\sigma$ is a permutation of indices such that $\|X_{(1)}\|\geq\|X_{(2)}\|\geq\dotsb\geq\|X_{(n)}\|$ . (For brevity, we suppress the dependence on $n$ in our notation of order statistics.) For $1\leq k\leq n$ , denote by $\hat{t}_{n,k}=\|X_{(k+1)}\|$ the empirical quantile of level $1-k/n$ for $\|X\|$ . We define the empirical risk for the subspace $V$ related to the $k$ largest observations as

$$R_{n,k}(V)=\frac{1}{k}\sum_{i=1}^{n}\|\mathbf{\Pi}_{V}^{\perp}\Theta_{i,\hat{t}_{n,k}}\|^{2},$$

where $\Theta_{i,t}=\theta_{t}(X_{i})$ in accordance with the notation introduced in (1.4). A minimizer of $R_{n,k}(V)$ among all linear subspaces of dimension $p$ will be denoted by $\hat{V}_{n}=\hat{V}_{n}^{p}$ . It is the main goal of the present paper to analyze the asymptotic and the finite sample behavior of the empirical risk $R_{n,k}(V)$ and its minimizer $\hat{V}_{n}$ .
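As an illustration of the definition of $R_{n,k}(V)$, the following sketch evaluates the empirical risk for a subspace given by a matrix `U` with orthonormal columns. It assumes the scaling $\omega(x)=\|x\|^{-\beta}$ and requires $k<n$ so that $\|X_{(k+1)}\|$ exists; the function name `empirical_risk` and the toy data are hypothetical.

```python
import numpy as np

def empirical_risk(X, U, k, beta=1.0):
    """R_{n,k}(V) for V = span of the orthonormal columns of U,
    with the assumed scaling omega(x) = ||x||^(-beta)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= k < n:
        raise ValueError("need 1 <= k < n so that ||X_(k+1)|| exists")
    norms = np.linalg.norm(X, axis=1)
    t_hat = np.sort(norms)[::-1][k]        # (k+1)-th largest norm, ||X_(k+1)||
    extreme = norms > t_hat                # exactly k points when there are no ties
    Theta = X[extreme] * norms[extreme, None] ** (-beta)
    resid = Theta - Theta @ U @ U.T        # component orthogonal to V
    return float(np.sum(resid ** 2) / k)

# Tiny hand-checkable example in R^2, projecting onto the first axis.
X_toy = np.array([[10.0, 0.0], [0.0, 5.0], [1.0, 0.0], [0.5, 0.5]])
U_toy = np.array([[1.0], [0.0]])
risk_toy = empirical_risk(X_toy, U_toy, k=2)  # 0.5
```

In the toy example the two largest observations are re-scaled to $(1,0)$ and $(0,1)$; only the second has a non-zero residual (of squared norm 1), which gives the value $1/2$.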

1.5 Outline

In Section 2 we will first show that the minimizer of the risk $R_{t}$ based on a finite threshold $t$ converges to the minimizer of the limit risk $R_{\infty}$ , and thus under Hypothesis 1 to $V_{0}$ , as $t\to\infty$ . Moreover, we show consistency of the empirical minimizer $\hat{V}_{n}$ under condition (1.5) with suitable $\beta$ . In Section 3, we derive non-asymptotic uniform bounds on $|R_{n,k}(V)-R_{t_{n,k}}(V)|$ and $|\hat{R}_{t}(V)-R_{t}(V)|$ for the most important scaling $\omega(x)=1/\|x\|$ . Furthermore, we construct uniform confidence bands for $R_{t}(V)$ . The results obtained in a simulation study are reported in Section 4. In particular, we explore the choice of the dimension $p$ based on empirical risk plots and the effect of a PCA projection on estimators of probabilities expressed in terms of the spectral measure $H$ . Finally, Section 5 contains some details about the proof of a modification of a result by Blanchard et al., (2007).

2 Consistency of risk minimizers

In this section we first discuss how to calculate minimizers of the conditional risk $R_{t}$ given $\|X\|>t$ and the empirical risk $R_{n,k}$ . Moreover, we prove that these converge in some sense towards a minimizer of $R_{\infty}$ .

It is well known that a point of minimum of $V\mapsto\operatorname{\mathbb{E}}\|\mathbf{\Pi}_{V}^{\perp}Y\|^{2}$ can be derived from the spectral analysis of the matrix of second (mixed) moments of $Y$ :

Lemma 2.1.

(i) Let $Y$ be an $\mathbb{R}^{d}$-valued random vector with $\operatorname{\mathbb{E}}(\|Y\|^{2})<\infty$ and $\Sigma:=\operatorname{\mathbb{E}}(YY^{\top})$. Let $\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{d}\geq 0$ denote the eigenvalues of $\Sigma$ with corresponding orthogonal eigenvectors $x_{1},\ldots,x_{d}$. Then $V^{*}=\operatorname{span}(x_{1},\ldots,x_{p})$ minimizes $\operatorname{\mathbb{E}}(\|\mathbf{\Pi}_{V}^{\perp}Y\|^{2})$ among all linear subspaces $V$ of dimension $p$. In the case $\lambda_{p}>\lambda_{p+1}$ it is the unique minimizer.

(ii) If the scaling condition (1.5) holds and $\lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{d}\geq 0$ denote the eigenvalues of $\Sigma_{t}:=\operatorname{\mathbb{E}}_{t}(\Theta\Theta^{\top})$ with corresponding orthogonal eigenvectors $x_{1},\ldots,x_{d}$, then $V^{*}=\operatorname{span}(x_{1},\ldots,x_{p})$ minimizes $R_{t}(V)$ among all linear subspaces $V$ of dimension $p$. In the case $\lambda_{p}>\lambda_{p+1}$ it is the unique minimizer.

(iii) If the scaling condition (1.5) holds and $\lambda_{n,1}\geq\lambda_{n,2}\geq\cdots\geq\lambda_{n,d}\geq 0$ denote the eigenvalues of $\Sigma_{n,k}:=k^{-1}\sum_{i=1}^{n}({\Theta}_{i,\hat{t}_{n,k}}{\Theta}_{i,\hat{t}_{n,k}}^{\top})$ with corresponding orthogonal eigenvectors $x_{n,1},\ldots,x_{n,d}$, then $\hat{V}_{n}=\operatorname{span}(x_{n,1},\ldots,x_{n,p})$ minimizes $R_{n,k}(V)$ among all linear subspaces $V$ of dimension $p$.

A proof of assertion (i) can e.g. be found in Seber, (1984), Theorem 5.3, where also other optimality properties of the minimizers are given. Both the other results follow directly by an application of (i) with $Y$ equal to $\Theta$ conditional on $\|X\|>t$ , respectively a random variable according to the empirical distribution of the ${\Theta}_{i}$ for which $\|X_{i}\|>\hat{t}_{n,k}$ . If $\lambda_{p}=\lambda_{p+1}$ , then the minimizer is not unique. With $m=\min\{i\in\{1,\ldots,p\}\;:\;\lambda_{i}=\lambda_{p}\}$ any minimizer $V_{t}^{*}$ of $R_{t}$ can be represented as $V_{t}^{*}=\operatorname{span}(x_{1},\ldots,x_{m-1},\tilde{x}_{m},\ldots,\tilde{x}_{p})$ where $\tilde{x}_{m},\ldots,\tilde{x}_{p}$ are orthogonal eigenvectors to the eigenvalue $\lambda_{p}$ and all these subspaces are minimizers. An analogous statement holds for the empirical risk.
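Part (iii) of Lemma 2.1 translates directly into an eigen-decomposition of $\Sigma_{n,k}$. The sketch below is one possible implementation under the assumption $\omega(x)=\|x\|^{-\beta}$; the function name `extreme_pca` and the simulated data (Pareto radii times Gaussian directions in a two-dimensional subspace, plus light-tailed noise) are illustrative choices and not the paper's simulation design.

```python
import numpy as np

def extreme_pca(X, k, p, beta=1.0):
    """Minimizer of R_{n,k} over p-dimensional subspaces via the spectrum of Sigma_{n,k}.
    Assumes omega(x) = ||x||^(-beta) and 1 <= k < n."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    t_hat = np.sort(norms)[::-1][k]                 # ||X_(k+1)||
    extreme = norms > t_hat
    Theta = X[extreme] * norms[extreme, None] ** (-beta)
    Sigma = Theta.T @ Theta / k                     # Sigma_{n,k}
    eigvals, eigvecs = np.linalg.eigh(Sigma)        # eigh returns ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    U = eigvecs[:, :p]                              # orthonormal basis of the estimated subspace
    risk = float(eigvals[p:].sum())                 # R_{n,k} at the minimizer
    return U, eigvals, risk

# Toy data (assumption): heavy-tailed signal in a random 2-dim subspace of R^5.
rng = np.random.default_rng(1)
n, d, p = 5000, 5, 2
radius = rng.pareto(2.0, size=n) + 1.0              # classical Pareto radii, alpha = 2
B = np.linalg.qr(rng.normal(size=(d, p)))[0]        # orthonormal basis of the toy V_0
X = radius[:, None] * (rng.normal(size=(n, p)) @ B.T) + rng.normal(scale=0.1, size=(n, d))
U, eigvals, risk = extreme_pca(X, k=200, p=p)
```

The returned `risk` uses the identity $R_{n,k}(\hat V_n)=\operatorname{tr}(\Sigma_{n,k})-\sum_{j\le p}\lambda_{n,j}=\sum_{j>p}\lambda_{n,j}$, so no explicit projection is needed once the spectrum is known.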

Next we discuss the relationship between $R_{t}$ and $R_{\infty}$ and their respective minimizers. The convergence of the risks is an immediate consequence of the following simple lemma.

Lemma 2.2.

Let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a measurable function that is locally bounded, $P_{\infty}$ -a.e. continuous and satisfies $\limsup_{\|x\|\to\infty}|f(x)|\|x\|^{-\tilde{\alpha}}<\infty$ for some $\tilde{\alpha}<\alpha$ . Then $\lim_{t\to\infty}\int f(x/t)\,P_{t}(dx)=\int f(x)\,P_{\infty}(dx)$ .

Proof.

According to (1.1), $P_{t}(t\,\cdot\,)=\mathbb{P}(X\in t\,\cdot\,\mid\|X\|>t)\to\mu(\,\cdot\,)/\mu((B_{1}(0))^{c})=P_{\infty}(\,\cdot\,)$ weakly. Let $Y_{t}$ be a random vector with distribution $P_{t}(t\,\cdot\,)$ and $Y_{\infty}$ a random vector with distribution $P_{\infty}$ . Since $\int f(x/t)\,P_{t}(dx)=\operatorname{\mathbb{E}}f(Y_{t})$ , the assertion follows if the $f(Y_{t})$ are asymptotically uniformly integrable (see Van der Vaart, (2000), Theorem 2.20).

By assumption $f(Y_{t})$ can be bounded by a multiple of $1+\|Y_{t}\|^{\tilde{\alpha}}$ . Now, for all $\tau\in[0,\alpha)$ and $t\geq t_{0}$ for some sufficiently large $t_{0}$ , integration by parts, regular variation of $u\mapsto u^{\tau-1}\mathbb{P}\{\|X\|>u\}$ and Karamata’s theorem yield

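$$\begin{aligned}\operatorname{\mathbb{E}}\|Y_{t}\|^{\tau}&=\frac{t^{-\tau}}{\mathbb{P}\{\|X\|>t\}}\int_{t}^{\infty}u^{\tau}\,P^{\|X\|}(du)\\&=\frac{t^{-\tau}}{\mathbb{P}\{\|X\|>t\}}\,\tau\int_{t}^{\infty}u^{\tau-1}\,\mathbb{P}\{\|X\|>u\}\,du\\&\leq 2\,\frac{t^{-\tau}}{\mathbb{P}\{\|X\|>t\}}\,\tau\,\frac{t^{\tau}\,\mathbb{P}\{\|X\|>t\}}{\alpha-\tau}\\&=2\,\frac{\tau}{\alpha-\tau}.\end{aligned}\tag{2.1}$$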

In particular, $\sup_{t\geq t_{0}}\operatorname{\mathbb{E}}\|Y_{t}\|^{\tilde{\alpha}(1+\varepsilon)}<\infty$ for $\varepsilon\in(0,\alpha/\tilde{\alpha}-1)$ , so that $\|Y_{t}\|^{\tilde{\alpha}}$ and thus $f(Y_{t})$ are asymptotically uniformly integrable. ∎

Corollary 2.3.

Suppose that $\omega$ fulfills condition (1.5). Then, for any subspace $V$ of $\mathbb{R}^{d}$ , the suitably standardized associated finite threshold risk converges:

$$\lim_{t\to\infty}t^{2(\beta-1)}R_{t}(V)=R_{\infty}(V).$$

Proof.

Note that by the homogeneity of $\omega$ ,

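$$t^{2(\beta-1)}R_{t}(V)=P_{t}\big(\|\mathbf{\Pi}_{V}^{\perp}t^{\beta-1}\theta\|^{2}\big)=\int f(x/t)\,P_{t}(dx)$$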

with $f(x):=\|\mathbf{\Pi}_{V}^{\perp}\theta(x)\|^{2}=\|\mathbf{\Pi}_{V}^{\perp}\omega(x)x\|^{2}\leq c_{\omega}^{2}\|x\|^{2(1-\beta)}$ . Since $2(1-\beta)<\alpha$ , Lemma 2.2 yields the assertion. ∎
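The homogeneity step used in this proof can be spelled out as follows. It only uses $\theta(x)=\omega(x)x$ and the scaling property $\omega(\lambda x)=\lambda^{-\beta}\omega(x)$ from condition (1.5), applied with $\lambda=t$.

```latex
\theta(x) = \omega(x)\,x
          = \omega\big(t\cdot\tfrac{x}{t}\big)\,x
          = t^{-\beta}\,\omega\big(\tfrac{x}{t}\big)\,x
          = t^{1-\beta}\,\omega\big(\tfrac{x}{t}\big)\,\tfrac{x}{t}
          = t^{1-\beta}\,\theta\big(\tfrac{x}{t}\big),
\qquad\text{hence}\qquad
\big\|\mathbf{\Pi}_{V}^{\perp}\,t^{\beta-1}\theta(x)\big\|^{2}
          = \big\|\mathbf{\Pi}_{V}^{\perp}\theta\big(\tfrac{x}{t}\big)\big\|^{2}
          = f\big(\tfrac{x}{t}\big).
```

Integrating this identity with respect to $P_{t}$ expresses the standardized risk $t^{2(\beta-1)}R_{t}(V)$ as $\int f(x/t)\,P_{t}(dx)$, which is exactly the form required by Lemma 2.2.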

In view of Corollary 2.3, one may ask whether a minimizer of $\tilde{R}_{t}:=t^{2(\beta-1)}R_{t}$ (which of course is also a minimizer of $R_{t}$) converges in some sense to a minimizer of $R_{\infty}$. Denote by $\mathcal{V}_{p}$ the set of all subspaces of $\mathbb{R}^{d}$ of dimension $p$, endowed with the metric $\rho(V,W)=|\!|\!|\mathbf{\Pi}_{V}-\mathbf{\Pi}_{W}|\!|\!|=|\!|\!|\mathbf{\Pi}_{V}^{\perp}-\mathbf{\Pi}_{W}^{\perp}|\!|\!|=\sup_{x\in\mathbb{S}^{d-1}}\|\mathbf{\Pi}_{V}^{\perp}x-\mathbf{\Pi}_{W}^{\perp}x\|$, where $|\!|\!|\,\cdot\,|\!|\!|$ denotes the operator norm.
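For concreteness, the metric $\rho$ can be computed numerically as the spectral norm of the difference of two projection matrices. The sketch below is a minimal illustration; the helper names `projector` and `rho` are hypothetical, and subspaces are assumed to be given by matrices whose columns are linearly independent spanning vectors.

```python
import numpy as np

def projector(A):
    """Orthogonal projection matrix onto the column span of A (full column rank assumed)."""
    Q, _ = np.linalg.qr(A)       # orthonormalize the spanning vectors
    return Q @ Q.T

def rho(A, B):
    """Operator-norm distance between the projections onto span(A) and span(B)."""
    return float(np.linalg.norm(projector(A) - projector(B), ord=2))

# Two lines in R^2 at an angle of 30 degrees.
angle = np.pi / 6
line_v = np.array([[1.0], [0.0]])
line_w = np.array([[np.cos(angle)], [np.sin(angle)]])
dist = rho(line_v, line_w)       # equals sin(angle) = 0.5
```

For two lines at angle $a$ the difference of the projectors has eigenvalues $\pm\sin a$, so the example returns $\sin(\pi/6)=1/2$.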

Remark 2.4.

Note that $\rho(V,W)$ also gives an upper bound on the Hausdorff distance between $V\cap\mathbb{S}^{d-1}$ and $W\cap\mathbb{S}^{d-1}$ . To see this, let $x^{*}\in V\cap\mathbb{S}^{d-1}$ and $y^{*}\in W\cap\mathbb{S}^{d-1}$ be such that the Hausdorff distance equals $\inf_{y\in W\cap\mathbb{S}^{d-1}}\|x^{*}-y\|=\|x^{*}-y^{*}\|$ . Then $y^{*}=\mathbf{\Pi}_{W}x^{*}/\|\mathbf{\Pi}_{W}x^{*}\|$ , $\|x^{*}-\mathbf{\Pi}_{W}x^{*}\|\leq\rho(V,W)$ and $\|\mathbf{\Pi}_{W}x^{*}\|^{2}\geq 1-(\rho(V,W))^{2}$ . Hence

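$$\begin{aligned}\|x^{*}-y^{*}\|^{2}&\leq(\rho(V,W))^{2}+(1-\|\mathbf{\Pi}_{W}x^{*}\|)^{2}\\&\leq(\rho(V,W))^{2}+\Big(1-\sqrt{1-(\rho(V,W))^{2}}\Big)^{2}\\&=2\Big(1-\sqrt{1-(\rho(V,W))^{2}}\Big).\end{aligned}$$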

Theorem 2.5.

Suppose that $\omega$ satisfies condition (1.5) and that $R_{\infty}$ has a unique minimizer $V_{\infty}^{*}$ in $\mathcal{V}_{p}$ . Then for any minimizer $V_{t}^{*}$ of $R_{t}$ in $\mathcal{V}_{p}$ one has

$$\lim_{t\to\infty}\rho(V_{t}^{*},V_{\infty}^{*})=0.$$

The following lemma plays a crucial role in the proof of Theorem 2.5.

Lemma 2.6.

If $\omega$ satisfies condition (1.5), then for sufficiently large $t_{0}$ , the standardized risks $\tilde{R}_{t}=t^{2(\beta-1)}R_{t}$ , $t\geq t_{0}$ , are equicontinuous w.r.t. $\rho$ .

Proof.

First note that $\big{|}\|\mathbf{\Pi}_{V}^{\perp}\theta(x)\|-\|\mathbf{\Pi}_{W}^{\perp}\theta(x)\|\big{|}\leq\|\mathbf{\Pi}_{V}^{\perp}\theta(x)-\mathbf{\Pi}_{W}^{\perp}\theta(x)\|\leq\|\theta(x)\|\rho(V,W)\leq c_{\omega}\|x\|^{1-\beta}\rho(V,W)$ . Choose $t_{0}$ as in the proof of Lemma 2.2 and recall the definition of $Y_{t}$ given there. Then, by (2.1), for all subspaces $V,W$ of $\mathbb{R}^{d}$

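$$\begin{aligned}\big|\tilde{R}_{t}(V)-\tilde{R}_{t}(W)\big|&\leq t^{2(\beta-1)}P_{t}\Big(\big|\|\mathbf{\Pi}_{V}^{\perp}\theta\|-\|\mathbf{\Pi}_{W}^{\perp}\theta\|\big|\cdot\big(\|\mathbf{\Pi}_{V}^{\perp}\theta\|+\|\mathbf{\Pi}_{W}^{\perp}\theta\|\big)\Big)\\&\leq 2\,t^{2(\beta-1)}P_{t}\|\theta\|^{2}\,\rho(V,W)\\&\leq 2\,c_{\omega}^{2}\operatorname{\mathbb{E}}\|Y_{t}\|^{2(1-\beta)}\,\rho(V,W)\\&\leq 2\,c_{\omega}^{2}\,\frac{4(1-\beta)}{\alpha-2(1-\beta)}\,\rho(V,W),\end{aligned}$$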

which proves the assertion. ∎

Proof of Theorem 2.5.

We first prove that $\mathcal{V}_{p}$ is compact w.r.t. $\rho$ . The assertion then follows by standard arguments using Lemma 2.6.

We have to show that any sequence $(V_{n})_{n\in\mathbb{N}}$ in $\mathcal{V}_{p}$ has a convergent subsequence. For each $n$ , let $(u_{1,n},\ldots,u_{p,n})$ be an orthonormal basis for $V_{n}$ so that $\mathbf{\Pi}_{V_{n}}x=U_{n}U_{n}^{\top}x$ where $U_{n}$ denotes the matrix with columns $u_{j,n}$ . The vectors $(u_{j,n})_{1\leq j\leq p}$ belong to the compact set $(\mathbb{S}^{d-1})^{p}$ . Thus there exists a subsequence $n_{\ell}$ such that $u_{j,n_{\ell}}\to u_{j}^{0}$ for all $1\leq j\leq p$ . Since for all $n$ , $\langle u_{j,n},u_{i,n}\rangle=\delta_{i,j}$ , we also have $\langle u_{j}^{0},u_{i}^{0}\rangle=\delta_{i,j}$ and the $u_{j}^{0},j\leq p$ , form an orthonormal family in $\mathbb{R}^{d}$ . Let $V^{0}$ be the space generated by the $u_{j}^{0}$ ’s and denote by $U_{0}$ the matrix with these columns. Then $V^{0}$ has dimension $p$ , i.e. $V^{0}\in\mathcal{V}_{p}$ , and by construction

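$$\rho(V_{n_{\ell}},V^{0})=\sup_{x\in\mathbb{S}^{d-1}}\big\|(U_{n_{\ell}}U_{n_{\ell}}^{\top}-U_{0}U_{0}^{\top})x\big\|\to 0,$$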

which proves the claimed compactness.

Now assume that the assertion of the theorem were false. By the compactness of $\mathcal{V}_{p}$, there then exists a sequence $t_{n}\to\infty$ such that $V_{t_{n}}^{*}$ converges to some $V_{\infty}\neq V_{\infty}^{*}$. By Lemma 2.6, $|\tilde{R}_{t_{n}}(V_{t_{n}}^{*})-\tilde{R}_{t_{n}}(V_{\infty})|\to 0$, and by Corollary 2.3 $|\tilde{R}_{t_{n}}(V_{\infty})-R_{\infty}(V_{\infty})|\to 0$ and $|\tilde{R}_{t_{n}}(V_{\infty}^{*})-R_{\infty}(V_{\infty}^{*})|\to 0$. Hence, for $\varepsilon:=R_{\infty}(V_{\infty})-R_{\infty}(V_{\infty}^{*})$, which is strictly positive by assumption, and sufficiently large $n$, one arrives at a contradiction:

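$$R_{\infty}(V_{\infty})\leq\tilde{R}_{t_{n}}(V_{\infty})+\frac{\varepsilon}{4}\leq\tilde{R}_{t_{n}}(V_{t_{n}}^{*})+\frac{\varepsilon}{2}\leq\tilde{R}_{t_{n}}(V_{\infty}^{*})+\frac{\varepsilon}{2}\leq R_{\infty}(V_{\infty}^{*})+\frac{3\varepsilon}{4}<R_{\infty}(V_{\infty}).$$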

Therefore, the assertion must be correct. ∎

Under Hypothesis 1, $V_{0}$ is the unique minimizer of $R_{\infty}$ over $\mathcal{V}_{p}$, that is, when the risk is minimized over linear subspaces of the correct dimension, as the following result shows. Hence in this case, $V_{t}^{*}$ converges to $V_{0}$.

Lemma 2.7.

Under Hypothesis 1, for any subspace $V\subset\mathbb{R}^{d}$ of arbitrary dimension one has

$$R_{\infty}(V)=0\quad\Longleftrightarrow\quad V_{0}\subset V.$$

Thus, $V_{0}$ is the unique minimizer of $R_{\infty}$ in $\mathcal{V}_{p}$ , whereas on $\mathcal{V}_{\tilde{p}}$ with $\tilde{p}>p$ the points of minimum of the limit risk $R_{\infty}$ are not unique.

Proof.

If $V_{0}\subset V$ then $V^{\perp}\subset V_{0}^{\perp}$ . By Hypothesis 1, $P_{\infty}$ is concentrated on $V_{0}$ , which implies $R_{\infty}(V)=P_{\infty}\|\mathbf{\Pi}_{V}^{\perp}{\theta}\|^{2}=0$ .

Conversely, if $R_{\infty}(V)=0$ , then $1=P_{\infty}\{\mathbf{\Pi}_{V}^{\perp}\theta=0\}=P_{\infty}(V)$ . By definition of $P_{\infty}$ and the homogeneity of $\mu$ , this means that the support of $\mu$ must be a subset of $V$ and thus $V_{0}\subset V$ . ∎
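The equivalence in Lemma 2.7 can be illustrated numerically. The following is a minimal sketch, assuming a finite set of unit vectors in $V_{0}=\operatorname{span}(e_{1},e_{2})\subset\mathbb{R}^{3}$ as a stand-in for the angular distribution $P_{\infty}$ ; the function name `residual_risk` and the chosen points are hypothetical and serve only as an illustration.

```python
import math

def residual_risk(points, basis):
    """Mean squared distance of the points to span(basis); basis must be orthonormal."""
    total = 0.0
    for x in points:
        proj_sq = sum(sum(xi * bi for xi, bi in zip(x, b)) ** 2 for b in basis)
        total += sum(xi * xi for xi in x) - proj_sq
    return total / len(points)

e1, e2, e3 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)

# Illustrative angular points, all lying in V_0 = span(e1, e2).
angles = [0.1, 0.7, 1.3, 2.9]
support = [(math.cos(a), math.sin(a), 0.0) for a in angles]

risk_v0 = residual_risk(support, [e1, e2])          # V = V_0
risk_bigger = residual_risk(support, [e1, e2, e3])  # V strictly contains V_0
risk_other = residual_risk(support, [e1, e3])       # V does not contain V_0

print(risk_v0, risk_bigger, risk_other)
```

The first two risks vanish up to rounding, which also shows why the minimizer is not unique on $\mathcal{V}_{\tilde{p}}$ with $\tilde{p}>p$ , while the third is strictly positive.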

In the remaining part of this section, we will establish analogous consistency results for the empirical risk $R_{n,k}$ and its minimizer. In what follows, let $F_{\|X\|}$ be the c.d.f. of $\|X\|$ , $F^{\leftarrow}_{\|X\|}$ its generalized inverse (quantile function) and define

[TABLE]

We start with consistency of the standardized empirical risk.

Proposition 2.8.

If $\omega$ satisfies condition (1.5), then $t_{n,k}^{2(\beta-1)}R_{n,k}(V)\to R_{\infty}(V)$ in probability for all linear subspaces $V$ of $\mathbb{R}^{d}$ .

Proof.

For simplicity, we assume that $F_{\|X\|}$ is continuous in the tail (so that there are no ties among the observed norms), but the proof can be easily generalized using standard techniques from the theory of regularly varying functions. First we want to replace the random threshold $\hat{t}_{n,k}$ with $t_{n,k}$ in the definition of $R_{n,k}$ . Observe that by the Hölder inequality

[TABLE]

where $\eta>0$ is chosen such that $(2+\eta)(1-\beta)<\alpha$ .

It is well known that $\hat{t}_{n,k}/t_{n,k}\to 1$ in probability. Thus there exists a sequence $\delta_{n}\downarrow 0$ such that $P\{\hat{t}_{n,k}>t_{n,k}(1-\delta_{n})\}\to 1$ . By (2.1) and the regular variation of $1-F_{\|X\|}$

[TABLE]

In particular, $k^{-1}\sum_{i=1}^{n}t_{n,k}^{(2+\eta)(\beta-1)}\|\Theta_{i}\|^{2+\eta}\bm{1}\{\|X_{i}\|>t_{n,k}\wedge\hat{t}_{n,k}\}$ is stochastically bounded.

Furthermore,

[TABLE]

because there exist exactly $k$ exceedances of $\hat{t}_{n,k}$ , and either all non-vanishing differences of the indicator functions equal 1 or all equal $-1$ , depending on whether $\hat{t}_{n,k}<t_{n,k}$ or $\hat{t}_{n,k}>t_{n,k}$ . Now, the last sum is binomially distributed with parameters $n$ and $k/n$ . By the central limit theorem for triangular arrays, the right hand side is of the stochastic order $k^{1/2}$ .
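The stochastic order $k^{1/2}$ can also be seen from an elementary variance computation, sketched here as a cruder alternative to the central limit theorem. The symbol $N$ is our shorthand (not notation from the text) for the number of exceedances of $t_{n,k}$ , which is binomially distributed with parameters $n$ and $k/n$ and hence has mean $k$ ; the bound is Chebyshev's inequality.

```latex
\operatorname{Var}(N) = n\cdot\frac{k}{n}\Big(1-\frac{k}{n}\Big) = k\Big(1-\frac{k}{n}\Big)\le k,
\qquad
P\big\{|N-k|>M k^{1/2}\big\}\le \frac{\operatorname{Var}(N)}{M^{2}k}\le \frac{1}{M^{2}}
\quad\text{for all } M>0 .
```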

A combination of these results shows that

[TABLE]

uniformly for all subspaces $V$ .

In view of Corollary 2.3, it thus suffices to show that

[TABLE]

in probability, with $d_{n,k}:=(\log k)t_{n,k}$ .

Let $\alpha^{*}:=4(1-\beta)\vee(\alpha+1)$ . By a similar calculation as in the proof of Corollary 2.3

[TABLE]

Similarly as in the proof of Lemma 2.2, we can bound the expectation using integration by parts and Karamata’s theorem:

[TABLE]

for sufficiently large $n$ . Therefore, by the choice of $d_{n,k}$

[TABLE]

which implies the convergence in probability of $T_{n,1}$ .

For the second term, we may conclude from (2.1) and the definition of $t_{n,k}$ that

[TABLE]

Because $t\mapsto t^{2(1-\beta)}(1-F_{\|X\|}(t))$ is regularly varying with index $2(1-\beta)-\alpha<0$ and $t_{n,k}=o(d_{n,k})$ , the right hand side tends to 0. This proves that $T_{n,2}$ converges to 0 in probability, which concludes the proof. ∎

The following result is an analog to Lemma 2.6.

Lemma 2.9.

If $\omega$ satisfies condition (1.5), then for all $\varepsilon>0$ there exists $\delta>0$ such that for sufficiently large $n$

[TABLE]

Proof.

First note that in view of (2.2), it suffices to prove the assertion with $R_{n,k}(V)$ replaced by $k^{-1}\sum_{i=1}^{n}\|\mathbf{\Pi}_{V}^{\perp}\Theta_{i}\|^{2}\bm{1}\{\|X_{i}\|>t_{n,k}\}$ and $R_{n,k}(W)$ replaced by the analogous expression.

Similarly as in the proof of Lemma 2.6, we have

[TABLE]

for $n$ sufficiently large. Hence, by Markov’s inequality, (2.1) and the definition of $t_{n,k}$ ,

[TABLE]

for $\delta:=\varepsilon^{2}(\alpha-2(1-\beta))/\big{(}8(1-\beta)c_{\omega}^{2}\big{)}$ . ∎

We are now ready to prove weak consistency of the empirical risk minimizer.

Theorem 2.10.

If $\omega$ satisfies condition (1.5) and $R_{\infty}$ has a unique minimizer $V_{\infty}^{*}$ in $\mathcal{V}_{p}$ , then for all minimizers $\hat{V}_{n}$ of $R_{n,k}$ in $\mathcal{V}_{p}$ one has $\rho(\hat{V}_{n},V_{\infty}^{*})\to 0$ in probability.

Proof.

Let $\tilde{R}_{n,k}:=t_{n,k}^{2(\beta-1)}R_{n,k}$ . Fix an arbitrary $\varepsilon>0$ and let $\mathcal{M}:=\{W\in\mathcal{V}_{p}\mid\rho(V_{\infty}^{*},W)\geq\varepsilon/2\}$ . By the arguments used in the proof of Lemma 2.6, it is easily seen that $R_{\infty}$ is (Lipschitz) continuous w.r.t. $\rho$ . Moreover, $\mathcal{V}_{p}$ is compact (see proof of Theorem 2.5), and so is $\mathcal{M}$ . Hence, $\eta:=\inf_{W\in\mathcal{M}}R_{\infty}(W)-R_{\infty}(V_{\infty}^{*})>0$ , since the infimum is attained and $V_{\infty}^{*}$ is the unique minimizer of $R_{\infty}$ .

According to Lemma 2.9, there exist $\delta\leq\varepsilon/2$ and $n_{0}$ such that for all $n\geq n_{0}$ with probability greater than $1-\varepsilon/4$ one has

[TABLE]

for all $V,W\in\mathcal{V}_{p}$ such that $\rho(V,W)\leq\delta$ . Since $\mathcal{V}_{p}$ is compact, there exists a finite cover of $\mathcal{V}_{p}$ by open balls with radius $\delta$ and centers $W_{1},\ldots,W_{m}$ , say. By Proposition 2.8, there exists $n_{1}\geq n_{0}$ such that with probability greater than $1-\varepsilon/2$

[TABLE]

Hence, on a set with probability greater than $1-\varepsilon$ , there exists $j\in\{1,\ldots,m\}$ such that $\rho(\hat{V}_{n},W_{j})<\delta\leq\varepsilon/2$ and

[TABLE]

By the definition of $\eta$ , this implies $W_{j}\not\in\mathcal{M}$ and thus

[TABLE]

Since $\varepsilon>0$ was arbitrary, this concludes the proof. ∎
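To make the object of Theorem 2.10 concrete, the following sketch computes an empirical risk minimizer for a toy sample. It assumes $\omega(x)=\|x\|$ (so that $\Theta=X/\|X\|$ ) and the Euclidean norm, and it uses the standard PCA fact that the squared reconstruction error over $\mathcal{V}_{p}$ is minimized by the span of $p$ leading eigenvectors of the empirical second-moment matrix. The eigenvectors are obtained by plain power iteration with deflation, which is our choice for a dependency-free illustration; all function names and the data are hypothetical.

```python
import math

def angular_extremes(sample, k):
    """Theta_i = X_i / ||X_i|| for the k observations with the largest Euclidean norm."""
    ranked = sorted(sample, key=lambda x: sum(c * c for c in x), reverse=True)
    return [[c / math.sqrt(sum(v * v for v in x)) for c in x] for x in ranked[:k]]

def second_moment(points):
    """Empirical second-moment matrix k^{-1} sum_i Theta_i Theta_i^T."""
    d = len(points[0])
    return [[sum(x[i] * x[j] for x in points) / len(points) for j in range(d)]
            for i in range(d)]

def leading_eigenvectors(mat, p, iters=1000):
    """Top-p eigenvectors of a symmetric PSD matrix of rank >= p
    (power iteration with deflation; assumes distinct leading eigenvalues)."""
    d = len(mat)
    a = [row[:] for row in mat]
    basis = []
    for _ in range(p):
        v = [1.0 / (i + 1) for i in range(d)]  # fixed, generic starting vector
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(d)) for i in range(d)]
            for b in basis:  # re-orthogonalise against vectors already found
                c = sum(wi * bi for wi, bi in zip(w, b))
                w = [wi - c * bi for wi, bi in zip(w, b)]
            norm = math.sqrt(sum(wi * wi for wi in w))
            v = [wi / norm for wi in w]
        lam = sum(v[i] * a[i][j] * v[j] for i in range(d) for j in range(d))
        basis.append(v)
        # Deflate: remove the component just found.
        a = [[a[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return basis

def empirical_risk(points, basis):
    """Mean squared distance of the angular points to span(basis)."""
    return sum(
        sum(c * c for c in x)
        - sum(sum(xi * bi for xi, bi in zip(x, b)) ** 2 for b in basis)
        for x in points
    ) / len(points)

# Toy data: the four largest observations lie in span(e1, e2), the rest do not.
sample = [(10.0, 0.0, 0.0), (0.0, 8.0, 0.0), (6.0, 6.0, 0.0), (7.0, -2.0, 0.0),
          (0.5, 0.5, 1.0), (0.2, 0.0, 0.9), (0.0, 0.3, 1.1)]
thetas = angular_extremes(sample, k=4)
v_hat = leading_eigenvectors(second_moment(thetas), p=2)
risk = empirical_risk(thetas, v_hat)
print(v_hat, risk)
```

In this toy example the estimated basis has no component along the third coordinate and the empirical risk vanishes up to rounding. For real data a library eigensolver would be the more robust choice.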

So far, we have proved weak consistency of both the standardized empirical risk and the empirical risk minimizer under mild assumptions on the scaling function $\omega$ . However, the rates of convergence may be arbitrarily slow. As condition (1.5) does not guarantee any finite moments of $\Theta$ of order greater than 1, it will not suffice to establish useful risk bounds. In the next section, we therefore analyze the recovery risk under the stronger assumption that $\Theta$ is bounded.

3 Uniform risk bounds

Since a minimizer $\hat{V}_{t}$ of the empirical risk $\hat{R}_{t}$ (or $\hat{V}_{n}$ of $R_{n,k}$ ) differs from the minimizer of the true risk $R_{t}$ , usually the so-called excess risk $R_{t}(\hat{V}_{t})-\inf_{V\in\mathcal{V}_{p}}R_{t}(V)$ will be strictly positive. We follow the common approach in the theory of risk minimization to bound the excess risk by deriving uniform bounds on $|\hat{R}_{t}(V)-R_{t}(V)|$ which hold with high probability for a fixed sample size $n$ . If these uniform bounds can be calculated from the observed data, they may also be used to construct confidence intervals for the reconstruction error $R_{t}(\hat{V}_{t})$ resp. $R_{t_{n,k}}(\hat{V}_{n})$ .
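The step from a uniform deviation bound to a bound on the excess risk is the usual argument from empirical risk minimization, sketched below for an arbitrary $V\in\mathcal{V}_{p}$ . The middle difference is non-positive because $\hat{V}_{t}$ minimizes $\hat{R}_{t}$ ; taking the infimum over $V$ on the left hand side then bounds the excess risk by twice the uniform deviation.

```latex
R_{t}(\hat{V}_{t})-R_{t}(V)
 = \big(R_{t}(\hat{V}_{t})-\hat{R}_{t}(\hat{V}_{t})\big)
 + \big(\hat{R}_{t}(\hat{V}_{t})-\hat{R}_{t}(V)\big)
 + \big(\hat{R}_{t}(V)-R_{t}(V)\big)
 \le 2\sup_{W\in\mathcal{V}_{p}}\big|\hat{R}_{t}(W)-R_{t}(W)\big| .
```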

Since tight concentration inequalities are available only for subgaussian distributions, in this section we will assume that the scaling function $\omega$ satisfies the following condition:

[TABLE]

so that $\|\theta(x)\|\leq 1$ for all $x\in\mathbb{R}^{d}$ . Moreover, we suppose that the c.d.f. of $\|X\|$ is continuous in the tail to avoid technicalities. Then we may assume w.l.o.g. that there are no ties and thus exactly $k$ observations with norm larger than $\hat{t}_{n,k}$ .

For classical PCA (and a kernel version thereof), Shawe-Taylor et al. (2005) established uniform risk bounds for bounded random vectors $Z_{i}$ , which were improved by the following result of Blanchard et al. (2007). Assume $\|Z_{i}\|\leq 1$ , denote the empirical matrix of second (mixed) moments by $\hat{\Sigma}_{n}$ and the Hilbert-Schmidt norm on the space of matrices by $\|\,\cdot\,\|_{HS}$ . Then, with probability greater than $1-\delta$

[TABLE]

for all $V\in\mathcal{V}_{p}$ . One may try to derive uniform risk bounds in our extreme value setting by applying this result to the random variables $Z_{i}=\Theta_{i,t}=\Theta_{i}\bm{1}\{\|X_{i}\|>t\}$ , so that the left hand side is approximately equal to $\pi_{t}|\hat{R}_{t}(V)-R_{t}(V)|$ with

[TABLE]

if one ignores the difference between $N_{t}$ and its expectation $n\pi_{t}$ . In the case $\pi_{t}=o(n^{-1/2})$ , however, the above upper bound will not even converge to 0 when it is divided by $\pi_{t}$ because of the second term. Hence this direct approach does not give meaningful bounds for $|\hat{R}_{t}(V)-R_{t}(V)|$ .
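A quick order-of-magnitude calculation shows why the direct approach fails. The sketch assumes only that the second term of the bound is of order $n^{-1/2}$ with constants dropped (an assumption on our part, since the exact term is not reproduced here), and takes $\pi_{t}=k/n$ ; the function name is hypothetical.

```python
def relative_deviation_order(n, k):
    """Order of an n^{-1/2} deviation term after division by pi_t = k/n (constants dropped)."""
    pi_t = k / n
    return n ** -0.5 / pi_t

# With k fixed, pi_t = k/n = o(n^{-1/2}) and the rescaled term grows like sqrt(n)/k.
for n, k in [(10**4, 100), (10**6, 100), (10**8, 100)]:
    print(n, k, relative_deviation_order(n, k))
```

The rescaled term equals $n^{1/2}/k$ , so it stays bounded only if $k$ grows at least as fast as $n^{1/2}$ .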

The reason for this inconsistency is that, unlike in the classical setting, most of the random variables $Z_{i}$ will vanish as $t$ increases, and the concentration inequalities used in the proofs of the aforementioned bounds are too crude in such a situation. However, we will take up ideas used by Blanchard et al. (2007), with appropriate modifications, to derive much tighter uniform bounds on $|R_{n,k}(V)-R_{t_{n,k}}(V)|$ . Furthermore, we will derive uniform bounds on $|\hat{R}_{t}(V)-R_{t}(V)|$ which hold conditionally on $N_{t}=\ell$ and depend only on the data. These can then be used to construct confidence bands for $R_{t}(V)$ .

Before we establish these bounds, we first recall some well-known facts about Hilbert spaces specialized to the present setting, and introduce some notation. Let $(e_{i})_{1\leq i\leq d}$ be an arbitrary orthonormal basis of $\mathbb{R}^{d}$ and denote by $\langle\,\cdot\,,\,\cdot\,\rangle$ the usual inner product on $\mathbb{R}^{d}$ . The space of linear operators from $\mathbb{R}^{d}$ to $\mathbb{R}^{d}$ (i.e., $d\times d$ -matrices) equipped with the inner product $\langle A,B\rangle_{HS}:=\sum_{i=1}^{d}\langle Ae_{i},Be_{i}\rangle$ is a Hilbert space. The corresponding Hilbert Schmidt norm can be expressed as $\|A\|_{HS}=\big{(}\sum_{i=1}^{d}\|Ae_{i}\|^{2}\big{)}^{1/2}=\big{(}\operatorname{tr}(AA^{\top})\big{)}^{1/2}$ with $\operatorname{tr}$ denoting the trace operator. If, for any subspace $W$ of $\mathbb{R}^{d}$ , one chooses the first $\dim W$ vectors $e_{i}$ to form an orthonormal basis of $W$ , then one sees that

$$\|\mathbf{\Pi}_{W}\|_{HS}=\sqrt{\dim W}.\tag{3.2}$$

Moreover, direct calculations show that

[TABLE]

Finally, for independent centered random matrices $A_{i}$ , $1\leq i\leq n$ , one has

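$$\operatorname{\mathbb{E}}\Big\|\sum_{i=1}^{n}A_{i}\Big\|_{HS}^{2}=\sum_{i=1}^{n}\operatorname{\mathbb{E}}\|A_{i}\|_{HS}^{2}.\tag{3.4}$$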
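The Hilbert-Schmidt facts used in this section can be checked numerically in a small example. The sketch below verifies, for a two-dimensional subspace $W\subset\mathbb{R}^{3}$ and one vector $x$ , that $\|\mathbf{\Pi}_{W}\|_{HS}^{2}=\dim W$ , that $\langle\mathbf{\Pi}_{W},xx^{\top}\rangle_{HS}=\|\mathbf{\Pi}_{W}x\|^{2}$ and that $\|xx^{\top}\|_{HS}=\|x\|^{2}$ . The helper names and the chosen $W$ and $x$ are hypothetical.

```python
import math

def outer(x, y):
    """Matrix x y^T as a list of rows."""
    return [[xi * yj for yj in y] for xi in x]

def hs_inner(a, b):
    """<A, B>_HS = sum_i <A e_i, B e_i>, i.e. the entrywise inner product."""
    return sum(aij * bij for ra, rb in zip(a, b) for aij, bij in zip(ra, rb))

def projection(basis):
    """Orthogonal projection onto span(basis) for an orthonormal basis."""
    d = len(basis[0])
    proj = [[0.0] * d for _ in range(d)]
    for b in basis:
        for i in range(d):
            for j in range(d):
                proj[i][j] += b[i] * b[j]
    return proj

r = 1.0 / math.sqrt(2.0)
pi_w = projection([(1.0, 0.0, 0.0), (0.0, r, r)])   # dim W = 2
x = (1.0, 2.0, 3.0)
xxt = outer(x, x)
proj_x = [sum(pi_w[i][j] * x[j] for j in range(3)) for i in range(3)]

dim_from_norm = hs_inner(pi_w, pi_w)            # ||Pi_W||_HS^2
quad_form = hs_inner(pi_w, xxt)                 # <Pi_W, x x^T>_HS
proj_norm_sq = sum(c * c for c in proj_x)       # ||Pi_W x||^2
rank_one_norm = math.sqrt(hs_inner(xxt, xxt))   # ||x x^T||_HS

print(dim_from_norm, quad_form, proj_norm_sq, rank_one_norm)
```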

If, for the time being, one neglects the difference between the empirical and the true $(1-k/n)$ -quantile of $\|X\|$ , then $R_{n,k}(V)$ can be approximated by $\bar{R}_{t_{n,k}}(V)$ where

[TABLE]

Denote the empirical distribution of the observed random vectors $X_{i}$ , $1\leq i\leq n$ , by $P_{n}$ . For any threshold $t>0$ , the maximal difference between the approximate empirical risk $\bar{R}_{t}(V)$ and the true risk $R_{t}(V)$ can be rewritten as

[TABLE]

with

[TABLE]

For brevity’s sake, in what follows we use the notation $x_{i:j}:=(x_{i},\ldots,x_{j})$ for a subvector of $(x_{1},\ldots,x_{n})$ .

In order to derive uniform bounds on the difference between the empirical and the true risk, we first establish concentration inequalities for $\varphi_{t}^{\pm}(X_{1},\ldots,X_{n})$ using a version of the bounded difference inequality by McDiarmid (1998, Theorem 3.8), which we recall for convenience.

Theorem 3.1.

Let $X_{1:n}=(X_{1},\ldots,X_{n})$ be an $\mathit{i.i.d.}$ sample taking its values in some space $E$ and $\varphi:{E}^{n}\to\mathbb{R}$ be any measurable function. Consider the positive deviation functions, defined for $1\leq m\leq n$ and for ${x}_{1:m}\in{E}^{m}$ ,

[TABLE]

Denote their maximum by

[TABLE]

and the maximal summed variance by

[TABLE]

If both $\textrm{maxdev}^{+}$ and $\hat{v}$ are finite, then for all $u\geq 0$ ,

[TABLE]

Lemma 3.2.

For all $u>0$ ,

[TABLE]

Proof.

The assertion follows immediately from Theorem 3.1 applied to $\varphi_{t}^{\pm}$ and the following bounds:

[TABLE]

and

[TABLE]

∎

The expectation $\operatorname{\mathbb{E}}\varphi_{t}^{\pm}(X_{1},\ldots,X_{n})$ can be bounded using arguments from Blanchard et al. (2007).

Lemma 3.3.

[TABLE]

with $\Sigma_{t}:=\operatorname{\mathbb{E}}_{t}\Theta\Theta^{\top}$ .

Proof.

Since, by (3.3), $\|\mathbf{\Pi}_{W}x\|^{2}=\langle\mathbf{\Pi}_{W}x,x\rangle=\langle\mathbf{\Pi}_{W},xx^{\top}\rangle_{HS}$ for any linear subspace $W$ and any $x\in\mathbb{R}^{d}$ , using the bilinearity of the inner product and the Cauchy-Schwarz inequality in the Hilbert-Schmidt space, we obtain

[TABLE]

Using (3.2) and taking the supremum over all $V\in\mathcal{V}_{p}$ and the expectation, one arrives at

[TABLE]

On the other hand, by first rewriting $\|\mathbf{\Pi}_{V}^{\perp}\theta_{t}\|^{2}=\|\theta_{t}\|^{2}-\|\mathbf{\Pi}_{V}\theta_{t}\|^{2}$ , analogously one obtains

[TABLE]

Now, by the Cauchy-Schwarz inequality and (3.4),

[TABLE]

Combining this with (3.8) and (3.9), we arrive at

[TABLE]

It remains to show that $\operatorname{\mathbb{E}}\|\Theta_{t}\Theta_{t}^{\top}-\operatorname{\mathbb{E}}\Theta_{t}\Theta_{t}^{\top}\|^{2}_{HS}=\pi_{t}\big{(}\operatorname{\mathbb{E}}_{t}\|\Theta\|^{4}-\pi_{t}\operatorname{tr}(\Sigma_{t}^{2})\big{)}$ . From the representation of the Hilbert Schmidt norm by the trace operator and the linearity of the latter, one may conclude by direct calculations that

[TABLE]

Hence the assertion follows from

[TABLE]

with $e_{j}$ denoting the $j$ th unit vector. ∎
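For orientation, the variance identity at the end of this proof can be traced in three steps. The sketch below assumes $\Theta_{t}=\Theta\bm{1}\{\|X\|>t\}$ , $\pi_{t}=P\{\|X\|>t\}$ and that $\operatorname{\mathbb{E}}_{t}$ denotes expectation conditional on $\|X\|>t$ , and writes $A:=\Theta_{t}\Theta_{t}^{\top}$ as our shorthand.

```latex
\begin{aligned}
\operatorname{\mathbb{E}}\|A-\operatorname{\mathbb{E}}A\|_{HS}^{2}
  &= \operatorname{\mathbb{E}}\|A\|_{HS}^{2}-\|\operatorname{\mathbb{E}}A\|_{HS}^{2},\\
\operatorname{\mathbb{E}}\|A\|_{HS}^{2}
  &= \operatorname{\mathbb{E}}\operatorname{tr}\big(\Theta_{t}\Theta_{t}^{\top}\Theta_{t}\Theta_{t}^{\top}\big)
   = \operatorname{\mathbb{E}}\big(\|\Theta\|^{4}\bm{1}\{\|X\|>t\}\big)
   = \pi_{t}\operatorname{\mathbb{E}}_{t}\|\Theta\|^{4},\\
\|\operatorname{\mathbb{E}}A\|_{HS}^{2}
  &= \|\pi_{t}\Sigma_{t}\|_{HS}^{2}
   = \pi_{t}^{2}\operatorname{tr}(\Sigma_{t}^{2}),
\end{aligned}
\qquad\text{so}\qquad
\operatorname{\mathbb{E}}\|A-\operatorname{\mathbb{E}}A\|_{HS}^{2}
 = \pi_{t}\big(\operatorname{\mathbb{E}}_{t}\|\Theta\|^{4}-\pi_{t}\operatorname{tr}(\Sigma_{t}^{2})\big).
```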

Now we are ready to state our first uniform risk bound. Recall that $\Sigma_{t}:=\operatorname{\mathbb{E}}_{t}(\Theta\Theta^{\top})$ .

Theorem 3.4.

For all $u,v>0$ one has

[TABLE]

with $S_{t}:=\operatorname{\mathbb{E}}_{t}\|\Theta\|^{4}-\pi_{t}\operatorname{tr}(\Sigma_{t}^{2})$ .

In particular, with probability greater than or equal to $1-\delta$

[TABLE]

Note that (3.11) also implies an upper bound on the excess risk:

[TABLE]

Proof.

With $\bar{R}_{t}(V)$ defined in (3.5), we have

[TABLE]

By similar arguments as in the proof of Proposition 2.8, we see that, for all $V\in\mathcal{V}_{p}$ ,

[TABLE]

By Bernstein’s inequality, it follows that

[TABLE]

For the second term $\sup_{V\in\mathcal{V}_{p}}|\bar{R}_{t_{n,k}}(V)-R_{t_{n,k}}(V)|=\frac{n}{k}\max(\varphi^{+}_{t_{n,k}}(X_{1:n}),\varphi^{-}_{t_{n,k}}(X_{1:n}))$ , Lemma 3.2, Lemma 3.3 and $\pi_{t_{n,k}}=k/n$ immediately yield

[TABLE]

which concludes the proof of the first assertion.

Check that for

[TABLE]

both exponential expressions on the right hand side of (3.10) equal $\delta/4$ , and so the upper bound equals $\delta$ . Hence the remaining assertions follow from $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ . ∎

Remark 3.5.

In the case $\omega(x)=\|x\|$ , the upper bound in (3.11) simplifies to

[TABLE]

Note that the upper bound in Theorem 3.4 cannot be calculated from the data and thus cannot be used directly to construct confidence intervals for the true reconstruction error $R_{t_{n,k}}(\hat{V}_{n})$ or the minimal reconstruction error $\inf_{V\in\mathcal{V}_{p}}R_{t_{n,k}}(V)$ . Next, we derive data-dependent bounds directly from (a minor improvement of) the bound established by Blanchard et al. (2007). However, this result will be applied to the conditional distribution of $\Theta$ given $\|X\|>t$ and the resulting bound is to be interpreted conditional on the number $N_{t}$ of exceedances over the chosen threshold $t$ .

Theorem 3.6.

For all $\ell>1,u,v>0$ ,

[TABLE]

with $\tilde{S}_{t}:=N_{t}^{-1}\sum_{i=1}^{n}\|\Theta_{i,t}\|^{4}-\operatorname{tr}\Big{(}(N_{t}^{-1}\sum_{i=1}^{n}\Theta_{i,t}\Theta_{i,t}^{\top})^{2}\Big{)}$ and $\lfloor{x}\rfloor:=\max\{k\in\mathbb{Z}\mid k\leq x\}$ .

If, for all $\ell>1$ , constants $u_{\ell},v_{\ell}>0$ are chosen such that $2\exp\big{(}-2\ell u_{\ell}^{2}\big{)}+\exp\big{(}-\lfloor{\ell/2}\rfloor v_{\ell}^{2}/2\big{)}=1-\alpha$ , then $I_{\ell}(V):=\big{[}\hat{R}_{t}(V)-B_{t,\ell},\hat{R}_{t}(V)+B_{t,\ell}\big{]}\cap[0,\infty)$ with

[TABLE]

defines a uniform level $\alpha$ confidence band for $R_{t}(V)$ , $V\in\mathcal{V}_{p}$ , conditionally on $N_{t}=\ell$ . If one defines $I_{0}(V)=I_{1}(V)=[0,\infty)$ , then $I_{N_{t}}(V)$ defines a uniform level $\alpha$ confidence band for $R_{t}(V)$ , $V\in\mathcal{V}_{p}$ (unconditionally).

Proof.

Define $\mathit{i.i.d.}$ random vectors $Z_{i}$ whose distribution equals the conditional distribution of $\Theta$ given $\|X\|>t$ . Recall that $\Theta_{(i)}:=\theta(X_{(i)})$ where $X_{(i)}$ is the vector $X_{j}$ with the $i$ th largest norm among $X_{1},\ldots,X_{n}$ . Then, conditionally on $N_{t}=\ell$ , the joint distribution of the empirical risk $\hat{R}_{t}(V)$ and $\Theta_{(1)},\ldots,\Theta_{(\ell)}$ equals the joint distribution of $\ell^{-1}\sum_{i=1}^{\ell}\|\mathbf{\Pi}_{V}^{\perp}Z_{i}\|^{2}$ and the order statistics of $Z_{1},\ldots,Z_{\ell}$ . Therefore, the proof of Theorem 3.1 of Blanchard et al. (2007) (with $M=1$ and $L=2$ ), combined with arguments given in the proof of Lemma 3.3, shows that

[TABLE]

Since the proof of Theorem 3.1 of Blanchard et al. (2007) is quite tersely formulated in a more abstract setting and contains a minor inaccuracy, for convenience we give more details of the proof of (3.12) in the Appendix.

In the same way as in the proof of Lemma 3.3, the first assertion thus follows from

[TABLE]

where in the last step we have used that, on $\{N_{t}=\ell\}$ , the set of non-vanishing vectors $\Theta_{i,t}$ equals the set of non-vanishing random vectors $\Theta_{(i)}$ .

The remaining assertions are now obvious. ∎
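The data-dependent quantity $\tilde{S}_{t}$ from Theorem 3.6 is straightforward to compute from the rescaled exceedances. The following sketch takes the non-vanishing vectors $\Theta_{i,t}$ as input, so that their number plays the role of $N_{t}$ ; the function name `s_tilde` and the toy inputs are hypothetical.

```python
def s_tilde(thetas):
    """S~_t = N^{-1} sum ||Theta_i||^4 - tr((N^{-1} sum Theta_i Theta_i^T)^2),
    where thetas holds the N non-vanishing rescaled exceedances."""
    n_t = len(thetas)
    d = len(thetas[0])
    fourth_moment = sum(sum(c * c for c in th) ** 2 for th in thetas) / n_t
    sigma = [[sum(th[i] * th[j] for th in thetas) / n_t for j in range(d)]
             for i in range(d)]
    trace_sq = sum(sigma[i][j] * sigma[j][i] for i in range(d) for j in range(d))
    return fourth_moment - trace_sq

# Two orthogonal unit directions: the second-moment matrix is I/2, so S~ = 1 - 1/2.
spread = s_tilde([(1.0, 0.0), (0.0, 1.0)])
# All exceedances in one direction: the two terms cancel and S~ = 0.
concentrated = s_tilde([(1.0, 0.0)] * 3)
print(spread, concentrated)
```

For unit vectors ( $\omega(x)=\|x\|$ ) the first term equals 1, so $\tilde{S}_{t}$ is small exactly when the empirical second-moment matrix is close to a rank-one projection.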

Remark 3.7.

In the statement about the confidence bands one may replace $B_{t,\ell}$ with

[TABLE]

This half width of a confidence band is more suitable for (numerical) minimization (as a function of $u_{\ell}$ and $v_{\ell}$ ) under the constraint $2\exp\big{(}-2\ell u_{\ell}^{2}\big{)}+\exp(-\lfloor{\ell/2}\rfloor v_{\ell}^{2}/2)=1-\alpha$ . ∎
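A feasible pair $(u_{\ell},v_{\ell})$ for this constraint can be written down in closed form by splitting the error budget $1-\alpha$ equally between the two exponential terms. This is a minimal sketch of one admissible choice, useful for instance as a starting point for a numerical search; it is not claimed to minimize the half width, and the function names are hypothetical. Here $\alpha$ is the confidence level of Theorem 3.6, not the tail index.

```python
import math

def equal_split_constants(ell, alpha):
    """One feasible (u, v): each exponential term receives half of 1 - alpha."""
    budget = 1.0 - alpha
    half = ell // 2                      # floor(ell / 2), at least 1 for ell > 1
    u = math.sqrt(math.log(4.0 / budget) / (2.0 * ell))
    v = math.sqrt(2.0 * math.log(2.0 / budget) / half)
    return u, v

def constraint_lhs(ell, u, v):
    """Left hand side 2 exp(-2 ell u^2) + exp(-floor(ell/2) v^2 / 2)."""
    return 2.0 * math.exp(-2.0 * ell * u * u) + math.exp(-(ell // 2) * v * v / 2.0)

u, v = equal_split_constants(ell=50, alpha=0.95)
print(u, v, constraint_lhs(50, u, v))
```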

Remark 3.8.

The (modified) proof of Theorem 3.1 of Blanchard et al. (2007) also shows that

[TABLE]

with $S_{t}^{*}:=\operatorname{\mathbb{E}}_{t}\|\Theta\|^{4}-\operatorname{tr}(\Sigma_{t}^{2})$ . Observe that $\bar{R}_{t}(V)=N_{t}\hat{R}_{t}(V)/(n\pi_{t})$ . On the set $M_{t}(v):=\{|N_{t}-n\pi_{t}|\leq n\pi_{t}v\}$ , one thus has

[TABLE]

since $\hat{R}_{t}(V)\leq 1$ . Moreover, for $t=t_{n,k}$ , it has been shown in the proof of Theorem 3.4 that $\sup_{V\in\mathcal{V}_{p}}|R_{n,k}(V)-\bar{R}_{t_{n,k}}(V)|\leq v$ on the set $M_{t_{n,k}}=\{N_{t_{n,k}}\in[k(1-v),k(1+v)]\}$ and that $\mathbb{P}(M_{t_{n,k}}^{c})\leq 2\exp\big{(}-kv^{2}/(2(1+v/3))\big{)}$ . Hence,

[TABLE]

A comparison with Theorem 3.4 reveals that the new bound may be tighter if $S_{t_{n,k}}^{*}$ is substantially smaller than $S_{t_{n,k}}$ . This will be the case if $k/n$ is small and $\operatorname{tr}\big{(}(\operatorname{\mathbb{E}}_{t}\Theta\Theta^{\top})^{2}\big{)}$ is not much smaller than $\operatorname{\mathbb{E}}_{t}\|\Theta\|^{4}$ . ∎

So far, we have compared empirical risks with the true risk $R_{t}$ for finite thresholds $t$ . A comparison with the limit risk $R_{\infty}$ would require second order refinements of our basic assumption (1.1). Let $\Sigma_{t}:=\operatorname{\mathbb{E}}_{t}\Theta\Theta^{\top}=P_{t}(\theta\theta^{\top})$ and $\Sigma_{\infty}=P_{\infty}(\theta\theta^{\top})$ . Denote the eigenvalues of $\Sigma_{t}-\Sigma_{\infty}$ by $\lambda_{t,1}^{\Delta}\geq\lambda_{t,2}^{\Delta}\geq\ldots\geq\lambda_{t,d}^{\Delta}$ . Then standard calculations from classical PCA show that

[TABLE]

where the second supremum is taken over all $(d\times(d-p))$ -matrices with orthogonal columns. Likewise, $\sup_{V\in\mathcal{V}_{p}}R_{\infty}(V)-R_{t}(V)=-\sum_{i=1}^{d-p}\lambda^{\Delta}_{t,d+1-i}$ and hence

[TABLE]

Therefore, bounds on the difference between empirical risks and the limit risk require additional assumptions on the spectrum of the difference $\Sigma_{t}-\Sigma_{\infty}$ between the matrix of second moments for the re-scaled exceedances over the threshold $t$ and the corresponding matrix in the limit model.

If one merely wants to compare the minimum risk for finite thresholds with the minimum limit risk, which equal the sums of $d-p$ smallest eigenvalues of $\Sigma_{t}$ resp. $\Sigma_{\infty}$ , then somewhat weaker assumptions on the convergence of the spectrum of $\Sigma_{t}$ and $\Sigma_{\infty}$ are needed. In particular, under Hypothesis 1, $\inf_{V\in\mathcal{V}_{p}}R_{t}(V)-\inf_{V\in\mathcal{V}_{p}}R_{\infty}(V)$ equals the sum of the smallest $d-p$ eigenvalues of $\Sigma_{t}$ .
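The eigenvalue representation of the minimum risk follows from a short trace computation. The sketch assumes $R_{t}(V)=\operatorname{\mathbb{E}}_{t}\|\mathbf{\Pi}_{V}^{\perp}\Theta\|^{2}$ , in line with the expression for $R_{\infty}$ used in the proof of Lemma 2.7, and writes $\lambda_{1}(\Sigma_{t})\geq\ldots\geq\lambda_{d}(\Sigma_{t})$ for the ordered eigenvalues of $\Sigma_{t}$ (our notation); the last step is the Ky Fan maximum principle.

```latex
R_{t}(V)=\operatorname{tr}\big(\mathbf{\Pi}_{V}^{\perp}\Sigma_{t}\big)
        =\operatorname{tr}(\Sigma_{t})-\operatorname{tr}\big(\mathbf{\Pi}_{V}\Sigma_{t}\big),
\qquad
\inf_{V\in\mathcal{V}_{p}}R_{t}(V)
   =\operatorname{tr}(\Sigma_{t})-\sum_{i=1}^{p}\lambda_{i}(\Sigma_{t})
   =\sum_{i=p+1}^{d}\lambda_{i}(\Sigma_{t}).
```

Under Hypothesis 1 the matrix $\Sigma_{\infty}$ has rank at most $p$ , so the same formula gives $\inf_{V\in\mathcal{V}_{p}}R_{\infty}(V)=0$ .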

4 Simulation study

We investigate the performance of our PCA procedure. In particular, we examine how the standard non-parametric estimator of the spectral measure (defined via (1.2)) based on the $k$ largest observations

[TABLE]

(with $\theta(x)=x/\|x\|$ ) is influenced if the data is first projected onto a lower dimensional subspace using PCA:

[TABLE]

Here, $\delta_{y}$ is the Dirac measure with point mass at $y$ and $V$ denotes the subspace picked by PCA based on the same number $k$ of largest observations. It will turn out that sometimes it is advisable to use a smaller number $\tilde{k}$ for the PCA procedure; the resulting estimator of the spectral measure will be denoted by $\hat{H}_{n,k,\tilde{k}}^{PCA}$ .

To measure the performance of the spectral estimators, we consider the resulting estimators of the following probabilities in the limit model, that can be expressed in terms of the spectral measure:

(i)

$\lim_{u\to\infty}\mathbb{P}(p^{-1}\sum_{1\leq j\leq p}X^{j}/\|X\|>t_{(i)}\mid\|X\|>u)=H\{x\mid p^{-1}\sum_{j=1}^{p}x^{j}>t_{(i)}\}$ for some $t_{(i)}\in(0,p^{-1/2})$

(ii)

$\lim_{u\to\infty}\mathbb{P}(\min_{1\leq j\leq p}X^{j}>u,\max_{p+1\leq j\leq d}X^{j}\leq u\mid\|X\|>u)=$

$\int\big{(}(\min_{1\leq j\leq p}x^{j})^{\alpha}-(\max_{p+1\leq j\leq d}x^{j})^{\alpha}\big{)}^{+}\,H(dx)$

(iii)

$\lim_{u\to\infty}\mathbb{P}(X^{1}>u\mid\max_{1\leq j\leq d}X^{j}>u)=\int(x^{1})^{\alpha}\,H(dx)/\int(\max_{1\leq j\leq p}x^{j})^{\alpha}\,H(dx)$

(iv)

$\lim_{u\to\infty}\mathbb{P}(\min_{1\leq j\leq d}X^{j}>u\mid\|X\|>u)=\int(\min_{1\leq j\leq d}x^{j})^{\alpha}\,H(dx)$

The first probability is related to the cdf of the mean contribution of the first $p$ coordinates to the norm of the random vector, thus quantifying, in some sense, how strongly the norm is spread over the coordinates. Probability (ii) indicates how likely it is that the first $p$ components are all large, while this is not true for any of the other components, given that the norm of the vector is large. Probability (iii) specifies how likely it is that the first component is extreme, given that any component is extreme. In a financial context, such probabilities are used to quantify how strongly a specific market participant is exposed to a failure of any market participant. Finally, probability (iv) specifies the minimal contribution of any coordinate to the norm. Note that under Hypothesis 1 this probability equals 0. The other true values are determined by Monte Carlo simulations with sample size of at least $10^{7}$ , unless they can be easily calculated analytically; the approximation error is always smaller than $10^{-3}$ . Throughout, we assume $\alpha$ to be known since we are interested in the effect of the PCA procedure on the estimator of the spectral measure, which should not be compounded with the estimation error of the tail index.
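Plug-in estimators of such functionals simply average over the angular parts of the $k$ largest observations. The sketch below assumes that the standard estimator puts mass $1/k$ on each $\Theta_{(i)}=X_{(i)}/\|X_{(i)}\|$ with the Euclidean norm, and illustrates probabilities (i) and (iv); the function names and the toy sample are hypothetical, and the positive part in `prob_iv` is our addition to keep the power well defined.

```python
import math

def angular_points(sample, k):
    """Theta = X / ||X|| for the k observations with the largest Euclidean norm."""
    ranked = sorted(sample, key=lambda x: math.sqrt(sum(c * c for c in x)),
                    reverse=True)
    out = []
    for x in ranked[:k]:
        r = math.sqrt(sum(c * c for c in x))
        out.append(tuple(c / r for c in x))
    return out

def prob_i(thetas, p, threshold):
    """Plug-in estimate of H{x : p^{-1} sum_{j<=p} x^j > threshold}."""
    hits = sum(1 for th in thetas if sum(th[:p]) / p > threshold)
    return hits / len(thetas)

def prob_iv(thetas, alpha):
    """Plug-in estimate of the integral of (min_j x^j)^alpha with respect to H."""
    return sum(max(min(th), 0.0) ** alpha for th in thetas) / len(thetas)

sample = [(3.0, 4.0, 0.0), (5.0, 0.0, 0.0), (0.1, 0.1, 0.1), (6.0, 8.0, 0.0)]
thetas = angular_points(sample, k=3)
est_i = prob_i(thetas, p=2, threshold=0.65)
est_iv = prob_iv(thetas, alpha=1.0)
print(est_i, est_iv)
```

The PCA-based variant would apply the same averages to angular points computed after projecting the observations onto the estimated subspace.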

We consider different models of $d$ -dimensional regularly varying vectors for which the spectral measure is (approximately) concentrated on a $p$ -dimensional subspace. Since PCA is equivariant under rotations, w.l.o.g. we may assume that this subspace is spanned by the first $p$ unit vectors.

Two different models for the extreme value dependence structure between the first $p$ coordinates of the vector are investigated. First, we consider the so-called Dirichlet model; see, for instance, Segers (2012), Ex. 3.6, where also a simple algorithm is given to simulate vectors with such an extremal dependence structure. Second, we simulate random vectors with a Gumbel copula $C_{\vartheta}(x)=\exp\big{(}-\big{(}\sum_{i=1}^{p}(-\log x_{i})^{\vartheta}\big{)}^{1/\vartheta}\big{)}$ , using the transformation method proposed by Stephenson (2003). The marginal distributions are chosen as a Fréchet distribution with cdf $\exp(-x^{-\alpha})$ , $\alpha\in\{1,2\}$ .
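Fréchet margins with cdf $\exp(-x^{-\alpha})$ can be generated by inverse transform sampling, since the quantile function is $u\mapsto(-\log u)^{-1/\alpha}$ . The sketch below covers only this marginal transformation, not the Dirichlet or Gumbel dependence structure; the function names are hypothetical.

```python
import math
import random

def frechet_quantile(u, alpha):
    """Inverse of the Frechet cdf exp(-x^{-alpha}) for u in (0, 1)."""
    return (-math.log(u)) ** (-1.0 / alpha)

def frechet_cdf(x, alpha):
    """Frechet cdf exp(-x^{-alpha}) for x > 0."""
    return math.exp(-x ** (-alpha))

def frechet_sample(n, alpha, rng):
    """n independent Frechet(alpha) draws by inverse transform sampling."""
    draws = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:          # guard against log(0)
            u = rng.random()
        draws.append(frechet_quantile(u, alpha))
    return draws

rng = random.Random(0)
xs = frechet_sample(5, alpha=2.0, rng=rng)
round_trip = frechet_cdf(frechet_quantile(0.3, 2.0), 2.0)
print(xs, round_trip)
```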

In addition, we have simulated observations from a Dirichlet model which are then rotated in the plane spanned by two randomly chosen coordinates, one of them among the first $p$ coordinates, the other among the last $d-p$ . The rotation angle is uniformly distributed on the interval $[-\pi/10,\pi/10]$ . Note that, unlike in the first two models, Hypothesis 1 is not fulfilled here, which allows us to evaluate how sensitive PCA is to moderate deviations from this ideal situation.
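The disturbance described above amounts to a planar (Givens) rotation. The following sketch draws the two coordinates and the angle anew on each call; whether the rotation is drawn once or per observation is not specified here, so this is an assumption, and the function names are hypothetical.

```python
import math
import random

def rotate_in_plane(x, i, j, angle):
    """Rotate x by `angle` in the plane spanned by coordinates i and j."""
    c, s = math.cos(angle), math.sin(angle)
    y = list(x)
    y[i] = c * x[i] - s * x[j]
    y[j] = s * x[i] + c * x[j]
    return y

def random_disturbance(x, p, rng):
    """Pick one coordinate among the first p and one among the last d - p,
    then rotate by an angle uniform on [-pi/10, pi/10]."""
    d = len(x)
    i = rng.randrange(p)          # index in {0, ..., p-1}
    j = rng.randrange(p, d)       # index in {p, ..., d-1}
    angle = rng.uniform(-math.pi / 10, math.pi / 10)
    return rotate_in_plane(x, i, j, angle)

rng = random.Random(0)
x = [3.0, 4.0, 0.0, 0.0, 0.0]
y = random_disturbance(x, p=2, rng=rng)
quarter_turn = rotate_in_plane([1.0, 0.0, 0.0], 0, 2, math.pi / 2)
print(y, quarter_turn)
```

Since the rotation is orthogonal, the norm and hence the ranking of the observations is unchanged; only the angular part moves out of the $p$ -dimensional subspace.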

In all cases, we add the modulus of a $d$ -dimensional multivariate normal vector with suitable variances and constant correlations $0.2$ . This way, it is ensured that the support of the exceedances over high thresholds is not fully concentrated on the $p$ -dimensional subspace. The variances are chosen equal to $10^{5}/d$ for $\alpha=1$ (i.e., if we start with unit Fréchet margins) and equal to $10/d$ for $\alpha=2$ , so that the sparsity assumption becomes apparent for the most extreme observations, whereas large yet less extreme data points are more spread out.
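A normal vector with common variance and constant correlation $\rho$ can be generated from one shared and $d$ idiosyncratic standard normal variables. The sketch below uses this one-factor construction and interprets "modulus" as the componentwise absolute value, which is our reading; the function name is hypothetical.

```python
import math
import random

def abs_equicorrelated_normal(d, variance, rho, rng):
    """|Z| componentwise for Z ~ N(0, variance * ((1 - rho) I + rho 1 1^T)).

    Each component is sd * (sqrt(rho) * W + sqrt(1 - rho) * eps_j) with a shared
    standard normal W, so all pairwise correlations equal rho (0 <= rho <= 1)."""
    shared = rng.gauss(0.0, 1.0)
    sd = math.sqrt(variance)
    return [
        abs(sd * (math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)))
        for _ in range(d)
    ]

rng = random.Random(1)
d = 10
noise = abs_equicorrelated_normal(d, variance=10.0 / d, rho=0.2, rng=rng)  # alpha = 2 setting
print(noise)
```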

In all settings, we simulate samples of size $n=1000$ and examine the performance of the PCA procedure based on $\Theta=X/\|X\|$ for the $k$ vectors with largest norms for $k\in\{5,10,15,\ldots,200\}$ . The results reported here are based on 1000 simulations in each setting.

We first discuss the simulation results for the Dirichlet model with all Dirichlet parameters $\alpha_{i}$ , $1\leq i\leq p$ , equal to 3 and unit Fréchet margins. Figure 1 shows the mean empirical risk in the left plot as a function of $k$ for the PCA which projects onto a $\tilde{p}$ -dimensional subspace with $1\leq\tilde{p}\leq 10$ ; here the true $p$ equals 2 and the vectors have dimension $d=10$ . Since the mean empirical risk cannot be observed if one analyzes a given data set, the right plot shows the corresponding empirical risk for a single data set. The structure of both plots is very similar: essentially, the mean empirical risk curves are just a bit smoother. For this reason, in the remaining settings, we will only report the mean empirical risk.

It is obvious from the risk plot that $\tilde{p}=2$ is a good choice, since there is a big gap to the empirical risk for $\tilde{p}=1$ , whereas the empirical risk almost vanishes for small $k$ and $\tilde{p}=2$ , and the risk decreases more regularly for values $\tilde{p}>2$ , with no obvious structural breaks. The growing influence of the multivariate normal component as $k$ increases is manifest in these plots, since the empirical risk quickly increases with $k$ for all choices of $\tilde{p}$ . This suggests choosing $k$ rather small to detect the sparsity in the model, a finding which will be corroborated in the analysis of the estimator of the spectral measure below.

In Figure 2, the mean operator norm of the difference between the projection onto the true support of the limit measure $\mu$ and the projection onto the subspace of dimension 2 chosen by PCA is plotted versus $k$ . Again it becomes obvious that for less extreme observations the approximation by a lower-dimensional vector is rather poor, which leads to a larger error for the projection matrix estimated from these data. For $k=80$ , the norm has almost reached its maximal value. However, one should keep in mind that the operator norm measures the maximal distance between the projection of some vector $y\in\mathbb{S}^{d-1}$ onto the estimated respectively the true subspace. If the underlying distribution of $X/\|X\|$ puts little mass on vectors $y$ for which the distance is large, the true risk corresponding to the estimated subspace may still be small.

Next we consider the estimators of the probabilities (i)–(iv), obtained by replacing the spectral measure $H$ either with $\hat{H}_{n,k}$ or $\hat{H}_{n,k}^{PCA}$ . Since the PCA estimator of the subspace supporting $\mu$ quickly deteriorates as $k$ increases, in addition we consider the estimators resulting from $\hat{H}_{n,k,10}^{PCA}$ , that uses just the largest 10 observations to estimate the supporting subspace.

Figure 3 displays the root mean squared errors (RMSE) of the resulting estimators as a function of $k$ . For very small values of $k$ , all estimators perform similarly. For probability (i) with $t_{(i)}=0.65$ (leading to a true value of about 0.684), both PCA based estimators have a considerably smaller RMSE than the standard estimator for most $k$ . In particular, the PCA based method using just the 10 largest observations to estimate the support of the spectral measure clearly outperforms both other estimators (almost) irrespective of the number of observations used for estimation of the spectral measure.

For the estimation of probability (ii) ( $\approx 0.309$ ), the standard non-parametric estimator performs best for $k\leq 40$ . The classical PCA using the same number of order statistics in both steps performs better for larger values of $k$ and its minimum RMSE is a bit lower than that of the standard estimator. The PCA based estimator which determines the support of $\mu$ from the largest 10 observations has a very stable RMSE, but its minimum is much larger than that of both other estimators.

In case (iii) (with true value of about $0.770$ ), the RMSE of the standard estimator and the estimator based on $\hat{H}_{n,k,10}^{PCA}$ are very similar for $k$ up to about 80, but the latter is remarkably insensitive to the choice of $k$ up to 200. This feature might be useful in practical applications where the selection of $k$ is often tricky. In contrast, the PCA based procedure which uses the same number of largest observations in both steps is even more sensitive to this choice than the standard estimator.

Similarly, the classical PCA estimator of probability (iv) strongly depends on the choice of $k$ while both other estimators stably have a very low error.

Next, we consider the Dirichlet model with total dimension $d=100$ when the limit measure is concentrated on a $p=5$ dimensional subspace. Figure 4 shows the mean empirical risk for PCA projecting on a $\tilde{p}\in\{1,\ldots,10\}$ dimensional subspace in the left plot and the mean operator norm of the difference between the estimated and the true projection matrix in the right plot. The empirical risk suggests choosing $\tilde{p}$ between 4 and 6 and $k$ not much larger than 50 for estimating the support of the limit measure.

Figure 5 shows the RMSE of the different estimators of the probabilities (i)–(iv) with $t_{(i)}=0.4$ and true values $0.573,0.072,0.584$ and 0, respectively. Here, we have used PCA with $\tilde{p}=4$ in the upper row, $\tilde{p}=5$ in the mid row and $\tilde{p}=6$ in the lower row. As expected, in most cases the PCA procedures perform worse when they project on too low dimensional subspaces, yet in the cases (i) and (iv) the differences are moderate. At first glance somewhat surprisingly, overall the PCA procedures exhibit a better behavior for $\tilde{p}=6$ than for the “correct” value $\tilde{p}=5$ . This may be explained by the fact that the extra dimension offers the opportunity to compensate for the difference between the subspaces minimizing the true resp. the empirical risk. This difference is expected to be larger if the dimension of the observed vectors is large, as can also be seen from the right plot in Figure 4.

Again, the PCA based estimators for probability (i) outperform the standard procedure, but the other probabilities are more accurately estimated by the standard procedures if $\tilde{p}\leq 5$ (though all estimators of (iv) perform reasonably well). For $\tilde{p}=6$ , the RMSE of both variants of PCA based estimators of (ii) are very similar with a minimum value that is somewhat smaller than the minimum RMSE of the standard estimator. The performance of the standard estimator and the one based on classical PCA are almost identical for the probability (iii), while the estimator with PCA based on just $k=10$ largest observations is less accurate, probably because it is difficult to estimate a subspace of dimension 6 based on just 10 observations. It might help to increase the number of largest observations used to estimate the supporting subspace with the dimension $d$ , but we do not explore this idea here in order not to overload the presentation.

The mean empirical risk and the mean operator norm of the difference matrix are shown for the Gumbel copula with $\vartheta=2$, $d=10$ and $p=2$ in Figure 6. Here, we have chosen Fréchet marginal distributions with cdf $F(x)=\exp(-x^{-2})$, $x>0$. Based on the left plot, one may choose $\tilde{p}=2$, or perhaps $\tilde{p}=3$.

Figure 7 displays the RMSE of the estimators of (i)–(iv) with PCA projecting onto two-dimensional subspaces. Here $t_{(i)}=0.7$ and the true values for (i)–(iv) are $0.3813$, $0.083$, $1/\sqrt{2}$ and $0$. Now the PCA which uses the same number of largest observations in both steps performs worse than the standard estimator for probability (i), better for (ii), and very similarly to the standard procedure for (iii) and $k\leq 100$. The PCA estimator which uses just the 10 largest observations for estimating the support again outperforms the standard procedure for probabilities (i) and (ii), whereas it has a slightly larger RMSE for probability (iii). In any case, as in the Dirichlet model, its RMSE is rather insensitive to the choice of $k$. If one chooses $\tilde{p}=3$ (plots not shown here), then the classical PCA has almost the same RMSE as the standard procedure for the probabilities (i) and (iii). The same holds true for the estimator based on $\hat{H}_{n,k,10}^{PCA}$ for (iii) and (iv), while for (i) this estimator is still considerably more accurate than the standard procedure (though less so than for $\tilde{p}=2$), and both PCA procedures still outperform the standard estimator for probability (ii).

For the high-dimensional Gumbel model with $d=100$ and $p=5$, by and large the findings are similar to the ones observed for the Dirichlet model, so we do not show the corresponding plots. However, in this model $\tilde{p}=4$ can be ruled out by the empirical risk plot and both PCA based estimators outperform the standard estimator of (ii).

Finally, we turn to the disturbed Dirichlet model with $d=10$ and $p=2$ where the observations are randomly rotated by an angle of up to $\pi/10$, leading to true values for (i)–(iv) of $0.653$ (with $t_{(i)}=0.65$), $0.185$, $0.770$ and $0$. The corresponding plots are shown in Figures 8 and 9. In view of the empirical risk, the choices $\tilde{p}\in\{2,3\}$ seem reasonable.

Again, the PCA procedure which uses the same largest observations in both steps performs better for the larger choice of $\tilde{p}$, whereas the performance of the other PCA procedure improves only for (ii), does not change much for (iii), and deteriorates a bit for (i) and (iv). The PCA estimators perform better than the standard procedure for probability (i), and for (iii) if $k$ is large (for classical PCA only if $\tilde{p}=3$). For (ii), roughly speaking, the estimators perform similarly well overall, with the standard procedure performing better for small values of $k$ and the PCA estimators for larger values.

To sum up, the plot of the empirical risk seems a useful tool for choosing the dimension of the subspace onto which the PCA procedure projects. In particular, for the PCA method which uses the same number of largest observations to estimate the support and to calculate the estimator of the spectral measure, in case of doubt it seems advisable to project onto a subspace of higher dimension. While the PCA step does not always improve the estimator of the spectral measure, in most cases the resulting estimators seem competitive with the standard ones if $\tilde{p}$ is chosen appropriately, and for probability (i) they are superior. In particular, the PCA estimators which determine the support based only on the 10 largest observations often exhibit a desirable insensitivity to the number of largest observations used to estimate the spectral measure.
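The difference between the two PCA variants compared above can be made concrete with a rough sketch. It is one plausible reading, not the definition used in the paper: the subspace is fitted on the directions of the `k_support` largest observations, the directions of the `k` largest observations are projected onto it, and the projected vectors are renormalized to unit length (the renormalization is an assumption). The toy data, the function names and the event used at the end are hypothetical and do not correspond to the probabilities (i)–(iv).

```python
import numpy as np

def top_directions(X, k):
    """Unit-norm directions of the k observations with the largest Euclidean norm."""
    norms = np.linalg.norm(X, axis=1)
    idx = np.argsort(norms)[-k:]
    return X[idx] / norms[idx, None]

def pca_basis(theta, p_tilde):
    """Top p_tilde eigenvectors of the uncentered second-moment matrix (d x p_tilde)."""
    _, eigvecs = np.linalg.eigh(theta.T @ theta / theta.shape[0])
    return eigvecs[:, -p_tilde:]  # eigh sorts eigenvalues in ascending order

def projected_angles(X, k, k_support, p_tilde):
    """Fit the subspace on the k_support largest points, then project the k largest.

    Renormalizing the projected vectors to unit length is an assumption of this sketch.
    """
    basis = pca_basis(top_directions(X, k_support), p_tilde)
    projected = top_directions(X, k) @ basis @ basis.T
    return projected / np.linalg.norm(projected, axis=1, keepdims=True), basis

# Hypothetical toy data: Pareto radii on a 2-dimensional subspace plus Gaussian noise.
rng = np.random.default_rng(2)
n, d, p = 5000, 10, 2
true_basis, _ = np.linalg.qr(rng.standard_normal((d, p)))
radius = rng.pareto(2.0, size=n) + 1.0
direction = rng.standard_normal((n, p))
direction /= np.linalg.norm(direction, axis=1, keepdims=True)
X = (radius[:, None] * direction) @ true_basis.T + 0.1 * rng.standard_normal((n, d))

# Variant 1: same number of largest observations in both steps.
same_k, _ = projected_angles(X, k=200, k_support=200, p_tilde=2)
# Variant 2: support estimated from only the 10 largest observations.
small_k, basis_10 = projected_angles(X, k=200, k_support=10, p_tilde=2)

# Hypothetical event: the first coordinate of the angle exceeds 0.3 in absolute value.
prob_same = np.mean(np.abs(same_k[:, 0]) > 0.3)
prob_small = np.mean(np.abs(small_k[:, 0]) > 0.3)
```

Because the subspace in the second variant does not depend on `k`, varying `k` only changes which projected points enter the empirical average, which is in line with the insensitivity to $k$ reported above.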

5 Appendix: Details of the proof of (3.12)

Recall that $Z_{i}$ are iid random variables whose distribution equals the conditional distribution of $\Theta$ given $\|X\|>t$ . Let

[TABLE]

First note that

[TABLE]

for all $z,\tilde{z}\in B_{1}(0)$. Thus a simple version of the bounded difference inequality (see, e.g., Theorem 3.1 of McDiarmid (1998)) gives

[TABLE]

Exactly in the same way as in the proof of Lemma 3.3, one obtains

[TABLE]

Let $\tilde{Z}$ be an independent copy of $Z$. Then

[TABLE]

with

[TABLE]

To sum up, so far we have shown that

[TABLE]

Next consider the U-statistic $U:=(\ell(\ell-1))^{-1}\sum_{i,j=1}^{\ell}g(Z_{i},Z_{j})$ with

[TABLE]

By equation (5.7) of Hoeffding (1963), one has

[TABLE]

with $\mathbb{E}U=\mathbb{E}\|ZZ^{\top}-\tilde{Z}\tilde{Z}^{\top}\|_{HS}^{2}$. Hence,

[TABLE]

This, however, is equivalent to (3.12), because the joint distribution of $\max\big(\phi^{+}(Z_{1:\ell}),\phi^{-}(Z_{1:\ell})\big)$ and $\sum_{i,j=1}^{\ell}\|Z_{i}Z_{i}^{\top}-Z_{j}Z_{j}^{\top}\|_{HS}^{2}$ is the same as the joint conditional distribution of $\sup_{V\in\mathcal{V}_{p}}|\hat{R}_{t}(V)-R_{t}(V)|$ and $\sum_{i,j=1}^{\ell}\|\Theta_{(i)}\Theta_{(i)}^{\top}-\Theta_{(j)}\Theta_{(j)}^{\top}\|_{HS}^{2}$, given $N_{t}=\ell$.
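For readers who want to evaluate the data-dependent quantity numerically, here is a small sketch. The kernel $g$ is defined in a display not reproduced here; the code assumes $g(z,\tilde{z})=\|zz^{\top}-\tilde{z}\tilde{z}^{\top}\|_{HS}^{2}$, which is consistent with the stated value of $\mathbb{E}U$. It uses the identity $\|zz^{\top}-\tilde{z}\tilde{z}^{\top}\|_{HS}^{2}=\|z\|^{4}+\|\tilde{z}\|^{4}-2(z^{\top}\tilde{z})^{2}$ to avoid forming any $d\times d$ matrices, and checks the result against a direct double loop. The function names and the sample are hypothetical.

```python
import numpy as np

def hs_u_statistic(Z):
    """U-statistic with the assumed kernel g(z, w) = ||z z^T - w w^T||_HS^2.

    Uses ||z z^T - w w^T||_HS^2 = ||z||^4 + ||w||^4 - 2 (z . w)^2.
    """
    ell = Z.shape[0]
    gram_sq = (Z @ Z.T) ** 2        # entries (z_i . z_j)^2
    norm4 = np.diag(gram_sq)        # entries ||z_i||^4
    g = norm4[:, None] + norm4[None, :] - 2.0 * gram_sq
    # g(z_i, z_i) = 0, so the sum over all (i, j) equals the sum over i != j.
    return g.sum() / (ell * (ell - 1))

def hs_u_statistic_naive(Z):
    """Direct evaluation with explicit outer products, for checking only."""
    ell = Z.shape[0]
    total = 0.0
    for i in range(ell):
        for j in range(ell):
            diff = np.outer(Z[i], Z[i]) - np.outer(Z[j], Z[j])
            total += np.sum(diff ** 2)  # squared Hilbert-Schmidt (Frobenius) norm
    return total / (ell * (ell - 1))

# Hypothetical sample of unit vectors standing in for Z_1, ..., Z_ell.
rng = np.random.default_rng(1)
Z = rng.standard_normal((30, 4))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)

u_fast = hs_u_statistic(Z)
u_naive = hs_u_statistic_naive(Z)
```

For unit vectors the assumed kernel reduces to $2-2(z^{\top}\tilde{z})^{2}$, so each summand, and hence $U$, lies in $[0,2]$.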

Acknowledgements: Holger Drees was partly supported by DFG grant DR271/6-2. Anne Sabourin was partly supported by the Chaire ‘Stress testing’ from École Polytechnique and BNP Paribas.

Bibliography

1. Anderson, T. W. (1963). Asymptotic theory for principal component analysis. The Annals of Mathematical Statistics, 34(1):122–148.
2. Beirlant, J., Goegebeur, Y., Segers, J., and Teugels, J. L. (2006). Statistics of extremes: theory and applications. John Wiley & Sons.
3. Blanchard, G., Bousquet, O., and Zwald, L. (2007). Statistical properties of kernel principal component analysis. Machine Learning, 66(2-3):259–294.
4. Chautru, E. (2015). Dimension reduction in multivariate extreme value analysis. Electronic Journal of Statistics, 9(1):383–418.
5. Chiapino, M. and Sabourin, A. (2016). Feature clustering for extreme events analysis, with application to extreme stream-flow data. In International Workshop on New Frontiers in Mining Complex Patterns, pages 132–147. Springer.
6. Chiapino, M., Sabourin, A., and Segers, J. (2019). Identifying groups of variables with the potential of being large simultaneously. Extremes, 22(2):193–222.
7. Cooley, D. and Thibaud, E. (2016). Decompositions of dependence for high-dimensional extremes. arXiv preprint arXiv:1612.07190.
8. Einmahl, J. H., de Haan, L., and Piterbarg, V. I. (2001). Nonparametric estimation of the spectral measure of an extreme value distribution. The Annals of Statistics, 29(5):1401–1423.
