Principal Component Analysis for Multivariate Extremes
Holger Drees, Anne Sabourin (LTCI)

TL;DR
This paper introduces a PCA-based method for identifying lower-dimensional subspaces in high-dimensional heavy-tailed data, improving the analysis of extreme events by reducing dimensionality and providing theoretical guarantees.
Contribution
It applies PCA to radially thresholded data in a heavy-tailed setting and proves convergence of the estimated subspace to the optimal one with finite sample guarantees.
Findings
Empirical risk converges uniformly to the true risk for all projection subspaces.
The estimated subspace converges in probability to the optimal subspace.
Finite sample guarantees are provided for the reconstruction error.
Abstract
The first order behavior of multivariate heavy-tailed random vectors above large radial thresholds is ruled by a limit measure in a regular variation framework. For a high dimensional vector, a reasonable assumption is that the support of this measure is concentrated on a lower dimensional subspace, meaning that certain linear combinations of the components are much likelier to be large than others. Identifying this subspace and thus reducing the dimension will facilitate a refined statistical analysis. In this work we apply Principal Component Analysis (PCA) to a re-scaled version of radially thresholded observations. Within the statistical learning framework of empirical risk minimization, our main focus is to analyze the squared reconstruction error for the exceedances over large radial thresholds. We prove that the empirical risk converges to the true risk, uniformly over all…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Principal Component Analysis for Multivariate Extremes
Holger Drees
University of Hamburg, Department of Mathematics, Germany
Anne Sabourin
LTCI, Télécom Paris, Institut polytechnique de Paris, France
Abstract
The first order behavior of multivariate heavy-tailed random vectors above large radial thresholds is ruled by a limit measure in a regular variation framework. For a high dimensional vector, a reasonable assumption is that the support of this measure is concentrated on a lower dimensional subspace, meaning that certain linear combinations of the components are much likelier to be large than others. Identifying this subspace and thus reducing the dimension will facilitate a refined statistical analysis. In this work we apply Principal Component Analysis (PCA) to a re-scaled version of radially thresholded observations.
Within the statistical learning framework of empirical risk minimization, our main focus is to analyze the squared reconstruction error for the exceedances over large radial thresholds. We prove that the empirical risk converges to the true risk, uniformly over all projection subspaces. As a consequence, the best projection subspace is shown to converge in probability to the optimal one, in terms of the Hausdorff distance between their intersections with the unit sphere. In addition, if the exceedances are re-scaled to the unit ball, we obtain finite sample uniform guarantees to the reconstruction error pertaining to the estimated projection subspace. Numerical experiments illustrate the relevance of the proposed framework for practical purposes.
Key words: Principal Component Analysis, Multivariate extreme value analysis, dimensionality reduction, Empirical Risk Minimization.
MSC primary 62G32; secondary 62H25.
1 Introduction
If one wants to analyze the tail behavior of an -valued random vector one usually assumes that is regularly varying (if necessary after a standardization of the marginal distributions), i.e. there exists a normalizing function and a non-zero measure on such that
[TABLE]
for all -continuous Borel sets that are bounded away from the origin. Equation (1.1) may be understood as a generalization to arbitrary dimension of a heavy-tail assumption regarding a real-valued random variable. This mathematical framework is particularly useful in situations where the focus is on ‘tail events’ of the kind where the distance to the origin is large, for some norm . In a risk management context, the probability of such tail events is of crucial importance. If the distance is so large that few or no data are available in the considered region, all attempts to resort to empirical estimation are in vain. One common idea behind statistical methods based on Extreme Value Theory (EVT) is to use a small proportion of the available data (those with a comparatively large norm) to learn an estimate for , to be used for quantifying the probability of tail events.
1.1 Regular Variation
A substantial reference concerning the probabilistic aspects of regular variation in the setting of EVT is Resnick, (2013), see also Resnick, (2007) for application-oriented examples. Regular variation for Borel measures on Polish spaces has since been revisited in Hult and Lindskog, (2006) and Lindskog et al., (2014). It is well known that if Equation (1.1) holds true, then the limit measure is homogeneous of order for some . Moreover, the normalizing function and the norm are regularly varying, too: and as for all . Here may be any norm on , but in what follows we only consider the Euclidean norm.
Because the limit measure is homogeneous, after a polar transformation, it can be decomposed into a so-called spectral (or angular) probability measure and an independent radial component, that is
[TABLE]
for all and all Borel subsets of the unit sphere, with . Whereas the literature is plentiful concerning the design and the asymptotics of flexible multivariate parametric or non-parametric models for or integrated versions of it (see e.g. Segers, (2012); Fougères et al., (2015); Einmahl et al., (2001); Genest and Segers, (2009); Rootzén et al., (2018), or Beirlant et al., (2006) and the references therein), the issue of how to escape the curse of dimensionality has only recently been raised (see below). One reason for this may be that a major application of EVT is environmental, spatial extremes such as heavy rainfalls, heat waves, droughts or floods. In this context, max-stable or generalized Pareto spatial models are widely used (Padoan et al., (2010); Ferreira and de Haan, (2014); Schlather, (2002)) which have built in a priori information about the spatial dependence structure, thus reducing the effective dimension.
1.2 Dimensionality reduction for extreme values, a brief overview
For applications such as e.g. anomaly detection or network monitoring where no particular structure is known a priori, dimensionality reduction suggests itself as a preliminary step before implementing any kind of learning procedure and the subject is recently receiving increasing attention. If is moderate or large, the measure (and hence ) will often exhibit some ‘sparse’ structure. For example, if some of the components of are asymptotically independent, i.e. for some index set of size
[TABLE]
then is concentrated on . More generally, one may consider the case where only a small number of subsets of components are likely to be large simultaneously, while the other components remain small. Here, ‘small number’ is understood relatively to the non empty possible subsets of components. This setting applies e.g. to heavy rainfalls in a spatial setting (storms are usually localized, so that neighboring sites are more likely to be concomitantly impacted) or of shocks over different assets of a financial portfolio. Chautru, (2015) proposes a clustering approach combined with spherical data analysis to detect structures of this type. Goix et al., (2016, 2017) propose an algorithm with moderate computational cost (linear in the dimension and the sample size) and finite sample uniform guarantees. Their error bounds are linear in and scale as , where is the number of order statistics of each component which are considered extreme during the training step. A refinement of the latter framework is proposed in the yet unpublished work of Simpson et al., (2018). Chiapino and Sabourin, (2016) and Chiapino et al., (2019) aim at identifying subgroups of components for which the probability of a joint excess over a large quantile is not negligible compared to that of an excess by a single component. Engelke and Hitz, (2018) use graphical models to reduce the complexity of the extremal dependence structure. In a regression context, Gardes, (2018) sets up a mathematical framework for tail dimension reduction suited to the case where the distribution of the target variable above high thresholds only depends on the projection of the covariates on a lower dimensional subspace. Consistency of K-means clustering applied to the most extreme observations of a data set has recently been proven in the unpublished work of Janßen and Wan, (2019).
1.3 Principal component analysis (PCA) and support identification
Here we focus on finding a linear subspace on which is (nearly) concentrated. In a classical setting, when has finite second moments, PCA (Anderson, (1963)) is the method of choice to determine such supporting linear subspaces if random vectors , , with the same distribution as are observed. Theoretical guarantees obtained so far concern the reconstruction error (Koltchinskii and Giné, (2000); Shawe-Taylor et al., (2005); Blanchard et al., (2007); Koltchinskii and Lounici, (2017)) or the approximation error for the eigenspaces of the covariance matrix (Zwald and Blanchard, (2006)), under the assumption that the sample space (or the feature space for Kernel-PCA) has finite diameter or that sufficiently high order moments exist.
For motivation of our version of PCA, it is useful to keep the following working hypothesis in mind, although it is not required for most results to hold:
Hypothesis 1**.**
The vector space generated by the support of has dimension .
Note that then the points are more and more concentrated on a neighborhood of as increases, but that usually they will not lie on . If the dimension of is known, then it suggests itself to approximate by the subspace of dimension which is ‘closest’ in expectation to these points.
In PCA one measures the closeness by the squared Euclidean distance which hugely alleviates the optimization problem as one may work with orthogonal projections in the Hilbert space . However, this approach assumes finite second moments that cannot be taken for granted in the above setting. Indeed, if then . Hence, we will instead consider re-scaled vectors
[TABLE]
where is a suitable scaling function. The most common choice is , leading to on the unit sphere, and we will focus on this re-scaling when we derive finite sample bounds on the reconstruction error (see Section 3). However, consistency results will be proved for considerably more general scaling functions; cf. Section 2.
To the best of our knowledge, the only existing work considering PCA properly speaking for high dimensional extremes is the unpublished paper of Cooley and Thibaud, (2016). The authors discuss a transformation mapping negative observations to small positive ones and apply PCA in this transformed space. They also use a preliminary re-scaling involving the norm of the transformed vector. They illustrate their approach with simulations and real data examples, without deriving theoretical statistical guarantees.
1.4 Notation and risk minimization setting
To give a formal description of our method, we first introduce some notation. All random variables are defined on some probability space ; the expectation with respect to is denoted by . For and , let
[TABLE]
By we denote the distribution of and by its conditional distribution given that , i.e. . Then is the weak limit of (with denoting the closed unit ball); cf. (1.1).
For any probability measure and any -integrable function , we denote the expectation of with respect to by or . By we denote the conditional expectation (with respect to ) given so that , provided the expectations exist.
For any linear subspace , let be the orthogonal projection onto (or the associated projection matrix), and let be the orthogonal projection onto the orthogonal complement of .
To apply PCA to the re-scaled vectors, we have to assume that the scaling function is chosen such that and . Note that this condition is always fulfilled if there exist and such that for all . For simplicity’s sake, in what follows we will impose the following stronger homogeneity condition:
[TABLE]
where denotes the unit sphere. Note that then . The choice seems natural, but different choices allow for focusing on particular aspects of the extreme value behavior. For instance, if one is only interested in the positive components of , one may choose .
Hypothesis 1 is equivalent to the statement that and for all where
[TABLE]
and the infima are taken over all linear subspaces of the specified dimension. The risk may be interpreted as the expected reconstruction error in the limit model if the re-scaled observation is replaced with its lower dimensional approximation . Since weakly, one may approximate by a subspace of dimension which minimizes the conditional risk
[TABLE]
given that exceeds a high threshold . Note that may be of interest even if Hypothesis 1 only holds approximately, in the sense that concentrates most of its mass on a small neighborhood of a -dimensional subspace.
It is natural to ‘estimate’ (and thus ) by a minimizer of the corresponding empirical risk
[TABLE]
Here the threshold must be chosen suitably, depending on the sample size. To this end, often order statistics of the norms of the observed vectors are used, and we follow this approach. Let where is a permutation of indices such that . (For brevity, we suppress the dependence on in our notation of order statistics.) For , denote by the empirical quantile of level for . We define the empirical risk for the subspace related to the largest observations as
[TABLE]
where in accordance with the notation introduced in (1.4). A minimizer of among all linear subspaces of dimension will be denoted by . It is the main goal of the present paper to analyze the asymptotic and the finite sample behavior of the empirical risk and its minimizer .
1.5 Outline
In Section 2 we will first show that the minimizer of the risk based on a finite threshold converges to the minimizer of the limit risk , and thus under Hypothesis 1 to , as . Moreover, we show consistency of the empirical minimizer under condition (1.5) with suitable . In Section 3, we derive non-asymptotic uniform bounds on and for the most important scaling . Furthermore, we construct uniform confidence bands for . The results obtained in a simulation study are reported in Section 4. In particular, we explore the choice of the dimension based on empirical risk plots and the effect of a PCA projection on estimators of probabilities expressed in terms of the spectral measure . Finally, Section 5 contains some details about the proof of a modification of a result by Blanchard et al., (2007).
2 Consistency of risk minimizers
In this section we first discuss how to calculate minimizers of the conditional risk given and the empirical risk . Moreover, we prove that these converge in some sense towards a minimizer of .
It is well known that a point of minimum of can be derived from the spectral analysis of the matrix of second (mixed) moments of :
Lemma 2.1**.**
- (i)
Let be an -valued random vector with and . Let denote the eigenvalues of with corresponding orthogonal eigenvectors . Then minimizes among all linear subspaces of dimension . In the case it is the unique minimizer. 2. (ii)
If the scaling condition (1.5) holds and denote the eigenvalues of with corresponding orthogonal eigenvectors , then minimizes among all linear subspaces of dimension . In the case it is the unique minimizer. 3. (iii)
If the scaling condition (1.5) holds and denote the eigenvalues of with corresponding orthogonal eigenvectors , then minimizes among all linear subspaces of dimension .
A proof of assertion (i) can e.g. be found in Seber, (1984), Theorem 5.3, where also other optimality properties of the minimizers are given. Both the other results follow directly by an application of (i) with equal to conditional on , respectively a random variable according to the empirical distribution of the for which . If , then the minimizer is not unique. With any minimizer of can be represented as where are orthogonal eigenvectors to the eigenvalue and all these subspaces are minimizers. An analogous statement holds for the empirical risk.
Next we discuss the relationship between and and their respective minimizers. The convergence of the risks is an immediate consequence of the following simple lemma.
Lemma 2.2**.**
Let be a measurable function that is locally bounded, -a.e. continuous and satisfies for some . Then .
Proof.
According to (1.1), weakly. Let be a random vector with distribution and a random vector with distribution . Since , the assertion follows if the are asymptotically uniformly integrable (see Van der Vaart, (2000), Theorem 2.20).
By assumption can be bounded by a multiple of . Now, for all and for some sufficiently large , integration by parts, regular variation of and Karamata’s theorem yield
[TABLE]
In particular, for , so that and thus are asymptotically uniformly integrable. ∎
Corollary 2.3**.**
Suppose that fulfills condition (1.5). Then, for any subspace of , the suitably standardized associated finite threshold risk converges:
[TABLE]
Proof.
Note that by the homogeneity of ,
[TABLE]
with . Since , Lemma 2.2 yields the assertion. ∎
In view of Corollary 2.3, one may ask whether a minimizer of (which of course is also a minimizer of ) converges in some sense to a minimizer of . Denote by the set of all subspaces of of dimension , endowed with the metric , where denotes the operator norm.
Remark 2.4*.*
Note that also gives an upper bound on the Hausdorff distance between and . To see this, let and be such that the Hausdorff distance equals . Then , and . Hence
[TABLE]
Theorem 2.5**.**
Suppose that satisfies condition (1.5) and that has a unique minimizer in . Then for any minimizer of in one has
[TABLE]
The following lemma plays a crucial role in the proof of Theorem 2.5.
Lemma 2.6**.**
If satisfies condition (1.5), then for sufficiently large , the standardized risks , , are equicontinuous w.r.t. .
Proof.
First note that \big{|}\|\mathbf{\Pi}_{V}^{\perp}\theta(x)\|-\|\mathbf{\Pi}_{W}^{\perp}\theta(x)\|\big{|}\leq\|\mathbf{\Pi}_{V}^{\perp}\theta(x)-\mathbf{\Pi}_{W}^{\perp}\theta(x)\|\leq\|\theta(x)\|\rho(V,W)\leq c_{\omega}\|x\|^{1-\beta}\rho(V,W). Choose as in the proof of Lemma 2.2 and recall the definition of given there. Then, by (2.1), for all subspaces of
[TABLE]
which proves the assertion. ∎
Proof of Theorem 2.5.
We first prove that is compact w.r.t. . The assertion then follows by standard arguments using Lemma 2.6.
We have to show that any sequence in has a convergent subsequence. For each , let be an orthonormal basis for so that where denotes the matrix with columns . The vectors belong to the compact set . Thus there exists a subsequence such that for all . Since for all , , we also have and the , form an orthonormal family in . Let be the space generated by the ’s and denote by the matrix with these columns. Then has dimension , i.e. , and by construction
[TABLE]
which proves the claimed compactness.
Now assume that the assertion of the theorem was wrong. By the compactness of , then there exist a sequence such that converges to some . By Lemma 2.6, , and by Corollary 2.3 and . Hence, for , that is strictly positive by assumption, and sufficiently large , one may conclude a contradiction:
[TABLE]
Therefore, the assertion must be correct. ∎
Under Hypothesis 1, is the unique minimizer of over , that is if we minimize the risk over linear subspaces with the correct dimension, as the following result shows. Hence in this case, converges to .
Lemma 2.7**.**
Under Hypothesis 1, for any subspace of arbitrary dimension one has
[TABLE]
Thus, is the unique minimizer of in , whereas on with the points of minimum of the limit risk are not unique.
Proof.
If then . By Hypothesis 1, is concentrated on , which implies .
Conversely, if , then . By definition of and the homogeneity of , this means that the support of must be a subset of and thus . ∎
In the remaining part of this section, we will establish analogous consistency results for the empirical risk and its minimizer. In what follows, let be the c.d.f. of , its generalized inverse (quantile function) and define
[TABLE]
We start with consistency of the standardized empirical risk.
Proposition 2.8**.**
If satisfies condition (1.5), then in probability for all linear subspaces of .
Proof.
For simplicity, we assume that is continuous in the tail (so that there are no ties among the observed norms), but the proof can be easily generalized using standard techniques from the theory of regular varying functions. First we want to replace the random threshold with in the definition of . Observe that by the Hölder inequality
[TABLE]
where is chosen such that .
It is well known that in probability. Thus there exists a sequence such that . By (2.1) and the regular variation of
[TABLE]
In particular, is stochastically bounded.
Furthermore,
[TABLE]
because there exist exactly exceedances of , and either all non-vanishing differences of the indicator functions equal 1 or all equal , depending on whether or . Now, the last sum is binomially distributed with parameters and . By the central limit theorem for triangular arrays, the right hand side is of the stochastic order .
A combination of these results show that
[TABLE]
uniformly for all subspaces .
In view of Corollary 2.3, it thus suffices to show that
[TABLE]
in probability, with .
Let . By similar calculation as in the proof of Corollary 2.3
[TABLE]
Similarly as in the proof of Lemma 2.2, we can bound the expectation using integration by parts and Karamata’s theorem:
[TABLE]
for sufficiently large . Therefore, by the choice of
[TABLE]
which implies the convergence in probability of .
For the second term, we may conclude from (2.1) and the definition of that
[TABLE]
Because is regularly varying with index and , the right hand side tends to 0. This proves that converges to 0 in probability, which concludes the proof. ∎
The following result is an analog to Lemma 2.6.
Lemma 2.9**.**
If satisfies condition (1.5), then for all there exists such that for sufficiently large
[TABLE]
Proof.
First note that in view of (2.2), it suffices to prove the assertion with replaced by and replaced by the analogous expression.
Similarly as in the proof of Lemma 2.6, we have
[TABLE]
for sufficiently large. Hence, by Markov’s inequality, (2.1) and the definition of ,
[TABLE]
for \delta:=\varepsilon^{2}(\alpha-2(1-\beta))/\big{(}8(1-\beta)c_{\omega}^{2}\big{)}. ∎
We are now ready to prove weak consistency of the empirical risk minimizer.
Theorem 2.10**.**
If satisfies condition (1.5) and has a unique minimizer in , then for all minimizers of in one has in probability.
Proof.
Let . Fix an arbitrary and let . By the arguments used in the proof of Lemma 2.6, it is easily seen that is (Lipschitz) continuous w.r.t. . Moreover, is compact (see proof of Theorem 2.5), and so is . Hence, , since the infimum is attained and is the unique minimizer of .
According to Lemma 2.9, there exists and such that for all with probability greater than one has
[TABLE]
for all such that . Since is compact, there exists a finite cover of by open balls with radius and centers , say. By Proposition 2.8, there exists such that with probability greater than
[TABLE]
Hence, on a set with probability greater than , there exists such that and
[TABLE]
By the definition of , this implies and thus
[TABLE]
Since was arbitrary, this concludes the proof. ∎
So far, we have proved weak consistency of both the standardized empirical risk and the empirical risk minimizer under mild assumptions on the scaling function . However, the rates of convergence may be arbitrarily slow. As condition (1.5) does not guarantee any finite moments of of order greater than 1, it will not suffice to establish useful risk bounds. In the next section, we therefore analyze the recovery risk under the stronger assumption that is bounded.
3 Uniform risk bounds
Since a minimizer of the empirical risk (or of ) differs from the minimizer of the true risk , usually the so-called excess risk will be strictly positive. We follow the common approach in the theory of risk minimization to bound the excess risk by deriving uniform bounds on which hold with high probability for a fixed sample size . If these uniform bounds can be calculated from the observed data, they may also be used to construct confidence intervals for the reconstruction error resp. .
Since tight concentration inequalities are available only for subgaussian distributions, in this section we will assume that the scaling function satisfies the following condition:
[TABLE]
so that for all . Moreover, we suppose that the c.d.f. of is continuous in the tail to avoid technicalities. Then we may assume w.l.o.g. that there are no ties and thus exactly observations with norm larger than .
For classical PCA (and a kernel version thereof), Shawe-Taylor et al., (2005) established uniform risk bounds for bounded random vectors , which were improved by the following result by Blanchard et al., (2007). Assume , denote the empirical matrix of second (mixed) moments by and the Hilbert-Schmidt norm on the space of matrices by . Then, with probability greater than
[TABLE]
for all . One may try to derive uniform risk bounds in our extreme value setting by applying this result to the random variables , so that the left hand side is approximately equal to with
[TABLE]
if one ignores the difference between and its expectation . In the case , however, the above upper bound will not even converge to 0 when it is divided by because of the second term. Hence this direct approach does not give meaningful bounds for .
The reason for this inconsistency is that, unlike in the classical setting, most of the random variables will vanish as increases, and the concentration inequalities used in the proofs of the aforementioned bounds are too crude in such a situation. However, we will take up ideas used by Blanchard et al., (2007), with appropriate modifications, to derive much tighter uniform bounds on . Furthermore, we will derive uniform bounds on which hold conditionally on and depend only on the data. These can then be used to construct confidence bands for .
Before we establish these bounds, we first recall some well-known facts about Hilbert spaces specialized to the present setting, and introduce some notation. Let be an arbitrary orthonormal basis of and denote by the usual inner product on . The space of linear operators from to (i.e., -matrices) equipped with the inner product is a Hilbert space. The corresponding Hilbert Schmidt norm can be expressed as \|A\|_{HS}=\big{(}\sum_{i=1}^{d}\|Ae_{i}\|^{2}\big{)}^{1/2}=\big{(}\operatorname{tr}(AA^{\top})\big{)}^{1/2} with denoting the trace operator. If, for any subspace of , one chooses the first vectors to form an orthonormal basis of , then one sees that
[TABLE]
Moreover, direct calculations show that
[TABLE]
Finally, for independent centered random matrices , , one has
[TABLE]
If, for the time being, one neglects the difference between the empirical and the true -quantile of , then can be approximated by where
[TABLE]
Denote the empirical distribution of the observed random vectors , , by . For any threshold , the maximal difference between the approximate empirical risk and the true risk can be rewritten as
[TABLE]
with
[TABLE]
For brevity’s sake, in what follows we use the notation for a subvector of .
In order to derive uniform bounds on the difference between the empirical and the true risk, we first establish concentration inequalities for using a version of the bounded difference inequality by (McDiarmid,, 1998, Theorem 3.8), which we recall for convenience.
Theorem 3.1**.**
Let be an sample taking its values in some space and be any measurable function. Consider the positive deviation functions, defined for and for ,
[TABLE]
Denote their maximum by
[TABLE]
and the maximal summed variance by
[TABLE]
If both and are finite, then for all ,
[TABLE]
Lemma 3.2**.**
For all ,
[TABLE]
Proof.
The assertion follows immediately from Theorem 3.1 applied to and the following bounds:
[TABLE]
and
[TABLE]
∎
The expectation can be bounded using arguments from Blanchard et al., (2007).
Lemma 3.3**.**
[TABLE]
with .
Proof.
Since, by (3.3), for any linear subspace and any , using the bilinearity of the inner product and the Cauchy-Schwarz inequality in the Hilbert-Schmidt space, we obtain
[TABLE]
Using (3.2) and taking the supremum over all and the expectation, one arrives at
[TABLE]
One the other hand, by first rewriting , analogously one obtains
[TABLE]
Now, by the Cauchy-Schwarz inequality and (3.4),
[TABLE]
Combining this with (3.8) and (3.9), we arrive at
[TABLE]
It remains to show that \operatorname{\mathbb{E}}\|\Theta_{t}\Theta_{t}^{\top}-\operatorname{\mathbb{E}}\Theta_{t}\Theta_{t}^{\top}\|^{2}_{HS}=\pi_{t}\big{(}\operatorname{\mathbb{E}}_{t}\|\Theta\|^{4}-\pi_{t}\operatorname{tr}(\Sigma_{t}^{2})\big{)}. From the representation of the Hilbert Schmidt norm by the trace operator and the linearity of the latter, one may conclude by direct calculations that
[TABLE]
Hence the assertion follows from
[TABLE]
with denoting the th unit vector. ∎
Now we are ready to state our first uniform risk bound. Recall that .
Theorem 3.4**.**
For all one has
[TABLE]
with .
In particular, with probability greater than or equal to
[TABLE]
Note that (3.11) also implies an upper bound on the excess risk:
[TABLE]
Proof.
With defined in (3.5), we have
[TABLE]
By similar arguments as in the proof of Proposition 2.8, we see that, for all ,
[TABLE]
By Bernstein’s inequality, it follows that
[TABLE]
For the second term , Lemma 3.2, Lemma 3.3 and immediately yield
[TABLE]
which concludes the proof of the first assertion.
Check that for
[TABLE]
both exponential expressions on the right hand side of (3.10) equal , and so the upper bound equals . Hence the remaining assertions follow from . ∎
Remark 3.5*.*
In the case , the upper bound in (3.11) simplifies to
[TABLE]
Note that the upper bound in Theorem 3.4 cannot be calculated from the data and can thus not directly be used to construct confidence intervals for the true reconstruction error or the minimal reconstruction error . Next, we derive data-dependent bounds directly from (a minor improvement of) the bound established by Blanchard et al., (2007). However, this result will be applied to the conditional distribution of given and the resulting bound is to be interpreted conditional on the number of exceedances over the chosen threshold .
Theorem 3.6**.**
For all ,
[TABLE]
with \tilde{S}_{t}:=N_{t}^{-1}\sum_{i=1}^{n}\|\Theta_{i,t}\|^{4}-\operatorname{tr}\Big{(}(N_{t}^{-1}\sum_{i=1}^{n}\Theta_{i,t}\Theta_{i,t}^{\top})^{2}\Big{)} and .
If, for all , constants are chosen such that 2\exp\big{(}-2\ell u_{\ell}^{2})+\exp\big{(}-\lfloor{\ell/2}\rfloor v_{\ell}^{2}/2\big{)}=1-\alpha, then I_{\ell}(V):=\big{[}\hat{R}_{t}(V)-B_{t,\ell},\hat{R}_{t}(V)+B_{t,\ell}\big{]}\cap[0,\infty) with
[TABLE]
defines a uniform level confidence band for , , conditionally on . If one defines , then defines a uniform level confidence band for , (unconditionally).
Proof.
Define random vectors whose distribution equals the conditional distribution of given . Recall that where is the vector with the th largest norm among . Then, conditionally on , the joint distribution of the empirical risk and equals the joint distribution of and the order statistics of . Therefore, the proof of Theorem 3.1 of Blanchard et al., (2007) (with and ) combined with arguments given in the proof of Lemma 3.3 show that
[TABLE]
Since the proof of Theorem 3.1 of Blanchard et al., (2007) is quite tersely formulated in a more abstract setting and contains a minor inaccuracy, for convenience we give more details of the proof of (3.12) in the Appendix.
In the same way as in the proof of Lemma 3.3, the first assertion thus follows from
[TABLE]
where in the last step we have used that, on , the set of non-vanishing vectors equals the set of non-vanishing random vectors .
The remaining assertions are now obvious. ∎
Remark 3.7*.*
In the statement about the confidence bands one may replace with
[TABLE]
This half width of a confidence band is more suitable for (numerical) minimization (as a function of and ) under the constraint 2\exp\big{(}-2\ell u_{\ell}^{2})+\exp(-\lfloor{\ell/2}\rfloor v_{\ell}^{2}/2)=1-\alpha.∎
Remark 3.8*.*
The (modified) proof of Theorem 3.1 of Blanchard et al., (2007) also shows that
[TABLE]
with . Observe that . On the set , one thus has
[TABLE]
since . Moreover, for , it has been shown in the proof of Theorem 3.4 that on the set and that \mathbb{P}(M_{t_{n,k}}^{c})\leq 2\exp\big{(}-kv^{2}/(2(1+v/3))\big{)}. Hence,
[TABLE]
A comparison with Theorem 3.4 reveals that the new bound may be tighter if is substantially smaller than . This will be the case if is small and \operatorname{tr}\big{(}(\operatorname{\mathbb{E}}_{t}\Theta\Theta^{\top})^{2}\big{)} is not much smaller than . ∎
So far, we have compared empirical risks with the true risk for finite thresholds . A comparison with the limit risk would require second order refinements of our basic assumption (1.1). Let and . Denote the eigenvalues of by . Then standard calculations from classical PCA show that
[TABLE]
where the second supremum is taken over all -matrices with orthogonal columns. Likewise, and hence
[TABLE]
Therefore, bounds on the difference between empirical risks and the limit risk require additional assumptions on the spectrum of the difference between the matrix of second moments for the re-scaled exceedances over the threshold and the corresponding matrix in the limit model.
If one merely wants to compare the minimum risk for finite thresholds with the minimum limit risk, which equal the sums of smallest eigenvalues of resp. , then somewhat weaker assumptions on the convergence of the spectrum of and are needed. In particular, under Hypothesis 1, equals the sum of the smallest eigenvalues of .
4 Simulation study
We investigate the performance of our PCA procedure. In particular, we examine how the standard non-parametric estimator of the spectral measure (defined via (1.2)) based on the largest observations
[TABLE]
(with ) is influenced if the data is first projected onto a lower dimensional subspace using PCA:
[TABLE]
Here, is the Dirac measure with point mass at and denotes the subspace picked by PCA based on the same number of largest observations. It will turn out that sometimes it is advisable to use a smaller number for the PCA procedure; the resulting estimator of the spectral measure will be denoted by .
To measure the performance of the spectral estimators, we consider the resulting estimators of the following probabilities in the limit model, that can be expressed in terms of the spectral measure:
- (i)
for some
- (ii)
\int\big{(}(\min_{1\leq j\leq p}x^{j})^{\alpha}-(\max_{p+1\leq j\leq d}x^{j})^{\alpha}\big{)}^{+}\,H(dx)
- (iii)
- (iv)
The first probability is related to the cdf of the mean contribution of the first coordinates to the norm of the random vector, thus quantifying, in some sense, how strongly the norm is spread over the coordinates. Probability (ii) indicates how likely it is that the first components are all large, while this is not true for any of the other components, given that the norm of the vector is large. Probability (iii) specifies how likely it is that the first component is extreme, given that any component is extreme. In a financial context, such probabilities are used to quantify how strongly a specific market participant is exposed to a failure of any market participant. Finally, probability (iv) specifies the minimal contribution of any coordinate to the norm. Note that under Hypothesis 1 this probability equals 0. The other true values are determined by Monte Carlo simulations with sample size of at least , unless they can be easily calculated analytically; the approximation error is always smaller than . Throughout, we assume to be known since we are interested in the effect of the PCA procedure on the estimator of the spectral measure, which should not be compounded with the estimation error of the tail index.
We consider different models of -dimensional regularly varying vectors for which the spectral measure is (approximately) concentrated on a -dimensional subspace. Since PCA is equivariant under rotations, w.l.o.g. we may assume that this subspace is spanned by the first unit vectors.
Two different models for the extreme value dependence structure between the first coordinates of the vector are investigated. First, we consider the so-called Dirichlet model; see, for instance, Segers, (2012), Ex. 3.6, where also a simple algorithm is given to simulate vectors with such an extremal dependence structure. Second, we simulate random vectors with a Gumbel copula C_{\vartheta}(x)=\exp\big{(}-\big{(}\sum_{i=1}^{p}(-\log x_{i})^{\vartheta}\big{)}^{1/\vartheta}\big{)}, using the transformation method proposed by Stephenson, (2003). The marginal distributions are chosen as a Fréchet distribution with cdf , .
In addition, we have simulated observations from a Dirichlet model which are then rotated in the plane spanned by two randomly chosen coordinates, one of them among the first coordinates, the other among the last . The rotation angle is uniformly distributed on the interval . Note that, unlike in the first two models, Hypothesis 1 is not fulfilled here which allows to evaluate how sensitive PCA is to moderate deviations from this ideal situation.
In all cases, we add the modulus of a -dimensional multivariate normal vector with suitable variances and constant correlations . This way, it is ensured that the support of the exceedances over high thresholds is not fully concentrated on the -dimensional subspace. The variances are chosen equal to for (i.e., if we start with unit Fréchet margins) and equal to for , so that the sparsity assumption becomes apparent for the most extreme observations, whereas large yet less extreme data points are more spread out.
In all settings, we simulate samples of size and examine the performance of the PCA procedure based on for the vectors with largest norms for . The results reported here are based on 1000 simulations in each setting.
We first discuss the simulation results for the Dirichlet model with all Dirichlet parameters , , equal to 3 and unit Fréchet margins. Figure 1 shows the mean empirical risk in the left plot as a function of for the PCA which projects onto a -dimensional subspace with ; here the true equals 2 and the vectors have dimension . Since the mean empirical risk cannot be observed if one analyzes a given data set, the right plot shows the corresponding empirical risk for a single data set. The structure of both plots is very similar: essentially, the mean empirical risk curves are just a bit smoother. For this reason, in the remaining settings, we will only report the mean empirical risk.
It is obvious from the risk plot that is a good choice, since there is a big gap to the empirical risk for , whereas the empirical risk almost vanishes for small and , and the risk decreases more regularly for values , with no obvious structural breaks. The growing influence of the multivariate normal component as increases is manifest in these plots, since the empirical risk quickly increases with for all choices of . This suggests to choose rather small to detect the sparsity in the model, a finding which will be corroborated in the analysis of the estimator of the spectral measure below.
In Figure 2, the mean operator norm of the difference between the projection onto the true support of the limit measure and the projection onto the subspace of dimension 2 chosen by PCA is plotted versus . Again it becomes obvious that for less extreme observations the approximation by a lower-dimensional vector is rather poor, which leads to a larger error for the projection matrix estimated from these data. For , the norm has almost reached its maximal value. However, one should keep in mind that the operator norm measures the maximal distance between the projection of some vector onto the estimated respectively the true subspace. If the underlying distribution of puts little mass on vectors for which the distance is large, the true risk corresponding to the estimated subspace may still be small.
Next we consider the estimators of the probabilities (i)–(iv), obtained by replacing the spectral measure either with or . Since the PCA estimator of the subspace supporting quickly deteriorates as increases, in addition we consider the estimators resulting from , that uses just the largest 10 observations to estimate the supporting subspace.
Figure 3 displays the root mean squared errors (RMSE) of the resulting estimators as a function of . For very small values of , all estimators perform similarly. For probability (i) with (leading to a true value of about 0.684), both PCA based estimators have a considerably smaller RMSE than the standard estimator for most . In particular, the PCA based method using just 10 largest observations to estimate the support of the spectral measure clearly outperforms both other estimators (almost) irrespective of the number of observations used for estimation of the spectral measure.
For the estimation of probability (ii) (), the standard non-parametric estimator performs best for . The classical PCA using the same number of order statistics in both steps performs better for larger values of and its minimum RMSE is a bit lower than that of the standard estimator. The PCA based estimator which determines the support of from the largest 10 observations has a very stable RMSE, but its minimum is much larger than that both of the other estimators.
In case (iii) (with true value of about ), the RMSE of the standard estimator and the estimator based on are very similar for up to about 80, but the latter is remarkably insensitive to the choice of up to 200. This feature might be useful in practical applications where the selection of is often tricky. In contrast, the PCA based procedure which uses the same number of largest observations in both steps is even more sensitive to this choice than the standard estimator.
Similarly, the classical PCA estimator of probability (iv) strongly depends on the choice of while both other estimators stably have a very low error.
Next, we consider the Dirichlet model with total dimension when the limit measure is concentrated on a dimensional subspace. Figure 4 shows the mean empirical risk for PCA projecting on a dimensional subspace in the left plot and the mean operator norm of the difference between the estimated and the true projection matrix in the right plot. The empirical risk suggests to choose between 4 and 6 and not much larger than 50 for estimating the support of the limit measure.
Figure 5 shows the RMSE of the different estimators of the probabilities (i)–(iv) with and true values and 0, respectively. Here, we have used PCA with in the upper row, in the mid row and in the lower row. As expected, in most cases the PCA procedures perform worse when they project on too low dimensional subspaces, yet in the cases (i) and (iv) the differences are moderate. At first glance somewhat surprisingly, overall the PCA procedures exhibit a better behavior for than for the “correct” value . This may be explained by the fact that the extra dimension offers the opportunity to compensate for the difference between the subspaces minimizing the true resp. the empirical risk. This difference is expected to be larger if the dimension of the observed vectors is large, as can also be seen from the right plot in Figure 4.
Again, the PCA based estimators for probability (i) outperform the standard procedure, but the other probabilities are more accurately estimated by the standard procedures if (though all estimators of (iv) perform reasonably well). For , the RMSE of both variants of PCA based estimators of (ii) are very similar with a minimum value that is somewhat smaller than the minimum RMSE of the standard estimator. The performance of the standard estimator and the one based on classical PCA are almost identical for the probability (iii), while the estimator with PCA based on just largest observations is less accurate, probably because it is difficult to estimate a subspace of dimension 6 based on just 10 observations. It might help to increase the number of largest observations used to estimate the supporting subspace with the dimension , but we do not explore this idea here in order not to overload the presentation.
The mean empirical risk and the mean operator norm of the difference matrix are shown for the Gumbel copula with , and in Figure 6. Here, we have chosen Fréchet marginal distributions with cdf , . Based on the left plot, one may choose , or perhaps .
Figure 7 displays the RMSE of the estimators of (i)–(iv) with PCA projecting on two-dimensional subspaces. Here and the true values for (i)–(iv) are 0.3813, , and 0. Now the PCA which uses the same number in both steps performs worse than the standard estimator for probability (i), better for (ii) and very similar to the standard procedure for (iii) and . The PCA estimator which uses just 10 largest observations for estimating the support again outperforms the standard procedure for probability (i) and (ii), whereas it is has a slightly larger RMSE for the probability (iii). In any case, as in the Dirichlet model, its RMSE is rather insensitive to the choice of . If one chooses (plots not shown here), then the classical PCA has almost the same RMSE as the standard procedure for the probabilities (i) and (iii). The same holds true for the estimator based on for (iii) and (iv), while for (i) this estimator is still considerably more accurate than the standard procedure (though less so than for ) and both PCA procedures still outperform the standard estimator for probability (ii).
For the high-dimensional Gumbel model with and , by and large the findings are similar to the ones observed for the Dirichlet model so that we do not show the corresponding plots. However, in this model can be ruled out by the empirical risk plot and both PCA based estimators outperform the standard estimator of (ii).
Finally, we turn to the disturbed Dirichlet model with and where the observations are randomly rotated by an angular up to , leading to true values for (i)–(iv) of 0.653 (with ), 0.185, 0.770 and 0. The corresponding plots are shown in the Figures 8 and 9. In view of the empirical risk, the choices seem reasonable.
Again, the PCA procedure which uses the same largest observations in both steps performs better for the larger choice of , whereas the performance of the other PCA procedure improves only for (ii), while it does not change much for (iii) and it deteriorates a bit for (i) and (iv). The PCA estimators perform better than the standard procedure for probability (i) and for (iii) if is large (for classical PCA only if ), whereas for (ii), roughly speaking, overall the estimators perform similarly well with the standard procedure performing better for small values of and the PCA estimators for larger values.
To sum up, the plot of the empirical risk seems a useful tool to choose the dimension of the subspace onto which the PCA procedure projects. In particular, for the PCA method which uses the same number of largest observations to estimate the support and to calculate the estimator of the spectral measure, in case of doubt it seems advisable to project onto a subspace of higher dimension. While the PCA step does not always improve the estimator of the spectral measure, in most cases the resulting estimators seem competitive with the standard ones if is chosen appropriately, and for probability (i) they are superior. In particular the PCA estimators which determine the support based only on the largest 10 observations often exhibit a desirable insensitivity to the choice of largest observations used to estimate the spectral measure.
5 Appendix: Details of the proof of (3.12)
Recall that are iid random variables whose distribution equals the conditional distribution of given . Let
[TABLE]
First note that
[TABLE]
for all . Thus a simple version of the bounded difference inequality (see, e.g., Theorem 3.1 of McDiarmid, (1998)) gives
[TABLE]
Exactly in the same way as in the proof of Lemma 3.3, one obtains
[TABLE]
Let be an independent copy of . Then
[TABLE]
with
[TABLE]
To sum up, so far we have shown that
[TABLE]
Next consider the U-statistic with
[TABLE]
By equation (5.7) of Hoeffding, (1963), one has
[TABLE]
with . Hence,
[TABLE]
This, however, is equivalent to (3.12), because the joint distribution of \max\big{(}\phi^{+}(Z_{1:\ell}), \phi^{-}(Z_{1:\ell})\big{)} and is the same as the joint conditional distribution of and , given .
Acknowledgements: Holger Drees was partly supported by DFG grant DR271/6-2. Anne Sabourin was partly supported by the Chaire ‘Stress testing’ from École Polytechnique and BNP Paribas.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Anderson, (1963) Anderson, T. W. (1963). Asymptotic theory for principal component analysis. The Annals of Mathematical Statistics , 34(1):122–148.
- 2Beirlant et al., (2006) Beirlant, J., Goegebeur, Y., Segers, J., and Teugels, J. L. (2006). Statistics of extremes: theory and applications . John Wiley & Sons.
- 3Blanchard et al., (2007) Blanchard, G., Bousquet, O., and Zwald, L. (2007). Statistical properties of kernel principal component analysis. Machine Learning , 66(2-3):259–294.
- 4Chautru, (2015) Chautru, E. (2015). Dimension reduction in multivariate extreme value analysis. Electronic journal of statistics , 9(1):383–418.
- 5Chiapino and Sabourin, (2016) Chiapino, M. and Sabourin, A. (2016). Feature clustering for extreme events analysis, with application to extreme stream-flow data. In International Workshop on New Frontiers in Mining Complex Patterns , pages 132–147. Springer.
- 6Chiapino et al., (2019) Chiapino, M., Sabourin, A., and Segers, J. (2019). Identifying groups of variables with the potential of being large simultaneously. Extremes , 22(2):193–222.
- 7Cooley and Thibaud, (2016) Cooley, D. and Thibaud, E. (2016). Decompositions of dependence for high-dimensional extremes. ar Xiv preprint ar Xiv:1612.07190 .
- 8Einmahl et al., (2001) Einmahl, J. H., de Haan, L., and Piterbarg, V. I. (2001). Nonparametric estimation of the spectral measure of an extreme value distribution. The Annals of Statistics , 29(5):1401–1423.
