TL;DR
This paper introduces data-dependent weights to multivariate sign functions, enhancing robustness and efficiency in estimation and inference, with applications in location estimation, PCA, dimension reduction, and outlier detection.
Contribution
It proposes a novel weighted sign function approach that improves robustness and efficiency over traditional methods in multivariate analysis.
Findings
Weighted signs retain robustness properties.
Significant efficiency improvements demonstrated.
Effective in various multivariate applications.
Abstract
Multivariate sign functions are often used for robust estimation and inference. We propose using data dependent weights in association with such functions. The proposed weighted sign functions retain desirable robustness properties, while significantly improving efficiency in estimation and inference compared to unweighted multivariate sign-based methods. Using weighted signs, we demonstrate methods of robust location estimation and robust principal component analysis. We extend the scope of using robust multivariate methods to include robust sufficient dimension reduction and functional outlier detection. Several numerical studies and real data applications demonstrate the efficacy of the proposed methodology.
| 4-variate | SCM | Tyler | -H | -M | -P | -H | -M | -P |
|---|---|---|---|---|---|---|---|---|
| =20 | 1.04 | 1.02 | 1.10 | 1.07 | 1.02 | 1.09 | 1.07 | 0.98 |
| =50 | 1.08 | 1.08 | 1.16 | 1.16 | 1.13 | 1.19 | 1.19 | 1.13 |
| =100 | 1.31 | 1.31 | 1.42 | 1.38 | 1.36 | 1.46 | 1.44 | 1.36 |
| =300 | 1.46 | 1.54 | 1.81 | 1.76 | 1.95 | 1.88 | 1.88 | 1.95 |
| =500 | 1.92 | 1.93 | 2.23 | 2.03 | 2.31 | 2.35 | 2.19 | 2.39 |
| 4-variate | SCM | Tyler | -H | -M | -P | -H | -M | -P |
| =20 | 1.00 | 1.05 | 1.03 | 1.05 | 1.00 | 1.04 | 1.04 | 0.95 |
| =50 | 1.03 | 1.01 | 1.13 | 1.12 | 1.11 | 1.19 | 1.17 | 1.10 |
| =100 | 1.08 | 1.12 | 1.25 | 1.23 | 1.27 | 1.24 | 1.25 | 1.22 |
| =300 | 1.34 | 1.36 | 1.64 | 1.52 | 1.60 | 1.67 | 1.61 | 1.68 |
| =500 | 1.26 | 1.34 | 1.55 | 1.49 | 1.60 | 1.65 | 1.61 | 1.69 |
| 4-variate | SCM | Tyler | -H | -M | -P | -H | -M | -P |
| =20 | 0.90 | 0.89 | 0.95 | 0.98 | 0.98 | 0.96 | 1.01 | 0.95 |
| =50 | 0.90 | 0.91 | 1.01 | 0.98 | 0.98 | 1.03 | 1.04 | 0.99 |
| =100 | 0.87 | 0.87 | 0.93 | 0.95 | 1.01 | 0.99 | 1.01 | 1.05 |
| =300 | 0.87 | 0.87 | 1.09 | 1.09 | 1.17 | 1.14 | 1.16 | 1.23 |
| =500 | 0.88 | 0.92 | 1.10 | 1.10 | 1.23 | 1.19 | 1.22 | 1.29 |
| 4-variate | SCM | Tyler | -H | -M | -P | -H | -M | -P |
| =20 | 0.92 | 0.90 | 0.94 | 0.94 | 0.96 | 0.95 | 0.97 | 0.89 |
| =50 | 0.82 | 0.83 | 0.88 | 0.91 | 0.93 | 0.88 | 0.93 | 0.93 |
| =100 | 0.84 | 0.87 | 0.92 | 0.95 | 1.00 | 0.93 | 0.96 | 1.00 |
| =300 | 0.73 | 0.75 | 0.96 | 0.99 | 1.10 | 1.00 | 1.06 | 1.12 |
| =500 | 0.73 | 0.76 | 0.95 | 0.96 | 1.06 | 0.94 | 0.97 | 1.06 |
| 4-variate | SCM | Tyler | -H | -M | -P | -H | -M | -P |
| =20 | 0.89 | 0.92 | 0.92 | 0.92 | 0.90 | 0.96 | 0.95 | 0.89 |
| =50 | 0.82 | 0.84 | 0.89 | 0.90 | 0.91 | 0.93 | 0.96 | 0.92 |
| =100 | 0.77 | 0.76 | 0.90 | 0.90 | 0.96 | 0.94 | 0.98 | 1.04 |
| =300 | 0.73 | 0.77 | 0.93 | 0.91 | 0.98 | 1.00 | 0.98 | 1.03 |
| =500 | 0.67 | 0.71 | 0.83 | 0.83 | 0.96 | 0.88 | 0.90 | 1.00 |
| 4-variate Normal | SCM | Tyler | -H | -M | -P | -H | -M | -P |
| =20 | 0.82 | 0.84 | 0.87 | 0.90 | 0.91 | 0.89 | 0.93 | 0.89 |
| =50 | 0.80 | 0.81 | 0.87 | 0.88 | 0.88 | 0.88 | 0.92 | 0.88 |
| =100 | 0.68 | 0.71 | 0.80 | 0.85 | 0.91 | 0.82 | 0.86 | 0.92 |
| =300 | 0.61 | 0.63 | 0.82 | 0.85 | 0.93 | 0.86 | 0.91 | 0.96 |
| =500 | 0.60 | 0.64 | 0.77 | 0.80 | 0.90 | 0.82 | 0.86 | 0.96 |
| PD | HSD | |||||||
|---|---|---|---|---|---|---|---|---|
| Distribution | ||||||||
| 4.73 | 3.99 | 3.46 | 3.26 | 4.18 | 3.63 | 3.36 | 3.15 | |
| 2.97 | 3.28 | 2.49 | 2.36 | 2.59 | 2.45 | 2.37 | 2.32 | |
| 1.45 | 1.47 | 1.49 | 1.52 | 1.30 | 1.37 | 1.43 | 1.49 | |
| 1.15 | 1.19 | 1.23 | 1.27 | 1.01 | 1.10 | 1.17 | 1.24 | |
| 0.97 | 1.02 | 1.07 | 1.11 | 0.85 | 0.94 | 1.02 | 1.08 | |
| MVN | 0.77 | 0.84 | 0.89 | 0.93 | 0.68 | 0.77 | 0.84 | 0.91 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
On Weighted Multivariate Sign Functions
Subhabrata Majumdar
and
Snigdhansu Chatterjee
University of Minnesota, Minneapolis, MN, USA Currently at AT&T Labs Research. Email: [email protected] author. Email: [email protected]
Abstract: Multivariate sign functions are often used for robust estimation and inference. We propose using data dependent weights in association with such functions. The proposed weighted sign functions retain desirable robustness properties, while significantly improving efficiency in estimation and inference compared to unweighted multivariate sign-based methods. Using weighted signs, we demonstrate methods of robust location estimation and robust principal component analysis. We extend the scope of using robust multivariate methods to include robust sufficient dimension reduction and functional outlier detection. Several numerical studies and real data applications demonstrate the efficacy of the proposed methodology.
Keywords: Multivariate sign, Principal component analysis, Data depth, Sufficient dimension reduction
1 Introduction
Given a point in a normed linear space with norm denoted by , the generalized sign function with center is defined as
[TABLE]
This is a functional and multivariate generalization of the real-valued sign function, that takes the values one, negative one or zero if the point is to the right, left or equal respectively. This generalized sign function was introduced by Möttönen and Oja, (1995) for , the -dimensional real Euclidean space.
The function maps to the origin and all other points of to the unit sphere . Given a dataset , that we collect together in the matrix , an approach for robust estimation and inference in multivariate data starts by evaluating (3) on each observation, thus defining with respect to some suitable center , and then using these for robust location and scale estimation and inference, including inference for (Locantore et al.,, 1999; Oja,, 2010; Wang et al.,, 2015). Suppose . If the data are independent, identically distributed (hereafter, i.i.d.) from an elliptically symmetric distribution, then the eigenvectors of and of are the same for suitable centering parameter , that is, the population principal components from the original data and from its sign transformations are the same (Taskinen et al.,, 2012). However, valuable information is lost in the form of magnitudes of sample points. As a result, spatial sign-based procedures suffer from low efficiency. For example, eigenvector estimates obtained from the covariance matrix of are asymptotically inadmissible (Magyar and Tyler,, 2014) and Tyler’s M-estimate of scatter (Tyler,, 1987) has uniformly lower asymptotic risk.
In this paper, we propose to alleviate this low efficiency problem, by associating a data-driven weight with the generalized sign , that can be used to adaptively trade-off between efficiency and robustness considerations in any given application. We demonstrate the utility of using the proposed weighted generalized sign functions in a number of problems of current interest, including robust estimation of location and scatter.
Specifically, we propose using product of the generalized sign function and a weight function derived as a transformation of a data-depth function (Serfling,, 2006; Zuo and Serfling,, 2000). Like data-depth functions, the weight functions used in this paper are non-negative reals defined over , where is a fixed family of probability measures. For every choice of parameters and , in this paper
[TABLE]
is used as a robust surrogate for observation . Notice that for the trivial choice , , we get , the original centered observations. With the other trivial choice of , we get the generalized sign . However, in this paper we illustrate how using other weight functions can lead to interesting robustness and efficiency trade-offs in a variety of situations. The various technical conditions and assumptions that we impose on the weight function are valid for weights derived from three well-known data depth functions: the half-space depth, the Mahalanobis depth, and the projection depth. Note that for i.i.d. data, there are three parameters involved here: the location parameter in (3), the distribution used for the weight function, and the distribution of . For clarity and to mirror the contexts of how data depth has been used in the literature (Liu et al.,, 1999; Serfling,, 2006), we fix for this paper, although in Section 2 we briefly remark on the case when these distributions are different. Also note that in many problems of interest, is unknown and data-depths are computed using , however, under very standard regularity conditions (for example, see assumptions B1-B3 below), the properties of and are close enough for both asymptotic theory and practical applicability.
We primarily focus on the task of robust dispersion/scatter estimation and robust principal component analysis in this paper. Figure 1 presents an illustrative example of bivariate data with outliers in the top left panel, where the outliers are marked with red points. In the other panels, the generalized sign values of the same data are presented as black points on the unit circle, with the outliers again marked with red points. Notice that the black points from either the top right or bottom panels have very similar eigenvector structure as the original data without the outliers. The green and blue triangles are examples of the proposed weighted sign values: the top right (respectively, bottom left, bottom right) panels depict these values where the weights have been generated using Mahalanobis depth (respectively Tukey’s half-space depth and the projection depth). The blue triangles are the weighted sign values of the outliers. Notice that the eigenvectors from the weighted signs also capture the pattern from the original data without the outliers.
There are two unknown quantities in the generalized sign function as defined in (3): and , To estimate dispersion and its eigen-structure robustly, we must start with a robust estimator for . In Section 2 we present the case for weighted spatial quantiles, which can be defined and studied in very general spaces . One special case of this is the weighted spatial median. As a location estimator, it has several interesting robustness properties and can be shown to be more efficient that some existing robust location estimators, thus making it a perfect candidate to estimate .
Following that, we restrict to the for fixed , and present detailed discussions on our primary proposal for a robust measure of dispersion in Section 3, followed by a proposed * affine equivariant version* of it in Section 4, robust estimation of eigenvalues and a third robust estimator for dispersion in Section 5, and a thorough study of robustness and efficiency using influence functions in Section 6. We then report multiple simulation-based numeric studies in Section 7, present several real data examples in Section 8, and concluding remarks in Section 9.
In the rest of this paper, all finite-dimensional vectors are column vectors, and for a vector or matrix , the notation stands for its transpose. The Gaussian distribution with mean and variance is denoted by . The identity matrix is denoted by , with or without a subscript to denote its dimension. The notations respectively stand for the inverse, determinant, minimum and maximum eigenvalues of a matrix , whenever these are well-defined. For a scalar or vector valued random variable , denotes its expected value, while denotes its variance or covariance matrix.
2 The Weighted Spatial Median
Suppose the open unit sphere in is given by , and let . We also fix the set of probability measures , and select . Consider a random element , and define the function \Phi(q;X,u,{\mathbb{F}})=W(X,{\mathbb{F}})\bigl{\{}|X-q|+\langle u,X-q\rangle\bigr{\}}. We define the -th weighted spatial quantile of as the minimizer of the expectation of , that is
[TABLE]
This is a natural generalization of the spatial median (Chaudhuri,, 1996; Haldane,, 1948; Koltchinskii,, 1997) ( and ), or more general spatial quantiles (Cardot et al.,, 2017; Chakraborty and Chaudhuri,, 2014; Minsker,, 2015) (). We assume that is convex in for -almost all values of .
In what follows, for brevity we elaborate only the case of the weighted spatial median (thus \Psi(q;0,{\mathbb{F}})={\mathbb{E}}\bigl{[}W(X,{\mathbb{F}})|X-q|\bigr{]}) for the case of finite dimensional . Specifically, we demonstrate the utility of using the weighted spatial median as opposed to using the traditional, unweighted versions found in literature. The analysis and computation of the general weighted spatial quantiles will largely follow by extending the results of the above cited references, and we postpone details to a future study.
Assume that we have independent and identically distributed observations , and the sample weighted spatial median is computed by minimizing , and is denoted by , the second subscript is in acknowledgement that the weight function is used. We denote the unweighted version of this estimator, ie, the case where as . Assume the following technical conditions:
(A1)
is finite for all and has a unique minimizer .
(A2)
is twice differentiable at and the second derivative is positive definite.
(A3)
exists for all in a neighborhood of , and we use the notations
[TABLE]
These assumptions are very broad-based and general. The first one essentially requires the existence of a population parameter, the second one requires that the minimization approach is meaningful in the population, and the third one essentially requires that the weight function has a finite variance. No further restrictions are placed on the tuning parameter or the choice of the weight function.
Theorem 2.1**.**
Under assumptions [A1]-[A3], we have
[TABLE]
Thus, under very standard regularity conditions, the sample weighted spatial median is consistent and is asymptotically normal. Theorem 2.1 can be proved in several different ways. Here we use techniques following Haberman, (1989); Niemiro, (1992). Specifically, following Theorem 4 in Niemiro, (1992), which traces back to Haberman, (1989) with slightly relaxed conditions, we get
[TABLE]
where is any measurable subgradient of . Theorem 2.1 follows by applying the central limit theorem.
Remark.
Note that the technical conditions for the result presented in Theorem 2.1 is one of several alternatives that can conceived, and the scope of this result is broader than what is presented above. First, note that if and are different and the weights are not a function of , a situation that may arise in hypothesis testing problems where the weights are based on the null distribution, the convexity of follows automatically and is not an assumption. Second, even if is not convex but sufficiently smooth, we can have a central limit theorem, for example, by using techniques similar to Chatterjee and Bose, (2005). Choices of other than , e.g. Majumdar and Chatterjee, (2018), may lead to interesting interpretations of and the resulting location estimator and will be explored further in future.
2.1 Asymptotic efficiency of weighted spatial median
Let be the asymptotic variance of from Theorem 2.1, where we use the subscript “W” to denote that this depends on the weight function. We use the notation for the case where , that is, all weights are one. The asymptotic relative efficiency of two statistics is the -th root of the reciprocals of their determinants. That is,
[TABLE]
One easy result from Theorem 2.1 is that under reasonable conditions the asymptotic relative efficiency of the weighted spatial median over the unweighted version is always greater than one. We document this in the following corollary:
Corollary 2.2**.**
Assume that the weight function is bounded above by some , and the matrices and are positive definite. Then
[TABLE]
Consequently, if then this asymptotic relative efficiency is larger than 1.
Proof of Corollary 2.2.
Using the facts that for square matrices and for non-singular , we write
[TABLE]
The result follows, using the facts that and , and the upper bound on . ∎
We may wish to further explore the conditions of Corollary 2.2. Let us concentrate on the case of where the distribution of is spherically symmetric. Following Fang et al., (1990):
Definition 2.3**.**
A -dimensional random vector is said to elliptically distributed if there exist a vector , a positive semi-definite matrix and a function such that the characteristic function of is for .
There are several alternative formulations, see the above reference and citations within it for details.
Corollary 2.4**.**
Assume that is an elliptically symmetric distribution, with location parameter and the covariance matrix satisfies the following conditions:
, 2. 2.
.
Also assume that the weight function is (a) affine invariant, i.e. for , and (b) satisfies the following:
[TABLE]
Then we have .
The conditions 1 and 2 in Corollary 2.4 are due to Wang et al., (2015), and ensure that as diverges with , the eigenvalues of are bounded away from 0 and . Corollary 2.4 can be proved using Corollary 2.2, the fact that for elliptical distributions is non-singular (Taskinen et al.,, 2012), then using and expanding upon Lemma A.5 in Wang et al., (2015). We give the technical details in the supplementary material.
Obtaining affine invariant weight functions is not challenging. Most functions arising in the context of data depths are affine invariant. Corollary 2.4 implies that there is a wide frame of distributions and choices of weight functions where there is a benefit to considering weighted spatial medians. In fact, Corollary 2.4 can be established under just location invariance condition on weights. We use the stronger affine invariance condition here only because the subsequent sections use weights based on affine invariant data depth functions. Also note that in case (6) does not hold for the original form of a weight function, one can scale all weights by an appropriate constant to reduce the value in the left hand side of (6) to satisfy the condition.
In actual computations, in place of we propose using where is the empirical distribution function. Up to first order asymptotics, the analysis remain unchanged from the above for this modified weight function.
2.2 Examples of affine invariant weights
We now illustrate some specific choices of weight functions that are compatible with the conditions of Corollary 2.4 and the results in the rest of this paper. These arise as easy transformations of data-depth functions. A data depth function is defined on , where is a fixed set of probability measures. The main property of a data-depth function is that for every probability measure , there exists a constant such that for any and ,
[TABLE]
That is, for every fixed , the data-depth function achieves a supremum at , and is non-decreasing in every direction away from , thus providing a center-outward partial ordering of points in . There are generally several algebraic and analytic properties assumed for data-depth functions to elicit interesting mathematical properties, see for example Serfling, (2006); Zuo and Serfling, (2000) for details.
The spherically symmetric case of an elliptical distribution is realized with for some . We fix the notation , and let . Note that is a spherically symmetric distribution and hence depends only on , and and . Taking affine invariant data depth functions as weights ensures that . It is now easy to show that results in this paper are valid for the weight functions (with appropriate scaling to satisfy (6)) () derived from the half-space depth, () derived from the Mahalanobis depth, and () , where stands for median absolute deviation, derived from the projection depth. We omit the technical details. These three weight functions give a center-inward partial ordering, thus essentially quantifying peripherality instead of depth. Note however, that our results below are of much more general form, and these three special choices of weights only serve as important illustrative examples to achieve desirable robustness and efficiency balance in data analysis.
3 A robust measure of dispersion
From this section onwards, we assume that , that is, the support of the random variable under study is the -dimensional Euclidean plane, and the data are independent and identically distributed from an elliptical distribution with parameters and . We also assume that is absolutely continuous, with , and that is positive definite. This eliminates technicalities arising from rank deficient cases, and makes the weight functions affine invariant. Thus, we essentially restrict the rest of this paper to the same framework as in Corollary 2.4. Most of the results below generalize to the case where is a separable Hilbert space, however additional technicalities are involved, as in Bali et al., (2011), and will be considered in a future project.
3.1 The Weighted Sign Covariance Matrix
Consider the spectral decomposition of given by , where is an orthogonal matrix and is diagonal with positive diagonal elements . Also denote the -th eigenvector of by for . Thus, the -th column of is . In the rest of this paper we use the notation , and hence . Recall from Section 2 that we use the notation for the distribution of , and that is a spherically symmetric distribution and hence depends only on . Additionally, to simplify notations, for any random variable , we occasionally use the abbreviated notation . Note that is a random weight, and takes the same value as . Also, as noted in Corollary 2.4, is a function of only. We additionally assume that .
It is convenient to write
[TABLE]
where is a random variable uniformly distributed on the unit sphere and is another random variable independent of satisfying . Note that , and , . Then we have
[TABLE]
As a robust surrogate for , we consider the following random variable
[TABLE]
In samples, the equivalent for is for a suitable location estimator , for example, the weighted spatial median. We fix the notation , and define the following dispersion parameter:
[TABLE]
In the following Theorem, we establish that the eigenvectors of and are identical, although their eigenvalues may be different.
Theorem 3.1**.**
Under the conditions listed above, we have , where is a diagonal matrix. Thus, the eigenvectors of and are identical.
Proof of Theorem 3.1.
Fix any index . Consider the vector such that
[TABLE]
Then and have the same distribution, and note that almost surely. Consequently, for any we have
[TABLE]
Consequently, , as established in Theorem 1 of Taskinen et al., (2012).
Also, since the weight is a function of , we have that is independent of . Consequently, we have
[TABLE]
where is a diagonal matrix. ∎
3.2 Sample version of
We now discuss the properties of the sample version of , say computed from . In practice, we cannot obtain , and consequently use instead. We assume the following conditions:
(B1)
Bounded weights: The weights are bounded functions.
(B2)
Uniform convergence:
[TABLE]
almost surely as .
(B3)
Smoothness under perturbation: For all , there exists a , possibly depending on , such that for any
[TABLE]
In the above, denotes point mass at . These properties are easily satisfied for weight functions derived from standard depth functions, for example, , and discussed earlier.
The following result allows us to use the empirical, plug-in weights and an estimated location parameter in the weighted dispersion estimator. A natural choice for the location parameter estimator is the solution to , which is the same as the sample version of the weighted spatial median discussed in Section 2.
Lemma 3.2**.**
Assume that . Also assume that we have a location estimator satisfying . Then
[TABLE]
where for any such that for , we have .
Proof of Lemma 3.2.
Note that the moment condition above is satisfied by default since has an elliptical distribution (Dürre et al.,, 2014). The proof is mostly algebra, and we provide a sketch of the main arguments. We have
[TABLE]
Using the stated technical conditions, we can now show that for . For illutration, we present the case for below.
Notice that the -th element of is given by
[TABLE]
and hence
[TABLE]
Let H(X_{i})=|X_{i}-\mu|^{-2}\bigl{(}c^{T}(X_{i}-\mu)\bigr{)}^{2}, and notice that almost surely for . Now notice that conditional on except for a null set (possibly depending on ) we have . Thus, except for a null set , and the conclusion follows for this part.
The argument for follows a similar argument. ∎
Let be the vectorized version of . We are now in a position to state the result for consistency of the sample version of ,
Theorem 3.3**.**
Assume the conditions of Lemma 3.2. Then
[TABLE]
where .
The asymptotic normality follows from our assumptions and as a direct consequence of Lemma 3.2. Incidentally, an expression for can be explicitly obtained in terms of , and , but is algebraic in nature. We present it in the supplementary material.
We now use Theorem 3.3 to obtain consistency results for the eigenvectors obtained from . Suppose that are the eigenvalues of , which we assume are all distinct values.
Theorem 3.4**.**
Suppose the spectral decomposition of is given by . Then the matrix of centered and scaled eigenvectors and the vector of centered and scaled eigenvalues have independent distributions. The distribution of the random variable converges weakly to a -variate normal distribution with mean and the variance matrix whose -th block of entries are given by
[TABLE]
The distribution of converges weakly to a -dimensional normal distribution with mean and the variance-covariance matrix whose -the element is
[TABLE]
The proof of this result follows from using Theorem 3.3 and using techniques similar to a corresponding result in Taskinen et al., (2012). We omit the algebraic details here and put it in the supplementary material.
Recall that the asymptotic variance of the -th eigenvector of the sample covariance matrix, say is (Anderson,, 2009):
[TABLE]
Suppose is the -th eigenvector of , whose asymptotic behavior is presented above in Theorem 3.4.
This leads to the following useful result:
Corollary 3.5**.**
The asymptotic relative efficiency of , relative to , is given by
[TABLE]
The proof of this Corollary is immediate.
4 An affine equivariant robust measure of dispersion
A desirable invariance property of any dispersion parameter corresponding to a random variable is that under affine transformation the dispersion parameter scales to . It is clear that does not possess this property, since it remains unchanged for and for any scalar .
We follow the general framework of M-estimation with data-dependent weights (Huber,, 1981) to construct an affine equivariant counterpart of the . Specifically, we implicitly define
[TABLE]
To ensure existence and uniqueness of , consider the class of dispersion parameters that are obtained as solutions of the following equation:
[TABLE]
with . Under the following assumptions on the scalar valued functions and , the above equation produces a unique solution (Huber,, 1981):
(C1)
The function is monotone decreasing, and for ;
(C2)
The function is monotone decreasing, and for ;
(C3)
Both and are bounded and continuous,
(C4)
,
(C5)
For any hyperplane in the sample space , (i) and (ii) .
Putting things into context, in our case we have , . We proceed to verify the other conditions for the weight functions , and discussed earlier.
It is easy to verify that the resulting from the above choices satisfy (C1) and (C3). Note that is a finite positive constant, and (C2) and (C3) are also easily satisfied. Since in all the above cases, (C4) is also easy to check. Since is absolutely continuous, (C5) holds trivially.
To compute the sample version of , we solve (13) iteratively by obtaining a sequence of positive definite matrices until convergence. Thus, using the location estimator , we may iterate
[TABLE]
The asymptotic properties of can be obtained using methods similar to those of Section 3, and techniques presented in Croux and Haesbroeck, (2000) and elsewhere. We state the following result and omit its proof.
Theorem 4.1**.**
The asymptotic covariance matrix of an eigenvector of the sample affine equivariant scatter functional is given by
[TABLE]
where is the asymptotic variance of an off-diagonal element of when the underlying distribution is . It follows that if is the -th eigenvector of ,
[TABLE]
5 Robust estimation of eigenvalues, and a plug-in estimator of
As seen in Theorem 3.1, eigenvalues of the are not same as the population eigenvalues. In this section, we discuss on robust estimation of ’s using . Assume the data is centered, the robust estimator from Section 2 suffices. We start by computing the sample version and its spectral decomposition:
[TABLE]
We then use the following steps:
Randomly divide the sample indices into disjoint groups of size each. 2. 2.
Transform the data matrix: . 3. 3.
Calculate coordinate-wise variances for each group of indices :
[TABLE]
where is the vector of column-wise means of , the submatrix of with row indices in . 4. 4.
Obtain estimates of eigenvalues by taking coordinate-wise medians of these variances:
[TABLE]
We collect , in the diagonal matrix . The number of subgroups used to calculate this median-of-small-variances estimator can be determined following Minsker, (2015). There can be other ways of estimating the eigenvalues of using also, we will pursue such methods elsewhere. We construct a consistent plug-in estimator of the population covariance matrix as
[TABLE]
Let denote the Frobenius norm of a matrix , in other words, . The following result establishes that this is a consistent estimator of :
Theorem 5.1**.**
Suppose that as , and . Then we have
[TABLE]
Proof of Theorem 5.1.
This proof has many algebraic steps, and we sketch the main arguments below.
Suppose . Owing to the fact that the Frobenius norm is invariant under rotations and that is finite and fixed, it suffices to show that the off-diagonal elements of converge in probability to zero, and that the difference between the -th diagonal element of and converges to zero for any .
Now notice that from Theorem 3.4 we have that , where the -th element of the remainder satisfies . We can show, using standard algebra, that
[TABLE]
where the -th element of the remainder satisfies . This follows immediately from above, the fact that is finite and fixed, and all elements of are constants. This immediately establishes the case for the off-diagonal elements.
For the diagonal elements, notice that since , each coordinate-wise variance for each group of indices is a consistent estimator of . The result follows.
∎
6 Influence Functions of Dispersion Measures
We retain the framework adopted in Section 3, and discuss in this section the robustness and efficiency properties associated with and , and principal components derived therefrom. We do not discuss here, since the properties of that approach follow from those of . We additionally assume that the eigenvalues of are distinct, and given by , to avoid several additional technical conditions for the theoretical results to follow. The case where the eigenvalues of can have multiplicity greater than one requires no additional conceptual development, but does require considerable algebraic manipulations.
For studying the robustness aspect, we first present some results relating to influence functions in the current context. Influence functions quantify how much influence a sample point, especially an infinitesimal contamination, has on any functional of a probability distribution (Hampel et al.,, 1986). Given any probability distribution , the influence function of any point for some functional on the distribution is defined as
[TABLE]
where ; being the distribution with point mass at . When for some -integrable function , .
It now follows that
[TABLE]
Recall that are the eigenvalues of , which we assume are all distinct values.
Proposition 6.1**.**
The influence function of as follows:
[TABLE]
The proof of Proposition 6.1 follows from Sibson, (1979) and Croux and Haesbroeck, (2000), we omit the details.
If the weight function is a bounded function, as is the case of , , and , the influence function given in Proposition 6.1 is bounded, indicating good robustness properties of the principal component analysis.
We now derive the influence function for .
Proposition 6.2**.**
The influence function of is given by
[TABLE]
Proof of Proposition 6.2.
Let . As a first step, since is affine equivariant, we obtain from Croux and Haesbroeck, (2000) that
[TABLE]
From Lemma 1 of (Hampel et al.,, 1986), page 276, we obtain that there exist scalar valued functions and such that
[TABLE]
consequently we obtain
[TABLE]
∎
Suppose are the eigenvalues of , which we assume are all distinct values. Also denote the -th eigenvector of by for .
Proposition 6.3**.**
The influence function of may be obtained as
[TABLE]
We omit the proof of Proposition 6.3, which follows along similar lines to to rest of the computations of this section. It can be shown that when is a bounded function, is also bounded, along the lines of Huber, (1981), which in turn implies that the influence function for a principal component based on is also bounded.
7 Simulation Studies
We report a number of numerical simulation studies on several properties relating to and , and their eigenvalues and eigenvectors, on datasets with or without influential points, to illustrate the finite sample efficiency and robustness properties of the proposed weighted estimators. We compare these proposed estimators with techniques that exists in literature, specifically, the Sign Covariance Matrix (SCM) and Tyler’s shape matrix (Tyler,, 1987).
7.1 Efficiency of different robust estimators
We compare the performance of and with that of the SCM and Tyler’s scatter matrix. For this study, we fix the dimension . We consider six elliptical distributions, and from every distribution draw 10000 samples each for sample sizes and . All distributions are centered at , and have covariance matrix .
We use the concept of principal angles (Miao and Ben-Israel,, 1992) to find out error estimates for the first eigenvector of a scatter matrix. In our case, the first eigenvector is
[TABLE]
We measure the prediction error for an eigenvector estimator (say, ), using the smallest angle between the true and predicted vectors, i.e. . A small absolute value of this angle means to better prediction. We repeat this 10,000 times and calculate the Mean Squared Prediction Angle:
[TABLE]
where is the value of in the -th replication, . The finite sample efficiency of relative to that from the sample covariance matrix, i.e. is obtained as:
[TABLE]
The results from this simulation exercise are presented in Table 1. It can be seem that -based estimators (columns 3-5) outperform SCM and Tyler’s -estimator of scatter. Among the 3 depth functions considered, projection depth gives the best results. Its finite sample performances are better than Tyler’s and Huber’s M-estimators of scatter, as well as their symmetrized counterparts that are much more computationally intensive (see Table 4 in Sirkiä et al., (2007)). The affine equivariant -based estimators (columns 6-8) are even more efficient.
7.2 Influence function comparison
In Figure 2 we consider first eigenvectors of , the Sign Covariance Matrix (SCM) and Tyler’s shape matrix (Tyler,, 1987). We generate data from and set and plot norms of the eigenvector influence functions for different values of . Let us denote the -th eigenvector of the Sign Covariance Matrix and Tyler’s shape matrix by and , respectively. Their influence functions are given as follows:
[TABLE]
Panels (b) and (c) in Figure 2, corresponding to Sign Covariance Matrix and Tyler’s shape matrix respectively, exhibit an ‘inlier effect’, that is, points close to the center having high influence, which results in loss of efficiency. On the other hand, the influence function for eigenvector estimates of the sample covariance matrix (panel (a)) is unbounded and makes the corresponding estimates non-robust. In comparison, the corresponding to weights derived from projection depth, half-space depth and Mahalanobis depth have bounded influence functions and small values of the influence function at ‘deep’ points.
7.3 Efficiency of affine equivariant robust estimator
We study the finite sample efficiency properties of using a simulation exercise. We consider 6 different elliptic distributions, namely, the -variate multivariate Normal distribution and the multivariate distributions corresponding to degrees of freedom 5, 6, 10, 15 and 25. We compute the ARE of the estimator for the first eigenvector using , using weights based on the projection depth (PD) and the halfspace depth (HSD), thus this simulation is an illustration of how different choices of weights affect the results in the context of Theorem 4.1. We consider using the sample covariance matrix as the baseline method for this study. The ARE values are computed by using Monte-Carlo simulation of samples and subsequent numerical integration. We report the results of this exercise in Table 2. Based on these results, we notice that is particularly efficient in lower dimensions for distributions with heavier tails ( and ), while for distributions with lighter tails, the AREs increase with data dimension. At higher values of , note that is almost as efficient as the sample covarnace matrix even when the data comes from multivariate normal distribution.
7.4 Robust sufficient dimension reduction and supervised learning
One of the main usages of obtaining dispersion estimators and their eigenvalues and eigenvectors is in dimension reduction techniques. Examples of such uses are in principal component regression, partial least squares and envelope methods. We illustrate below the latter technique, in the context of sufficient dimension reduction (SDR). For details on envelope methods and other uses of robust estimators of dispersion and eigen-structures, see Cook et al., (2010); Adragni and Cook, (2009); Cook and Zhang, (2015) and references and citations of these articles. In the context of multivariate-response () linear regression, the envelope method proposes the model , where are independent mean zero Gaussian noise terms with covariance matrix whose spectral representation can be written as
[TABLE]
Thus, the eigenvectors of are partitioned into two blocks: and , and the regression coefficient of on is given by for some . Dimension reduction is achieved when , typically without extraneous assumptions like sparsity. The envelope model for generalized linear models is discussed in Adragni and Cook, (2009), and may be used for supervised learning. Nonlinear regression models may also be handled similarly.
Given a set of examples , an envelope-based prediction for the response for any may be obtained from
[TABLE]
The above assumes that the covariates come from the Gaussian distribution , and appropriate changes may be made for other distributions.
We design a robust version of the above, by using weighted spatial medians for location parameters corresponding to the distributions of and , and using the first eigenvectors of as . A robust location estimator for the distribution of is required for the estimation of . Details are available in Adragni and Cook, (2009).
In a non-linear regression model, we compare the performance of the robust version of SDR with the original method of Adragni and Cook, (2009) with or without the presence of bad leverage points in . For any given choice of covariate dimension , we take and , and generate the responses as independent standard normal, and as Normal with mean in each of the coordinates, and variance . We measure performances of the SDR models by their mean squared prediction error on another set of 200 observations generated similarly, and taking the average of these errors on 100 such training-test pairs of datasets. The above steps are repeated for the choices of .
Panel (a) of figure 3 compares prediction errors using both robust and maximum likelihood SDR estimates when the covariates contain no outliers: here the two methods are virtually indistinguishable. We then introduce outliers in each of the 100 datasets by adding 100 to first coordinates of the first 10 observed covariate values, and repeat the analysis. Panel (b) of the figure shows that the robust SDR method remains more accurate in predicting out of sample observations for all values of than the standard SDR.
8 Real Data Applications
We now present an application of our proposed approach to some real data problems.
Robust techniques are useful when in identifying outlying observations, and we illustrate below how to make use of our fixed-dimensional methods presented earlier for functional (and hence infinite-dimensional) data.
We follow the approach of Boente and Salibian-Barrera, (2015) for performing robust principal component analysis on functional data using the estimated eigenvectors from . Suppose the data consistents of curves, say , each observed at a set of common design points . We model each of these functions as a linear combination of mutually orthogonal B-spline basis functions . We map data for each of the functions onto the coordinate system formed by the spline basis:
[TABLE]
We then model the -th row of the matrix , denoted by as follows:
[TABLE]
where is a location parameter, is a loading matrix, is a score vector, and is the error term. We obtain robust estimators of , and consequently using . Define . The orthogonal distance (OD) corresponding to this projection is defined as
[TABLE]
Analogously, the score distance (SD) is defined as
[TABLE]
where are the top eigenvalues from . For outlier detection, following Hubert et al., (2005) we set the upper cutoff values for score distances at and orthogonal distances at
[TABLE]
where is the standard normal cumulative distribution function.
We apply the above outlier detection method on two data sets. First, we consider the monthly average sea surface temperature anomaly data from June 1970 to May 2004, available from http://www.cpc.ncep.noaa.gov/data, depicted in top-left panel of Figure 4)). Second, we consider the octane data, which consists of 226 variables and 39 observations (Esbensen et al.,, 1994). Each sample is a gasoline compound with a certain octane number, and has its NIR absorbance spectra measured in 2 nm intervals between 1100 - 1550 nm. There are 6 outliers here: compounds 25, 26 and 36-39, which contain alcohol. This data is presented in top-right panel of Figure 4)).
In the sea surface temperature data, using a cubic spline basis with knots at alternate months starting in June, we get a close approximation as depicted in middle-left panel of Figure 4. Using our proposed methodology with results in two points having their SD and OD larger than cutoff, depicted in bottom-left panel of Figure 4. These points correspond to the time periods June 1982 to May 1983 and June 1997 to May 1998 are marked by black curves in panels a and c, and pinpoint the two seasons with strongest El-Niño events.
On the octane data, we use the same methodology, and again the top robust PC turns out to be sufficient in identifying all 6 outliers. Details are available in the right side panels of Figure 4.
9 Conclusions
We propose the use of a weighted multivariate sign transformation for robust estimation and inference, and as demonstrated by theoretical results and several simulation studies and data examples, in many situations using a data-depth driven weight function leads to considerable efficiency gain without compromising on robustness properties. Our methodology seems to suggest new ways of identifying El-Niño or La-Niña events from the sea-surface temperature anomaly data, which will be studied further later.
Several of our results stated above are for data from the Euclidean space , where is fixed. The cases where increases with sample size and may be higher than sample size, and where data are from a separable Hilbert space, will be considered in a future work. There are only few conceptual challenges to such extensions, however, there are several technical and algebraic challenges.
Acknowledgements
The research of SC is partially supported by the National Science Foundation (NSF) under grants # DMS-1622483, # DMS-1737918.
Supplementary material
Appendix A Form of
First observe that for having covariance matrix ,
[TABLE]
where is the covariance matrix of , the elliptic distribution with mean and covariance matrix . Now,
[TABLE]
The matrix consists of elements at position of the block, and 0 otherwise. These positions correspond to variance and covariance components of on-diagonal elements. For the expectation matrix, all its elements are of the form , with . Since is even in , which has a spherically symmetric distribution, all such expectations will be 0 unless are all equal or pairwise equal. Following a similar derivation for spatial sign covariance matrices in Magyar and Tyler, (2014), we collect the non-zero elements and write the matrix of expectations:
[TABLE]
where with having 1 as -th element and 0 elsewhere, and .
Putting everything together, denote by the sample version of , the weighted covariance matrix obtained from , i.e. . Then the different types of elements in the matrix are as given below ():
- •
Variance of on-diagonal elements
[TABLE]
- •
Variance of off-diagonal elements ()
[TABLE]
- •
Covariance of two on-diagonal elements ()
[TABLE]
- •
Covariance of two off-diagonal elements ()
[TABLE]
- •
Covariance of one off-diagonal and one on-diagonal element ()
[TABLE]
The above give all the elements of . We plug these in (17) to recover .
Appendix B Proofs
Proof sketch of Corollary 2.4.
We start with slightly modified versions of Lemmas A.4 and A.5 in Wang et al., (2015):
Lemma B.1**.**
Given that condition 1 in Corollary 2.4 holds, we have .
Lemma B.2**.**
Define . Then and . Further, if conditions 1 and 2 of Corollary 2.4 hold then .
The lemmas can be proved using similar steps as the proofs of the original lemmas in Wang et al., (2015) and using the upper bound on the weight function. Corollary 2.4 is now proved by applying Corollary 2.2, plugging in upper bound of from Lemma B.1 and lower bound of from Lemma B.2 into the ARE expression. ∎
Proof of Theorem 3.4.
We suppose . In spirit, this corollary is similar to Theorem 13.5.1 in Anderson, (2009). We start with the following result, due to (Taskinen et al.,, 2012), allows us to obtain asymptotic joint distributions of eigenvectors and eigenvalues of , provided we know the limiting distribution of itself:
Theorem B.3**.**
Let be defined as before, and be any positive definite symmetric matrix such that at the limiting distribution of is a -variate (singular) normal distribution with mean zero. Write the spectral decomposition of as . Then the limiting distributions of and are multivariate (singular) normal and
[TABLE]
The first matrix picks only off-diagonal elements of the left-hand side and the second one only diagonal elements. We shall now use this as well as the form of the asymptotic covariance matrix of the vectorized , i.e. to obtain limiting variance and covariances of eigenvalues and eigenvectors.
Due to the decomposition (18) we have, for , the following relation between any off-diagonal element of and the corresponding element in the estimate of eigenvectors, say :
[TABLE]
So that for eigenvector estimates of the original we have
[TABLE]
Now and for , so the above equation implies
[TABLE]
For the covariance terms, from (19) we get, for ,
[TABLE]
The exact forms given in the statement of the corollary now follows from the Form of in Section A.
For the on-diagonal elements of , using Theorem B.3 we have for the eigenvalue of , say ,
[TABLE]
for . Hence
[TABLE]
A similar derivation gives the expression for . Finally, since the asymptotic covariance between an on-diagonal and an off-diagonal element of , it follows that the elements of and diagonal elements of are independent. ∎
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Adragni and Cook, (2009) Adragni, K. P. and Cook, R. D. (2009). Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences , 367(1906):4385 – 4405.
- 2Anderson, (2009) Anderson, T. (2009). An Introduction to Multivariate Statistical Analysis . Wiley, Hoboken, NJ, 3 edition.
- 3Bali et al., (2011) Bali, J. L., Boente, G., Tyler, D. E., and Wang, J.-L. (2011). Robust functional principal components: A projection-pursuit approach. The Annals of Statistics , 39(6):2852 – 2882.
- 4Boente and Salibian-Barrera, (2015) Boente, G. and Salibian-Barrera, M. (2015). s 𝑠 s -estimators for functional principal component analysis. Journal of the American Statistical Association , 110(511):1100 – 1111.
- 5Cardot et al., (2017) Cardot, H., Cénac, P., and Godichon-Baggioni, A. (2017). Online estimation of the geometric median in Hilbert spaces: Nonasymptotic confidence balls. The Annals of Statistics , 45(2):591 – 614.
- 6Chakraborty and Chaudhuri, (2014) Chakraborty, A. and Chaudhuri, P. (2014). The spatial distribution in infinite dimensional spaces and related quantiles and depths. The Annals of Statistics , 42(3):1203 – 1231.
- 7Chatterjee and Bose, (2005) Chatterjee, S. and Bose, A. (2005). Generalized bootstrap for estimating equations. The Annals of Statistics , 33(1):414 – 436.
- 8Chaudhuri, (1996) Chaudhuri, P. (1996). On a geometric notion of quantiles for multivariate data. Journal of the American Statistical Association , 91(434):862 – 872.
