Discriminant analysis in small and large dimensions
Taras Bodnar, Stepan Mazur, Edward Ngailo, Nestor Parolya

TL;DR
This paper analyzes the distribution and performance of linear discriminant analysis in both small and high-dimensional settings, providing new asymptotic results and efficient error rate computations.
Contribution
It introduces a stochastic representation for discriminant coefficients and derives their asymptotic distribution in high dimensions, enhancing understanding of LDA's behavior.
Findings
Asymptotic distribution of discriminant coefficients derived
Efficient computation of error rates established
Comparison with optimal error rates in known-parameter scenarios
Taras Bodnar (a, corresponding author), Stepan Mazur (b), Edward Ngailo (a) and Nestor Parolya (c)
a Department of Mathematics, Stockholm University, Roslagsvägen 101, SE-10691 Stockholm, Sweden
b Unit of Statistics, School of Business, Örebro University, Fakultetsgatan 1, SE-70182 Örebro, Sweden
c Institute of Statistics, Leibniz University Hannover, Königsworther Platz 1, D-30167 Hannover, Germany
Corresponding author: Taras Bodnar. E-mail: [email protected]. Tel: +46 8 164562. Fax: +46 8 612 6717. This research was partly supported by the Swedish International Development Cooperation Agency (SIDA) through the UR-Sweden Programme for Research, Higher Education and Institutional Advancement. Stepan Mazur acknowledges financial support from the project "Ambit fields: probabilistic properties and statistical inference" funded by Villum Fonden.
Abstract
We study the distributional properties of the linear discriminant function under the assumption of normality by comparing two groups with the same covariance matrix but different mean vectors. A stochastic representation for the discriminant function coefficients is derived which is then used to obtain their asymptotic distribution under the high-dimensional asymptotic regime. We investigate the performance of the classification analysis based on the discriminant function in both small and large dimensions. A stochastic representation is established which allows to compute the error rate in an efficient way. We further compare the calculated error rate with the optimal one obtained under the assumption that the covariance matrix and the two mean vectors are known. Finally, we present an analytical expression of the error rate calculated in the high-dimensional asymptotic regime. The finite-sample properties of the derived theoretical results are assessed via an extensive Monte Carlo study.
AMS Classification: 62H10, 62E15, 62E20, 60F05, 60B20
Keywords: discriminant function, stochastic representation, large-dimensional asymptotics, random matrix theory, classification analysis
1 Introduction
In the modern world of science and technology, high-dimensional data are present in various fields such as finance, environmental science and the social sciences. Because of the many complex multivariate dependencies observed in data, formulating correct models and developing inferential procedures are the major challenges. Traditional multivariate analysis considers a fixed or small dimension, while the sample sizes approach infinity. However, its methods can no longer be used in the high-dimensional setting, where the dimension is not treated as fixed but is allowed to be comparable to the sample size.
The covariance matrix is one of the most widely used tools for capturing the dependence between variables. Although its application is restricted to linear dependence, and more sophisticated methods such as copulas should be applied in the general case, modeling the dynamics of the covariance matrix is still a very popular subject in both statistics and econometrics. Recently, a number of papers have been published which deal with estimating the covariance matrix (see, e.g., Ledoit and Wolf (2003), Cai and Liu (2011a), Cai et al. (2011), Agarwal et al. (2012), Fan et al. (2008), Fan et al. (2013), Bodnar et al. (2014, 2016)) and testing its structure (see, e.g., Johnstone (2001), Bai et al. (2009), Chen et al. (2010), Cai and Jiang (2011), Jiang and Yang (2013), Gupta and Bodnar (2014)) in large dimension.
In many applications, the covariance matrix is accompanied by the mean vector. For example, the product of the inverse sample covariance matrix and the difference of the sample mean vectors appears in the discriminant function, where a linear combination of variables (the discriminant function coefficients) is determined such that the standardized distance between the groups of observations is maximized. A second example arises in portfolio theory, where the vector of optimal portfolio weights is proportional to the product of the inverse sample covariance matrix and the sample mean vector (see Bodnar and Okhrin (2011)).
Discriminant analysis is a multivariate technique concerned with separating distinct sets of objects (or observations) (Johnson et al., 2007). Its two main tasks are to distinguish distinct sets of observations and to allocate new observations to previously defined groups (Rencher and Christensen, 2012). The main methods of discriminant analysis are the linear discriminant function and the quadratic discriminant function. The linear discriminant function is a generalization of Fisher linear discriminant analysis, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more groups of objects in the best way. The application of the linear discriminant function is restricted by the assumption of equal covariance matrices in the groups to be separated. Although the quadratic discriminant function can be used when this assumption is violated, its application is computationally more demanding, requires estimating the covariance matrix of each group, and needs more observations than the linear discriminant function (Narsky and Porter, 2013). Moreover, the decision boundary is easy to understand and to visualize in high-dimensional settings if the linear discriminant function is used.
Discriminant analysis is a well-established topic in multivariate statistics. Many asymptotic results are available when the sample sizes of the groups to be separated are assumed to be large, while the number of variables is fixed and significantly smaller than the sample size (see, e.g., Muirhead (1982), Rencher and Christensen (2012)). However, these results cannot automatically be transferred to the case when the number of variables is comparable to the sample size, which is known in the statistical literature as the high-dimensional asymptotic regime. It is remarkable that in this case the results obtained under the standard asymptotic regime can deviate significantly from those obtained under the high-dimensional asymptotics (see, e.g., Bai and Silverstein (2010)). Fujikoshi and Seo (1997) provided an asymptotic approximation of the linear discriminant function in high dimension by considering the case of equal sample sizes and compared the results with the classical asymptotic approximation by Wyman et al. (1990). For samples of non-equal sizes, they pointed out that the high-dimensional approximation is extremely accurate. However, Tamatani (2015) showed that the Fisher linear discriminant function performs poorly due to diverging spectra in the case of large-dimensional data and small sample sizes. Bickel and Levina (2004) and Srivastava and Kubokawa (2007) investigated the asymptotic properties of the linear discriminant function in high dimension, while modifications of the linear discriminant function can be found in Cai and Liu (2011b) and Shao et al. (2011). Asymptotic results for the discriminant function coefficients in matrix-variate skew models can be found in Bodnar et al. (2017b).
We contribute to the statistical literature by deriving a stochastic representation of the discriminant function coefficients and of the classification rule based on the linear discriminant function. These results provide an efficient way of simulating these random quantities, and they are also used to derive their high-dimensional asymptotic distributions. With these distributions, the error rate of the classification rule based on the linear discriminant function can be easily assessed and the problem of increasing dimensionality can be visualized in a simple way.
The rest of the paper is organized as follows. The finite-sample properties of the discriminant function are presented in Section 2.1, where, in particular, we derive a stochastic representation for the discriminant function coefficients. In Section 2.2, an exact one-sided test for the comparison of the population discriminant function coefficients is suggested, while a stochastic representation for the classification rule is obtained in Section 2.3. The finite-sample results are then used to derive the asymptotic distributions of the discriminant function coefficients and of the classification rule in Section 3, while the finite-sample performance of the asymptotic distribution is analysed in Section 3.2.
2 Finite-sample properties of the discriminant function
Let $\mathbf{x}_{1}^{(1)},\ldots,\mathbf{x}_{n_{1}}^{(1)}$ and $\mathbf{x}_{1}^{(2)},\ldots,\mathbf{x}_{n_{2}}^{(2)}$ be two independent samples from the multivariate normal distributions which consist of independent and identically distributed random vectors with $\mathbf{x}_{i}^{(1)}\sim\mathcal{N}_{p}(\boldsymbol{\mu}_{1},\mathbf{\Sigma})$ for $i=1,\ldots,n_{1}$ and $\mathbf{x}_{j}^{(2)}\sim\mathcal{N}_{p}(\boldsymbol{\mu}_{2},\mathbf{\Sigma})$ for $j=1,\ldots,n_{2}$, where $\mathbf{\Sigma}$ is positive definite. Throughout the paper, $\mathbf{1}_{n}$ denotes the $n$-dimensional vector of ones, $\mathbf{I}_{n}$ is the $n\times n$ identity matrix, and the symbol $\otimes$ stands for the Kronecker product.
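Let $\mathbf{X}_{1}=(\mathbf{x}_{1}^{(1)},\ldots,\mathbf{x}_{n_{1}}^{(1)})$ and $\mathbf{X}_{2}=(\mathbf{x}_{1}^{(2)},\ldots,\mathbf{x}_{n_{2}}^{(2)})$ be the observation matrices. Then the sample estimators for the mean vectors and the covariance matrices constructed from each sample are given by
$$\bar{\mathbf{x}}^{(j)}=\frac{1}{n_{j}}\mathbf{X}_{j}\mathbf{1}_{n_{j}},\qquad \mathbf{S}^{(j)}=\frac{1}{n_{j}-1}\mathbf{X}_{j}\left(\mathbf{I}_{n_{j}}-\frac{1}{n_{j}}\mathbf{1}_{n_{j}}\mathbf{1}_{n_{j}}^{T}\right)\mathbf{X}_{j}^{T},\qquad j=1,2.$$
The pooled estimator for the covariance matrix, i.e., an estimator for $\mathbf{\Sigma}$ obtained from the two samples, is then given by
$$\mathbf{S}_{pl}=\frac{1}{n_{1}+n_{2}-2}\left[(n_{1}-1)\mathbf{S}^{(1)}+(n_{2}-1)\mathbf{S}^{(2)}\right].$$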
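The estimators of this section can be sketched in a few lines of NumPy. This is only an illustration: it assumes observation matrices of shape p × n_j with observations stored in columns, the usual unbiased sample covariance for each group, and a pooled estimator that weights the group covariances by their degrees of freedom. The function names and the simulated data are hypothetical.

```python
import numpy as np

def group_estimators(X):
    """Sample mean vector and unbiased sample covariance of a p x n matrix X."""
    n = X.shape[1]
    xbar = X.mean(axis=1)
    centered = X - xbar[:, None]          # subtract the mean from every column
    S = centered @ centered.T / (n - 1)
    return xbar, S

def pooled_covariance(X1, X2):
    """Pooled covariance estimator built from two samples (assumed weighting)."""
    n1, n2 = X1.shape[1], X2.shape[1]
    _, S1 = group_estimators(X1)
    _, S2 = group_estimators(X2)
    return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

# Illustrative data: two groups with identity covariance and shifted means.
rng = np.random.default_rng(0)
p, n1, n2 = 3, 40, 60
X1 = rng.normal(size=(p, n1)) + np.array([1.0, 0.0, 0.0])[:, None]
X2 = rng.normal(size=(p, n2))
S_pl = pooled_covariance(X1, X2)
```

The per-group covariance agrees with `np.cov(X)`, which also treats rows as variables and divides by n - 1.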
The following lemma (see, e.g., Rencher and Christensen (2012, Section 5.4.2)) presents the joint distribution of $\bar{\mathbf{x}}^{(1)}$, $\bar{\mathbf{x}}^{(2)}$ and $\mathbf{S}_{pl}$.
Lemma 1.
Let \mathbf{X}_{1}\sim\mathcal{N}_{p,n_{1}}\left(\mbox{\boldmath\mu}_{1}\mathbf{1}_{n_{1}}^{T},\mathbf{\Sigma}\otimes\mathbf{I}_{n_{1}}\right) and \mathbf{X}_{2}\sim\mathcal{N}_{p,n_{2}}\left(\mbox{\boldmath\mu}_{2}\mathbf{1}_{n_{2}}^{T},\mathbf{\Sigma}\otimes\mathbf{I}_{n_{2}}\right) for . Assume that and are independent. Then
- (a) $\bar{\mathbf{x}}^{(1)}\sim\mathcal{N}_{p}\left(\boldsymbol{\mu}_{1},\frac{1}{n_{1}}\mathbf{\Sigma}\right)$,
- (b) $\bar{\mathbf{x}}^{(2)}\sim\mathcal{N}_{p}\left(\boldsymbol{\mu}_{2},\frac{1}{n_{2}}\mathbf{\Sigma}\right)$,
- (c) $(n_{1}+n_{2}-2)\mathbf{S}_{pl}\sim\mathcal{W}_{p}(n_{1}+n_{2}-2,\mathbf{\Sigma})$.
Moreover, $\bar{\mathbf{x}}^{(1)}$, $\bar{\mathbf{x}}^{(2)}$ and $\mathbf{S}_{pl}$ are mutually independently distributed.
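The results of Lemma 1 imply, in particular, that
$$\bar{\mathbf{x}}^{(1)}-\bar{\mathbf{x}}^{(2)}\sim\mathcal{N}_{p}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2},\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)\mathbf{\Sigma}\right),$$
which is independent of $\mathbf{S}_{pl}$.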
2.1 Stochastic representation for the discriminant function coefficients
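The discriminant function coefficients are given by the following vector
$$\hat{\mathbf{a}}=\mathbf{S}_{pl}^{-1}\left(\bar{\mathbf{x}}^{(1)}-\bar{\mathbf{x}}^{(2)}\right),$$
which is the sample estimator of the population discriminant function coefficient vector expressed as
$$\mathbf{a}=\mathbf{\Sigma}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}).$$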
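As a rough numerical illustration, the sample coefficient vector can be compared with its population counterpart. The sketch below assumes the standard forms, namely the inverse pooled covariance matrix times the difference of sample means for the estimator and the inverse covariance matrix times the difference of population means for the target. The function name, the parameter values and the sample sizes are hypothetical.

```python
import numpy as np

def discriminant_coefficients(X1, X2):
    """Sample discriminant coefficients from two p x n_j observation matrices."""
    n1, n2 = X1.shape[1], X2.shape[1]
    diff = X1.mean(axis=1) - X2.mean(axis=1)
    # np.cov treats rows as variables and uses the unbiased (n - 1) divisor.
    S_pl = ((n1 - 1) * np.cov(X1) + (n2 - 1) * np.cov(X2)) / (n1 + n2 - 2)
    # Solve S_pl a = diff instead of forming the inverse explicitly.
    return np.linalg.solve(S_pl, diff)

rng = np.random.default_rng(1)
p, n1, n2 = 4, 2000, 2000
mu1 = np.array([1.0, 0.5, 0.0, 0.0])
mu2 = np.zeros(p)
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)            # an arbitrary positive definite matrix

X1 = rng.multivariate_normal(mu1, Sigma, size=n1).T
X2 = rng.multivariate_normal(mu2, Sigma, size=n2).T

a_hat = discriminant_coefficients(X1, X2)  # sample coefficients
a_pop = np.linalg.solve(Sigma, mu1 - mu2)  # population coefficients
```

With the dimension small relative to the sample sizes, as here, `a_hat` lies close to `a_pop`; the rest of the paper is concerned with what happens when this is no longer the case.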
We consider a more general problem by deriving the distribution of linear combinations of the discriminant function coefficients. This result has several practical applications: (i) it allows a direct comparison of the population coefficients in the discriminant function by deriving a corresponding statistical test; (ii) it can be used in the classification problem where, given a new observation vector, one has to decide to which of the two groups the observation vector should be assigned.
Let $\mathbf{L}$ be a $k\times p$ matrix of constants such that $\mathrm{rank}(\mathbf{L})=k<p$. We are then interested in
$$\hat{\boldsymbol{\theta}}=\mathbf{L}\hat{\mathbf{a}}=\mathbf{L}\mathbf{S}_{pl}^{-1}\left(\bar{\mathbf{x}}^{(1)}-\bar{\mathbf{x}}^{(2)}\right).$$
Choosing different matrices $\mathbf{L}$, we are able to provide different inferences about the linear combinations of the discriminant function coefficients. For instance, if $k=1$ and $\mathbf{L}^{T}$ is the vector with all elements zero except the one in the $j$th position, which is one, then we get the distribution of the $j$th coefficient in the discriminant function. If we choose $k=1$ and $\mathbf{L}^{T}=(1,-1,0,\ldots,0)^{T}$, then we analyse the difference between the first two coefficients in the discriminant function. The corresponding result can be further used to test whether the population counterparts of these coefficients are zero or not. For $k>1$, several linear combinations of the discriminant function coefficients are considered simultaneously.
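The choices of the constant matrix described above can be made concrete with a small sketch. The coefficient values below are placeholders rather than estimates from data, the helper name is hypothetical, and positions are zero-based as usual in Python.

```python
import numpy as np

def unit_vector(p, j):
    """Vector of length p with a one in (zero-based) position j, zeros elsewhere."""
    e = np.zeros(p)
    e[j] = 1.0
    return e

p = 5
a_hat = np.array([0.8, -0.3, 0.1, 0.0, 0.5])   # placeholder coefficient vector

# One row: picks out a single coefficient (here the third one).
l_single = unit_vector(p, 2)

# One row: difference between the first two coefficients.
l_diff = np.zeros(p)
l_diff[0], l_diff[1] = 1.0, -1.0

# Two rows stacked: both linear combinations considered simultaneously.
L = np.vstack([l_single, l_diff])
theta_hat = L @ a_hat
```

Stacking rows in this way keeps the matrix of full row rank as long as the chosen combinations are linearly independent.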
In the next theorem we derive a stochastic representation for $\hat{\boldsymbol{\theta}}$. The stochastic representation is a very important tool for analysing the distributional properties of random quantities. It is widely used in computational statistics (e.g., Givens and Hoeting (2012)), in the theory of elliptical distributions (see Gupta et al. (2013)), as well as in Bayesian statistics (cf. Bodnar et al. (2017a)). Later on, we use the symbol $\stackrel{d}{=}$ to denote equality in distribution.
Theorem 1.
Let be an arbitrary matrix of constants such that . Then, under the assumption of Lemma 1 the stochastic representation of \hat{}\mbox{\boldmath\theta}=\mathbf{L}\hat{\mathbf{a}} is given by
[TABLE]
where ; , \breve{\mathbf{x}}\sim\mathcal{N}_{p}\left(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2},\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)\mathbf{\Sigma}\right), and . Moreover, , and are mutually independent.
Proof.
From Lemma 1.(c) and Theorem 3.4.1 of Gupta and Nagar, (2000) we obtain that
[TABLE]
Also, since and are independent, the conditional distribution of
\hat{}\mbox{\boldmath\theta}=\mathbf{L}\mathbf{S}^{-1}_{pl}\breve{\mathbf{x}} given equals to the distribution of \mbox{\boldmath\theta}^{*}=\mathbf{L}\mathbf{S}^{-1}_{pl}\breve{\mathbf{x}}^{*} and it can be rewritten in the following form
[TABLE]
Applying Theorem 3.2.12 of Muirhead, (1982) we obtain that
[TABLE]
and its distribution is independent of . Hence,
[TABLE]
and , are independent.
Using Theorem 3 of Bodnar and Okhrin, (2008) we get that is independent of for given . Therefore, is independent of and, respectively, is independent of . Furthermore, from the proof of Theorem 1 of Bodnar and Schmid, (2008) it holds that
[TABLE]
with .
Thus, we obtain the following stochastic representation of \hat{}\mbox{\boldmath\theta} which is given by
[TABLE]
where ; , \breve{\mathbf{x}}\sim\mathcal{N}_{p}\left(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2},\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)\mathbf{\Sigma}\right), and . Moreover, , and are mutually independent. The theorem is proved. ∎
In the next corollary we consider the special case when , that is, when is a -dimensional vector of constants.
Corollary 1.
Let and let be a -dimensional vector of constants. Then, under the condition of Theorem 1, the stochastic representation of is given by
[TABLE]
where , , (non-central -distribution with and degrees of freedom and non-centrality parameter ) with ; , and are mutually independently distributed.
Proof.
From Theorem 1 we get that
[TABLE]
where ; , , and \breve{\mathbf{x}}\sim\mathcal{N}_{p}\left(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2},\lambda\mathbf{\Sigma}\right) with ; , and are mutually independent.
Because \breve{\mathbf{x}}\sim\mathcal{N}_{p}\left(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2},\lambda\mathbf{\Sigma}\right), , and , the application of Corollary 5.1.3a of Mathai and Provost, (1992) leads to
[TABLE]
where \delta^{2}=(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2})^{T}\mathbf{R}_{\mathbf{l}}(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2})/\lambda. Moreover, since , the application of Theorem 5.5.1 of Mathai and Provost, (1992) proves that and are independently distributed.
Finally, we note that the random variable has the following stochastic representation
[TABLE]
where and ; and are independent. Hence,
[TABLE]
where
[TABLE]
Putting all above together we get the statement of the corollary. ∎
2.2 Test for the population discriminant function coefficients
One of the most important questions when discriminant analysis is performed is to decide which coefficients are the most influential in the decision. Several methods exist in the literature, the following three approaches being the most popular (cf. Rencher and Christensen (2012, Section 5.5)): (i) standardized coefficients; (ii) partial $F$-values; (iii) correlations between the variables and the discriminant function. Rencher (1998, Theorem 5.7A) argued that each of these three methods has several drawbacks. For instance, the correlations between the variables and the discriminant function do not show the multivariate contribution of each variable, but provide only univariate information on how each variable separates the groups, ignoring the presence of the other variables.
In this section, we propose an alternative approach based on statistical hypothesis testing. Namely, exact statistical tests are derived for the null hypothesis that two population discriminant function coefficients are equal (two-sided test) as well as for the alternative hypothesis that one coefficient in the discriminant function is larger than another (one-sided test). The hypotheses for testing the equality of the $i$-th and the $j$-th coefficients in the population discriminant function are given by
$$H_{0}:a_{i}=a_{j}\quad\text{against}\quad H_{1}:a_{i}\neq a_{j},\qquad(19)$$
while in the case of the one-sided test we check whether
$$H_{0}:a_{i}\leq a_{j}\quad\text{against}\quad H_{1}:a_{i}>a_{j}.\qquad(20)$$
In both cases the following test statistic is suggested
[TABLE]
with
[TABLE]
The distribution of follows from (Bodnar and Okhrin,, 2011, Theorem 6) and it is summarized in Theorem 2.
Theorem 2.
Let and let be a -dimensional vector of constants. Then, under the condition of Theorem 1,
- (a)
the density of is given by
[TABLE]
with , \eta=\frac{\mathbf{l}^{T}\mathbf{\Sigma}^{-1}(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2})}{\sqrt{\mathbf{l}^{T}\mathbf{\Sigma}^{-1}\mathbf{l}}}, and s=(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2})^{T}\mathbf{R}_{\mathbf{l}}(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2}); the symbol denotes the density of the distribution . 2. (b)
Under the null hypothesis it holds that and is independent of .
Theorem 2 shows that the test statistics has a standard -distribution under the null hypothesis. As a result, the suggested test will reject the null hypothesis of the two-sided test (19) as soon as .
The situation is more complicated in the case of the one-sided test (20). In this case the maximal probability of the type I error has to be controlled. For that reason, we first calculate the probability of rejecting the null hypothesis for all possible parameter values and then calculate its maximum over the parameters which correspond to the null hypothesis in (20). Since the distribution of depends on $\boldsymbol{\mu}_{1}$, $\boldsymbol{\mu}_{2}$, and $\mathbf{\Sigma}$ only through $\eta$ and $s$ (see Theorem 2), the task of finding the maximum is significantly simplified. Let denote the distribution function of the distribution . For any constant , we get
[TABLE]
where the last equality follows from the fact that the distribution function of the non-central -distribution is a decreasing function in non-centrality parameter and . Consequently, we get and the one-sided test rejects the null hypothesis in (20) as soon as .
2.3 Classification analysis
Having a new observation vector $\mathbf{x}$, we classify it to one of the two groups considered. Assuming that no prior information is available about the classification result, i.e., the prior probability of each group is $1/2$, the decision based on the optimal rule is to assign the observation vector $\mathbf{x}$ to the first group as soon as the following inequality holds (cf. Rencher (1998, Section 6.2))
$$(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{T}\mathbf{\Sigma}^{-1}\left[\mathbf{x}-\frac{1}{2}(\boldsymbol{\mu}_{1}+\boldsymbol{\mu}_{2})\right]>0\qquad(23)$$
and to the second group otherwise. The error rate is defined as the probability of classifying the observation into one group when it comes from the other one. Rencher (1998) presented the error rate as
$$\Phi\left(-\frac{\Delta}{2}\right)\quad\text{with}\quad\Delta^{2}=(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})^{T}\mathbf{\Sigma}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}),$$
where $\Phi(\cdot)$ denotes the distribution function of the standard normal distribution.
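For known parameters and equal priors, the standard closed form of this error rate is the standard normal distribution function evaluated at minus one half of the Mahalanobis distance between the two mean vectors. The sketch below computes that value and checks it against a direct simulation of the known-parameter rule. The function names and parameter values are hypothetical, and the closed form is the textbook expression, stated here as an assumption about what the display above contains.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def optimal_error_rate(mu1, mu2, Sigma):
    """Closed-form error rate Phi(-Delta / 2) of the known-parameter rule."""
    diff = mu1 - mu2
    delta = math.sqrt(diff @ np.linalg.solve(Sigma, diff))
    return std_normal_cdf(-delta / 2.0)

def simulated_error_rate(mu1, mu2, Sigma, n_draws, rng):
    """Monte Carlo error rate of the known-parameter rule with equal priors."""
    a = np.linalg.solve(Sigma, mu1 - mu2)
    mid = 0.5 * (mu1 + mu2)
    x1 = rng.multivariate_normal(mu1, Sigma, size=n_draws)
    x2 = rng.multivariate_normal(mu2, Sigma, size=n_draws)
    err1 = np.mean((x1 - mid) @ a <= 0)    # group-1 points assigned to group 2
    err2 = np.mean((x2 - mid) @ a > 0)     # group-2 points assigned to group 1
    return 0.5 * (err1 + err2)

rng = np.random.default_rng(2)
mu1 = np.array([1.0, 0.0])
mu2 = np.array([-1.0, 0.0])
Sigma = np.eye(2)

exact = optimal_error_rate(mu1, mu2, Sigma)            # Delta = 2 here
approx = simulated_error_rate(mu1, mu2, Sigma, 200_000, rng)
```

In this example the Mahalanobis distance equals 2, so the closed-form value is the standard normal distribution function at -1, roughly 0.159, and the simulated frequency should agree with it up to Monte Carlo error.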
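In practice, however, $\boldsymbol{\mu}_{1}$, $\boldsymbol{\mu}_{2}$, and $\mathbf{\Sigma}$ are unknown quantities and the decision is based on the inequality
$$\left(\bar{\mathbf{x}}^{(1)}-\bar{\mathbf{x}}^{(2)}\right)^{T}\mathbf{S}_{pl}^{-1}\left[\mathbf{x}-\frac{1}{2}\left(\bar{\mathbf{x}}^{(1)}+\bar{\mathbf{x}}^{(2)}\right)\right]>0\qquad(24)$$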
instead. Next, we derive the error rate of the decision rule (24). Let
[TABLE]
In Theorem 3 we present the stochastic representation of .
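Before turning to the distributional result, the sample-based decision rule can be sketched as follows. The sketch assumes the usual plug-in form, in which the population mean vectors and covariance matrix of the optimal rule are replaced by the sample means and the pooled covariance estimator. Function names and the toy data are hypothetical, and observations are stored in rows here for convenience.

```python
import numpy as np

def fit_plugin_rule(X1, X2):
    """Estimate the coefficient vector and midpoint from two n_j x p samples."""
    n1, n2 = X1.shape[0], X2.shape[0]
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    S_pl = ((n1 - 1) * np.cov(X1, rowvar=False)
            + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    a_hat = np.linalg.solve(S_pl, xbar1 - xbar2)
    midpoint = 0.5 * (xbar1 + xbar2)
    return a_hat, midpoint

def classify(x, a_hat, midpoint):
    """Assign x to group 1 if the sample discriminant score is positive, else 2."""
    return 1 if a_hat @ (x - midpoint) > 0 else 2

# Toy training data: two well-separated groups with identity covariance.
rng = np.random.default_rng(5)
X1 = rng.normal(size=(50, 2)) + np.array([3.0, 3.0])
X2 = rng.normal(size=(50, 2)) + np.array([-3.0, -3.0])
a_hat, midpoint = fit_plugin_rule(X1, X2)

label_a = classify(np.array([3.0, 3.0]), a_hat, midpoint)
label_b = classify(np.array([-3.0, -3.0]), a_hat, midpoint)
```

The score `a_hat @ (x - midpoint)` is the random quantity whose distribution determines the error rate of the sample-based rule.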
Theorem 3.
Let . Then, under the condition of Theorem 1, the stochastic representation of is given by
[TABLE]
where with and , , , ; , are independent of where and are independent as well.
Proof.
Let \mathbf{x}\sim\mathcal{N}_{p}\left(\mbox{\boldmath\mu}_{i},\mathbf{\Sigma}\right), Since , , , and are independently distributed, we get that the conditional distribution of given and is equal to the distribution of defined by
[TABLE]
where \tilde{\mathbf{x}}=\mathbf{x}-\frac{1}{2}(\bar{\mathbf{x}}^{(1)}_{0}+\bar{\mathbf{x}}^{(2)}_{0})\sim\mathcal{N}_{p}\left(\mbox{\boldmath\mu}_{i}-\frac{1}{2}(\bar{\mathbf{x}}^{(1)}_{0}+\bar{\mathbf{x}}^{(2)}_{0}),\mathbf{\Sigma}\right), , and are independent.
Following the proof of Corollary 1, we get
[TABLE]
where u\sim\mathcal{F}\left(p-1,n_{1}+n_{2}-p,\left(\mbox{\boldmath\mu}_{i}-\frac{1}{2}(\bar{\mathbf{x}}^{(1)}_{0}+\bar{\mathbf{x}}^{(2)}_{0})\right)^{T}\mathbf{R}_{0}\left(\mbox{\boldmath\mu}_{i}-\frac{1}{2}(\bar{\mathbf{x}}^{(1)}_{0}+\bar{\mathbf{x}}^{(2)}_{0})\right)\right) with , , and ; , and are mutually independently distributed.
In using that
[TABLE]
and , we get
[TABLE]
where u|\bar{\mathbf{x}}^{(1)},\bar{\mathbf{x}}^{(2)}\sim\mathcal{F}\left(p-1,n_{1}+n_{2}-p,\left(\bar{\mathbf{x}}^{(i)}-\mbox{\boldmath\mu}_{i}\right)^{T}\mathbf{R}_{\mathbf{x}}\left(\bar{\mathbf{x}}^{(i)}-\mbox{\boldmath\mu}_{i}\right)\right) with , , and ; , are independent of .
Since and are independent and normally distributed, we get that
[TABLE]
and, consequently,
[TABLE]
where we used that .
The application of Theorem 5.5.1 in Mathai and Provost, (1992) shows that given the random variables (\bar{\mathbf{x}}^{(1)}-\bar{\mathbf{x}}^{(2)})^{T}\mathbf{\Sigma}^{-1}(\bar{\mathbf{x}}^{(i)}-\mbox{\boldmath\mu}_{i}) and (\bar{\mathbf{x}}^{(i)}-\mbox{\boldmath\mu}_{i})\mathbf{R}_{\mathbf{x}}(\bar{\mathbf{x}}^{(i)}-\mbox{\boldmath\mu}_{i}) are independently distributed with
[TABLE]
and, by using Corollary 5.1.3a of Mathai and Provost, (1992),
[TABLE]
with
[TABLE]
where we use that and \mathbf{R}_{\mbox{\boldmath\mu}}=\mathbf{\Sigma}^{-1}-\mathbf{\Sigma}^{-1}(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2})(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2})^{T}\mathbf{\Sigma}^{-1}/(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2})^{T}\mathbf{\Sigma}^{-1}(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2}).
As a result, we get
[TABLE]
where , with , , and ; , are independent of .
Finally, it holds with \Delta^{2}=(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2})^{T}\mathbf{\Sigma}^{-1}(\mbox{\boldmath\mu}_{1}-\mbox{\boldmath\mu}_{2}) that
[TABLE]
where both summands are independent following Theorem 5.5.1 in Mathai and Provost, (1992). The application of Corollary 5.1.3a in Mathai and Provost, (1992) leads to
[TABLE]
and
[TABLE]
From the last statement we get the stochastic representation of expressed as
[TABLE]
where with and , , , ; , are independent of where and are independent as well. ∎
Theorem 3 shows that the distribution of is determined by six random variables , , and . Moreover, it depends on $\boldsymbol{\mu}_{1}$, $\boldsymbol{\mu}_{2}$, and $\mathbf{\Sigma}$ only via the quadratic form $\Delta^{2}$. As a result, the error rate based on the decision rule (24) is a function of $\Delta$ only and it is calculated by
[TABLE]
The two probabilities in (2.3) can easily be approximated for all , , , and with high precision by applying the results of Theorem 3 via the following simulation study:
- (i) Fix and .
- (ii) Generate four independent random variables , , , and .
- (iii) Generate with .
- (iv) Generate .
- (v) Calculate following the stochastic representation (26) of Theorem 3.
- (vi) Repeat steps (ii)-(v) for leading to the sample , …, .
The procedure has to be performed for both values of where for the relative number of events will approximate the first summand in (2.3) while for the relative number of events will approximate the second summand in (2.3).
It is important to note that the difference between the error rates calculated for the two decision rules (23) and (24) could be very large, as shown in Figure 1, where and are calculated for several values of with fixed values of . If we do not observe large differences between and computed for different sample sizes. However, this statement no longer holds when becomes comparable to both and as documented for and . This case is known in the literature as the large-dimensional asymptotic regime and it is investigated in detail in Section 3.
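The growth of the error rate with the dimension can be reproduced qualitatively by brute force. The sketch below does not use the stochastic representation of Theorem 3. Instead it redraws the training samples directly, builds the plug-in rule, and evaluates its conditional error rate in closed form given the training data. It assumes an identity covariance matrix and a mean difference concentrated in the first coordinate; all names and settings are hypothetical.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def plugin_error_rate(p, n1, n2, delta, n_rep, rng):
    """Average error rate of the plug-in rule over n_rep training sets (Sigma = I)."""
    mu1 = np.zeros(p)
    mu1[0] = delta                         # Mahalanobis distance equals delta
    mu2 = np.zeros(p)
    total = 0.0
    for _ in range(n_rep):
        X1 = rng.normal(size=(n1, p)) + mu1    # rows are observations
        X2 = rng.normal(size=(n2, p)) + mu2
        xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
        S_pl = ((n1 - 1) * np.cov(X1, rowvar=False)
                + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
        a_hat = np.linalg.solve(S_pl, xbar1 - xbar2)
        mid = 0.5 * (xbar1 + xbar2)
        sd = math.sqrt(a_hat @ a_hat)      # sqrt(a' Sigma a) with Sigma = I
        # Given the training data, the score of a new x is normal, so the
        # conditional misclassification probabilities are available exactly.
        e1 = std_normal_cdf(-(a_hat @ (mu1 - mid)) / sd)
        e2 = std_normal_cdf((a_hat @ (mu2 - mid)) / sd)
        total += 0.5 * (e1 + e2)
    return total / n_rep

rng = np.random.default_rng(3)
rates = {p: plugin_error_rate(p, 50, 50, 2.0, 200, rng) for p in (2, 40, 80)}
optimal = std_normal_cdf(-1.0)             # known-parameter error rate for delta = 2
```

For the smallest dimension the averaged error rate stays near the known-parameter value, while it rises markedly once the dimension approaches the total sample size.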
3 Discriminant analysis under large-dimensional asymptotics
In this section we derive the asymptotic distribution of the discriminant function coefficients under the high-dimensional asymptotic regime, that is, when the dimension increases together with the sample sizes and they all tend to infinity. More precisely, we assume that as .
The following conditions are needed for the validity of the asymptotic results:
- (A1)
There exists such that uniformly on . 2. (A2)
It is remarkable that, no assumption on the eigenvalues of the covariance matrix , like they are uniformly bounded on , is imposed. The asymptotic results are also valid when possesses unbounded spectrum as well as when its smallest eigenvalue tends to zero as . The constant is a technical one and it controls the growth rate of the quadratic form. In Theorem 4 the asymptotic distribution of linear combinations of the discriminant function coefficients is provided.
Theorem 4.
Assume (A1) and (A2). Let be a -dimensional vector of constants such that is uniformly on , . Then, under the conditions of Theorem 1, the asymptotic distribution of is given by
[TABLE]
for as with
[TABLE]
where denotes the indicator function of set .
Proof.
Using the stochastic representation (11) of Corollary 1, we get
[TABLE]
where , , with ; , and are mutually independently distributed.
Since, , we get that
[TABLE]
for as and, consequently,
[TABLE]
for where and are independent.
Furthermore, we get (see, (Bodnar and Reiß,, 2016, Lemma 3))
[TABLE]
Putting the above results together, we get the statement of the theorem with
[TABLE]
∎
The results of Theorem 4 show that the quantity is present only in the asymptotic variance . Moreover, if , then the factor vanishes and therefore the assumption (A2) is no longer needed. However, in the case of we need (A2) in order to keep the variance bounded. We further investigate this point via simulations in Section 3.3, by choosing and considering small and large such that .
3.1 Classification analysis in high dimension
The error rate of the classification analysis based on the optimal decision rule (23) remains the same independently of and it is always equal to
[TABLE]
In practice, however, $\boldsymbol{\mu}_{1}$, $\boldsymbol{\mu}_{2}$, and $\mathbf{\Sigma}$ are not known and, consequently, one has to make the decision based on (24) instead of (23). In Theorem 5, we derive the asymptotic distribution of under the large-dimensional asymptotics.
Theorem 5.
Assume (A1) and (A2). Let and for as . Then, under the conditions of Theorem 1, it holds that
[TABLE]
for as .
Proof.
The application of Theorem 3 leads to
[TABLE]
[TABLE]
where the last line follows from Lemma 3 in Bodnar and Reiß (2016) and Slutsky's theorem (see DasGupta (2008, Theorem 1.5)). ∎
The parameters of the limit distribution derived in Theorem 5 can be significantly simplified in the special case of because of . The results of Theorem 5 are also used to derive the approximate error rate for the decision (24). Let . Then, the error rate is given by
[TABLE]
with
[TABLE]
where we approximate by .
In the special case of which leads to , we get
[TABLE]
with
[TABLE]
which is always smaller than one. Furthermore, for we get .
In Figure 2, we plot as a function of for . We also add the plot of in order to compare the error rates of the two decision rules. Since only finite values of are considered in the figure we put and also choose . Finally, the ratio in the definition of is approximated by . We observe that lies very close to for . However, the difference between the two curves becomes considerable as grows, especially for and larger values of .
3.2 Finite-sample performance
In this section we present the results of the simulation study. The aim is to investigate how well the asymptotic distribution of a linear combination of the discriminant function coefficients performs for finite dimension and finite sample size. For that reason we compare the asymptotic distribution of the standardized as given in Theorem 4 to the corresponding exact distribution, obtained as a kernel density approximation with the Epanechnikov kernel applied to simulated data from the standardized exact distribution. These data are generated following the stochastic representation of Corollary 1: (i) first, are sampled independently from the corresponding univariate distributions provided in Corollary 1; (ii) second, is computed by using (11) and then standardized as in Theorem 4; (iii) finally, the previous two steps are repeated for times to obtain a sample of size . It is noted that could be large to ensure a good performance of the kernel density estimator.
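The kernel density step can be sketched as follows. This is only an illustration of an Epanechnikov kernel density estimator: the sample is a stand-in drawn from the standard normal distribution rather than from the stochastic representation of Corollary 1, the bandwidth uses a common rule of thumb for this kernel (an assumption, not taken from the paper), and the function name is hypothetical.

```python
import numpy as np

def epanechnikov_kde(sample, grid, bandwidth):
    """Kernel density estimate on `grid` with the Epanechnikov kernel."""
    u = (grid[:, None] - sample[None, :]) / bandwidth
    # K(u) = 0.75 * (1 - u^2) on [-1, 1] and zero outside.
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
    return k.mean(axis=1) / bandwidth

rng = np.random.default_rng(4)
sample = rng.normal(size=5000)             # stand-in for the standardized draws
grid = np.linspace(-4.0, 4.0, 161)

# Rule-of-thumb bandwidth for the Epanechnikov kernel (assumed choice).
bandwidth = 2.34 * sample.std(ddof=1) * sample.size ** (-1 / 5)

density = epanechnikov_kde(sample, grid, bandwidth)
normal_pdf = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)
max_gap = np.max(np.abs(density - normal_pdf))
```

Plotting `density` against `normal_pdf` on the same grid gives the kind of comparison between a simulated finite-sample density and a limiting normal density that the figures of this section report.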
In the simulation study, we take (-dimensional vector of ones). The elements of and are drawn from the uniform distribution on when , while the first ten elements of \mbox{\boldmath\mu}_{1} and the last ten elements of \mbox{\boldmath\mu}_{2} are generated from the uniform distribution on and the rest of the components are taken to be zero when . We also take as a diagonal matrix, where every element is uniformly distributed on . The results are compared for several values of and the corresponding values of . Simulated data consist of independent repetitions. In both cases and we plot two asymptotic density functions to investigate how robust are the obtained results to the choice of .
In Figures 3-4, we present the results in the case of equal and large sample sizes (data are drawn with in Figure 3 and with in Figure 4), while the plots in Figure 5 correspond to the case of one small sample and one large sample. We observe that the impact of the incorrect specification of is not large, while some deviations are observed in Figure 5 for small values of . If increases, then the difference between the two asymptotic distributions becomes negligible. In contrast, larger differences between the asymptotic distributions and the finite-sample one are observed for and in all figures, although their sizes are relatively small even in such extreme case.
References
- Agarwal, A., Negahban, S., and Wainwright, M. J. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. Annals of Statistics, 40(2):1171–1197.
- Bai, Z., Jiang, D., Yao, J.-F., and Zheng, S. (2009). Corrections to LRT on large-dimensional covariance matrix by RMT. Annals of Statistics, 37(6B):3822–3840.
- Bai, Z. and Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random Matrices. New York, NY: Springer Science+Business Media, LLC.
- Bickel, P. J. and Levina, E. (2004). Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, pages 989–1010.
- Bodnar, T., Gupta, A., and Parolya, N. (2016). Direct shrinkage estimation of large dimensional precision matrix. Journal of Multivariate Analysis, 146:223–236.
- Bodnar, T., Gupta, A. K., and Parolya, N. (2014). On the strong convergence of the optimal linear shrinkage estimator for large dimensional covariance matrix. Journal of Multivariate Analysis, 132:215–228.
- Bodnar, T., Mazur, S., and Okhrin, Y. (2017a). Bayesian estimation of the global minimum variance portfolio. European Journal of Operational Research, 256:292–307.
- Bodnar, T., Mazur, S., and Parolya, N. (2017b). Central limit theorems for functionals of large dimensional sample covariance matrix and mean vector in matrix-variate skewed model. Scandinavian Journal of Statistics, under revision.
