A note on MLE of covariance matrix
Ming-Tien Tsai

TL;DR
This paper investigates the maximum likelihood estimator of covariance matrices in multivariate normal models, showing how different decompositions affect risk and proposing new loss functions to resolve Stein's paradox.
Contribution
It introduces a new class of loss functions that equalize risks across different matrix decompositions, addressing longstanding issues in covariance estimation.
Findings
Full Iwasawa decomposition reduces risk differences
New loss functions eliminate Stein's paradox
MLE risks are equalized across decompositions
Abstract
For a multivariate normal set up, it is well known that the maximum likelihood estimator of covariance matrix is neither admissible nor minimax under the Stein loss function. For the past six decades, a bunch of researches have followed along this line for Stein's phenomenon in the literature. In this note, the results are two folds: Firstly, with respect to Stein type loss function we use the full Iwasawa decomposition to enhance the unpleasant phenomenon that the minimum risks of maximum likelihood estimators for the different coordinate systems (Cholesky decomposition and full Iwasawa decomposition) are different. Secondly, we introduce a new class of loss functions to show that the minimum risks of maximum likelihood estimators for the different coordinate systems, the Cholesky decomposition and the full Iwasawa decomposition, are of the same, and hence the Stein's paradox…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRandom Matrices and Applications · Advanced Algebra and Geometry · Advanced Combinatorial Mathematics
A note on MLE of covariance matrix
Ming-Tien Tsai
Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
Summary: For a multivariate normal set up, it is well known that the maximum likelihood estimator of covariance matrix is neither admissible nor minimax under the Stein loss function. For the past six decades, a bunch of researches have followed along this line for Stein’s phenomenon in the literature. In this note, the results are two folds: Firstly, with respect to Stein type loss function we use the full Iwasawa decomposition to enhance the unpleasant phenomenon that the minimum risks of maximum likelihood estimators for the different coordinate systems (Cholesky decomposition and full Iwasawa decomposition) are different. Secondly, we introduce a new class of loss functions to show that the minimum risks of maximum likelihood estimators for the different coordinate systems, the Cholesky decomposition and the full Iwasawa decomposition, are of the same, and hence the Stein’s paradox disappears.
Keywords: Geodesic distance; Iwasawa decomposition; minimax estimator
1. Introduction
Let be independent -dimensional random vectors with a common multivariate normal distribution N_{p}(\bf 0,{\mbox{\boldmath\Sigma}}). A basic problem considered in the literature is the estimation of the covariance matrix which is unknown and assumed to be nonsingular. It is also assumed that , as such the sufficient statistic
[TABLE]
is positive definite with probability one. In the literature, the estimators of are the functions of . The sample space , the parameter space and the action space are taken to be the set of positive definite matrices. Note that has a Wishart distribution W({\mbox{\boldmath\Sigma}},n) and the maximum likelihood estimator of
[TABLE]
which is unbiased. The general linear group acts on the spaces , and . Generally, we consider the invariance loss function , namely, satisfies the condition that L(g\phi({\bf A})g^{{}^{\prime}},g{\mbox{\boldmath\Sigma}}g^{{}^{\prime}})=L(\phi({\bf A}),{\mbox{\boldmath\Sigma}}) for all . One of the most interesting examples was introduced by Stein (see Jame and Stein, 1961),
[TABLE]
where tr and det denote the trace and the determinant of a matrix, respectively. Because acts transitively on the space , so the best equivalent estimator exists. The minimum risk for the estimator \hat{\mbox{\boldmath\Sigma}} is
[TABLE]
where denotes the expectation of random variable of . It can be easily seen that the maximum likelihood estimator is the best equivalent estimator.
Since the general linear group is not a solvable group, hence relax the condition a little bit by considering the group of lower triangular matrices with positive diagonal elements , the loss function is also invariant under . Using the Cholesky decomposition, we may write , where . Since acts transitively on the space , the best equivalent estimator was established by Stein (see James and Stein, 1961) in the following
[TABLE]
where is a positive diagonal matrix with elements . The minimum risk for the estimator \hat{\mbox{\boldmath\Sigma}}_{S} is
[TABLE]
Since the group is solvable, it follows from results in Kiefer (1957) that the estimator \hat{\mbox{\boldmath\Sigma}}_{JS} is minimax.
In the literature, there are many developements along this approach and its ramplifications, we may refer to the book of Anderson (2003) or the book of Muirhead (1982), and the references cited there, hence we omit the details. With respect to Stein loss function, we use the full Iwasawa decomposition (Terras, 1988) to enchance the Stein’s phenomenon.
2. The full Iwasawa decomposition
The Cholesky decomposition can be viewed as a partial Iwasawa decomposition. We would like to relax the conditions more by considering the full Iwasawa decomposition in this section. Some more notations are needed. Partition {\mbox{\boldmath\Sigma}}_{(k)} and as
[TABLE]
for all with {\mbox{\boldmath\Sigma}}_{(1)}={\mbox{\boldmath\Sigma}} and , also define
[TABLE]
and
[TABLE]
Let
[TABLE]
Then we have
[TABLE]
and
[TABLE]
Let
[TABLE]
By using the full Iwasawa decomposition, we can eventually transform and into the diagonal matrices {\mbox{\boldmath\Sigma}}^{*} and , respectively. Thus we establish the one-to-one correspondences: {\mbox{\boldmath\Sigma}}\leftrightarrow{\mbox{\boldmath\Sigma}}^{*} and . By the properties of Wishart distribution (see Theorem 4.3.4, Theorem 7.3.4 and Theorem 7.3.6 of Anderson, 2003), it is easy to note that are independent random variables with degrees of freeedom respectively. Consider the loss function similar to the equation (3)
[TABLE]
where is a positive diagonal matrix, not depending on .
Theorem 1. With respect to the likelihood loss function (Stein type loss function), the best estimator invariant with respect to one-to-one transformations {\mbox{\boldmath\Sigma}}\to{\mbox{\boldmath\Sigma}}^{*},{\bf A}\to{\bf A}^{*}, is \hat{\mbox{\boldmath\Sigma}}^{*}_{I}={\bf D}^{-1}_{0}{\bf A}^{*}. The minimum risk is , and the estimator is minimax.
Proof. Take {\mbox{\boldmath\Sigma}}={\bf I}, and then note that
[TABLE]
The minimum of (15) occurs at . Since also acts transitively on the space , so the best equivalent estimator exists, which is of the form
[TABLE]
where is a diagonal matrix with elements . Thus the minimum risk for the estimator \hat{\mbox{\boldmath\Sigma}}^{*}_{I} is
[TABLE]
Since the group of positive diagonal matrices is a subset of , which is solvable, thus the group of positive diagonal matrices is also solvable. And hence, by the results of Kiefer (1957) the estimator \hat{\mbox{\boldmath\Sigma}}^{*}_{I} is minimax.
Compare equations (4), (6) and (17), then it is easily to see that
[TABLE]
for . The equality in (18) holds when \hat{\mbox{\boldmath\Sigma}}^{*}_{I}=\hat{\mbox{\boldmath\Sigma}}_{S}=\hat{\mbox{\boldmath\Sigma}}: namely, (i) the components are independent or (ii) as the sample size . With respect to the Stein loss function, the minimum risk functions are different based on the full Iwasawa decomposition and based on the Cholesky decomposition.
We may note that each is the maximum likelihood estimator of , and is unbiased, for all . For each component, is admissible for . Note that saying is the maximum likelihood estimator of is the same as to say that is the maximum likelihood estimator of {\mbox{\boldmath\Sigma}}^{*}. Thus, the results of the equation (18) lead to a paradox that the property of maximum likelihood estimators for the different coordinate systems is not consistent with respect to the Stein type loss function. This motives us to further study whether a suitable loss function exists so that the maximum likelihood estimators can be invariant under reparameterizations.
3. The geodesic distance
Since the space of positive definite symmetric matrices is a non-Euclidean space, it is more natural to use a metric on a Riemannian metric space. The Riemannian metric can be defined with the help of the fundamental form where denotes the matrix of differentials. This is invariant under the transformation , where is any fixed elements of . Let be the set of square symmetric positive definite matrices, this set is a representation space of the group . An element operates on according to . On , any maximal compact subgroup of can be represented in the form . The set of cosets can be considered as a representation space of . Since , thus the all maximal compact subgroup are conjugate. Conjugate subgroups yield equivalent representation spaces, hence it is sufficiently enough to only consider the orthogonal group as the maximal compact group of . The map establishes an equivalence between the representation spaces and of . And hence the defines an invariant metric on . The tangent space to is , the vector space of skew-symmetric matrices. Thus . And hence .
Proposition 1. A geodesic segment through and in has the form: , where has the spectral decomposition: , for and The length of the geodesic segment is
The proof of Proposition 1 can be found in the book of Terras (1988). For any given two points \mbox{\boldmath\Sigma}_{1} and of , the geodesic distance is defined to be , where are the zeros of charactistic polynomial \mbox{det}(\lambda{\mbox{\boldmath\Sigma}}-{\mbox{\boldmath\Sigma}}_{1}). A loss function is invariant iff can be written as a function of the eigenvalues of . Hence from geometric point of view, with the help of Proposition 1 we may naturally consider the compatible, coordinate-invariant loss function L_{G}(\phi({\bf A}),{\mbox{\boldmath\Sigma}})=\sum_{i=1}^{p}\mbox{log}^{2}\lambda_{i}, where denote the eigenvalues of {\mbox{\boldmath\Sigma}}^{-1}\phi({\bf A}). And the risk function is denoted by R_{G}(\hat{\mbox{\boldmath\Sigma}}_{G},{{\mbox{\boldmath\Sigma}}})(={\mathcal{E}}[L_{G}(\phi({\bf A}),{\mbox{\boldmath\Sigma}})]), where \hat{\mbox{\boldmath\Sigma}}_{G} be the corresponding estimator based on the geodesic distance loss function on . Without loss the generality we may take {\mbox{\boldmath\Sigma}}={\bf I} as Stein did. Write . The following studies are comparable to the results in Section 7.8. of Anderson (2003). Let denote the variance of random variable .
Theorem 2. With respect to the geodesic distance loss function on , , where denotes the -th largest eigenvalue of and is a positive constant, , let . Then on the set the minimum of risk function R_{G}(\hat{\mbox{\boldmath\Sigma}}_{G},{\bf I}) occurs at , and its minimum risk is .
Proof. Note that implies that , which is the same as the conditions that . By Jensen inequality, we have that . Let be the point so that . Then is the critical point for the risk function R_{G}(\hat{\mbox{\boldmath\Sigma}}_{G},{\bf I}).
Moreover, we may note that . By the Jensen inequality, we may note that . Then, on the set , we have , and . Thus the risk function R_{G}(\hat{\mbox{\boldmath\Sigma}}_{G},{\bf I}) is convex and has an unique minimum on the set . Since the set is a connected set, and , and hence, the theorem follows.
Theorem 2 provides us a new class of loss functions for statistical inference. Although Theorem 2 looks simple mathematically, it provides us an intrinsicaly new approach to make statistical inference. With respect to the geodesic distance loss function, we may see that MLE is invariant under different parameterizations, and more importantly, the Stein’s paradox disappears. Those results are tremendously different from what have existed in the literature with respect to Stein type loss functions.
To obtain the best equivalent estimator with respect to geodesic distance loss function, the quantities , which can be viewed as the geometric means of the distributions, are needed to be evaluated. Some often seen cases are illustrated in the followings:
I. The full Iwasawa decomposition form. Based on the one-to-one transformation: {\mbox{\boldmath\Sigma}}\leftrightarrow{\mbox{\boldmath\Sigma}}^{*} and , as such using the geodesic distance on , namely, to use the loss function , where are the eigenvalues of . Thus, . Thus, by Theorem 2 the minimum of it occurs at . Thus the geometric estimator \hat{\mbox{\boldmath\Sigma}}^{*}_{GI} of {\mbox{\boldmath\Sigma}}^{*} is that \hat{\mbox{\boldmath\Sigma}}^{*}_{GI}={\bf D}^{*}{\bf A}^{*}, where is a diagonal matrix with elements and in (13) . Since is a diagonal matrix, the group of positive diagonal matrices is solvable, and hence by the results of Kiefer (1957) this geometric estimator \hat{\mbox{\boldmath\Sigma}}^{*}_{GI} is minimax. And its minimum risk is R_{G}(\hat{\mbox{\boldmath\Sigma}}^{*}_{GI},{\bf I})=\sum_{i=1}^{p}{\mathcal{V}ar}[\mbox{log}{\chi}^{2}_{n-i+1}]. Also note that R_{G}(\hat{\mbox{\boldmath\Sigma}}^{*}_{I},{\bf I})=R_{G}(\hat{\mbox{\boldmath\Sigma}}^{*}_{GI},{\bf I})+\sum_{i=1}^{p}\{\mbox{log}(n-i+1)-{\mathcal{E}}[\mbox{log}{\chi}^{2}_{n-i+1}]\}^{2},i.e.,R_{G}(\hat{\mbox{\boldmath\Sigma}}^{*}_{GI},{\bf I})<R_{G}(\hat{\mbox{\boldmath\Sigma}}^{*}_{I},{\bf I}), where \hat{\mbox{\boldmath\Sigma}}^{*}_{I} is defined in (16). Thus we may conclude that the Stein estimator \hat{\mbox{\boldmath\Sigma}}^{*}_{I} is inadmissible with respect to geodesic distance loss function.
II. The Cholesky decomposition form, can be viewed as a partial Iwasawa decomposition. Riemannian geometry yields an invariant volume element on , it is of the form , where with being the Lebesgue measure on . This is the invariant -form on . Normalization is not necessary because this invariant -form is not a probability measure. Note that this -form is still invariant under . With the Cholesky decomposition and differentiate at . Thus , and for . Then the -form becomes to , where denotes the wedge product. Let denote and write . Similarly, we do the Cholesky decomposition for the scale parameter as {\mbox{\boldmath\Sigma}}={\mbox{\boldmath\Theta}}{\mbox{\boldmath\Theta}}^{{}^{\prime}}, where {\mbox{\boldmath\Theta}}=((\theta_{ij})) is a lower triangular matrix. Take {\mbox{\boldmath\Sigma}}={\bf I}, the Wishart density of then becomes to the following density function
[TABLE]
where . This is essentially called the Bartlett decomposition in the literature. Note that is a lower triangular matrix which having the eigenvalues . With respect to the geodesic distance loss function, the density function (19) can be further reduced to the product of the density with degrees of freedom, . Thus, the risk function of geodesic distance loss function is . By Theorem 2, the minimum of it occurs at . Let , and write , where if and . We may note that is the unbiased MLE of . Thus, the best equivalent estimator is of the form \hat{\mbox{\boldmath\Sigma}}_{GC}={\bf T}_{0}{\bf T}^{{}^{\prime}}_{0} and its minimum risk is R_{G}(\hat{\mbox{\boldmath\Sigma}}_{GC},{\bf I})=\sum_{i=1}^{p}{\mathcal{V}ar}[\mbox{log}{\chi}^{2}_{n-i+1}]. Since the group is solvable, it follows from results in Kiefer (1957) that the estimator \hat{\mbox{\boldmath\Sigma}}_{GC} is minimax with respect to geodesic distance loss function.
This minimum risk is equivalent to that in Example 1, which indicates that with respect to the geodesic distance loss function on the space of positive definite symmetric matrices the minimum risks of maximum likelihood estimators with the different coordinate systems, the Cholesky decomposition and the full Iwasawa decomposition, are of the same. These results are quite different from what have existed in the literature based on the Stein type loss function, see Section 2 for the details. Moreover, note that R_{G}(\hat{\mbox{\boldmath\Sigma}}_{S},{\bf I})=\sum_{i=1}^{p}{\mathcal{E}}[\mbox{log}^{2}({\chi}^{2}_{n-i+1}/n+p-2i+1)], where \hat{\mbox{\boldmath\Sigma}}_{S} is defined in (5). It is easy to see that R_{G}(\hat{\mbox{\boldmath\Sigma}}_{S},{\bf I})=R_{G}(\hat{\mbox{\boldmath\Sigma}}_{GC},{\bf I})+\sum_{i=1}^{p}\{\mbox{log}(n+p-2i+1)-{\mathcal{E}}[\mbox{log}{\chi}^{2}_{n-i+1}]\}^{2},i.e.,R_{G}(\hat{\mbox{\boldmath\Sigma}}_{GC},{\bf I})<R_{G}(\hat{\mbox{\boldmath\Sigma}}_{S},{\bf I}). Thus we may conclude that the Stein estimator \hat{\mbox{\boldmath\Sigma}}_{S} is inadmissible with respect to geodesic distance loss function.
III. The orthogonal decomposition form. Stein (1956) considered the rotation-equivariant estimator of . The class of rotation-equivariant estimators of covariance matrix is constituted of all the estimators that have the same eigenvectors as the sample covariance matrix. Let denotes the -th largest eigenvalue of {\mbox{\boldmath\Sigma}}^{-1}{\bf A}. Also write , where is a diagonal matrix with eigenvalues and being the corresponding orthogonal matrix. Similarly, write {\mbox{\boldmath\Sigma}}={\bf H}{\mbox{\boldmath\Gamma}}{\bf H}^{{}^{\prime}}, where is a diagonal matrix with eigenvalues and being the corresponding orthogonal matrix. Take {\mbox{\boldmath\Sigma}}={\bf I}, thus for the class of rotation-equivariant estimators the minimum risk function based on the geodesic distance loss function is , and its minimium occurs at . Thus, the best rotation-equivariant estimator is of the form \hat{\mbox{\boldmath\Sigma}}_{GO}={\bf U}{\bf D}^{*}{\bf L}{\bf U}^{{}^{\prime}}, where is a diagonal matrix with elements . On the other hand, the risk function based on the Stein loss function is The minimum of it occurs at . Thus, with respect to Stein loss function the best rotation-equivariant estimator for is of the form \hat{\mbox{\boldmath\Sigma}}_{O}={\bf U}{\bf D}^{*}_{0}{\bf L}{\bf U}^{{}^{\prime}}, where is a diagonal matrix with elements . Similarly, we may also note that R_{G}(\hat{\mbox{\boldmath\Sigma}}_{O},{\bf I})=R_{G}(\hat{\mbox{\boldmath\Sigma}}_{GO},{\bf I})+\sum_{i=1}^{p}\{\mbox{log}{\mathcal{E}}[l_{i}]-{\mathcal{E}}[\mbox{log}{l_{i}}]\}^{2},i.e.,R_{G}(\hat{\mbox{\boldmath\Sigma}}_{GO},{\bf I})<R_{G}(\hat{\mbox{\boldmath\Sigma}}_{O},{\bf I}). Thus, under the orthogonal decomposition the Stein type estimator \hat{\mbox{\boldmath\Sigma}}_{O} is inadmissible with respect to geodesic distance loss function.
Via the result of Askey (1980), the joint density function of eigenvalues is of the form
[TABLE]
It is easy to see that . However, we have difficulty to find out the explicit form for the marginal density function of sample eigenvalue , and open this type Selberg integral as a conjecture.
4. General remarks
Entropy (expectation of likelihood loss function) not only plays an important role in information theory, but also is a core in statistical theory. For the past six decades, quadratic loss function and likelihood loss function are oftenly adopted to study the Stein’s phenomenon for the covariance matrix estimation, for the details see the Section 7.8. of Anderson (2003) or the Section 4.3. of Muirhead (1982). Via the full Iwasawa decomposition, Theorem 1 tells us that the likelihood (Stein type) loss function is not invariant to arbitrary reparameterizations of . To overcome the drawbacks, Riemannian metric is a natural way to be adopted as a loss function for a non-Eculidean space . Due to the diffeomorphism invarinace of risk function based on the geodesic distance, we may anticipate that the minimum risks of the MLEs may not only be invariant to reparameterizations but also the Stein paradox disappear with respect to geodesic distance loss function. Examples 1 and 2 tell us that the minimum risks of the MLEs of covariance matrices under the different coordinate systems, the Cholesky decomposition and the full Iwasawa decomposition, are of the same with respect to the geodesic distance loss function. Examples 1 and 2 also indicate that the MLE of covariance matrix is minimax with respect to the geodesic distance loss function on . This note will inevitably have an impact on statistical inference because the statisticans have to reconsider the adoptation of the MLE of covariance matrix, which, however, had been constantly warned not to use for a long time since Stein’s phenonmenon occurred.
If the quantity in Theorem 2 can be approximated by the quantity , , then we have that \hat{\mbox{\boldmath\Sigma}}^{*}_{GI}=\hat{\mbox{\boldmath\Sigma}}^{*}_{I} and \hat{\mbox{\boldmath\Sigma}}_{GO}=\hat{\mbox{\boldmath\Sigma}}_{O} approximatly. Moreover, R_{G}(\hat{\mbox{\boldmath\Sigma}}^{*}_{GI},{\bf I})=R_{G}(\hat{\mbox{\boldmath\Sigma}}^{*}_{I},{\bf I})=R_{G}(\hat{\mbox{\boldmath\Sigma}}_{GC},{\bf I})<R_{G}(\hat{\mbox{\boldmath\Sigma}}_{S},{\bf I}) approximatly. We omit the details.
Example 3 points out the fact that based on the orthogonal decomposition, the analytical difficulty to find out the marginal expectations will be accompanied with the orthogonal decomposition due to the Selberg type integral which involves the Vandermonde determinant. Similar difficulty will occur when to obtain the density functions of and (for details see Edelman, 1989). When such that , for a Wishart matrix Edeiman (1989) proved that the geometric means of and , respectively.
With respect to Stein loss function, Stein (1956) made statistical inference focused on the special case \mbox{\boldmath\Sigma}={\bf I}, it is sufficient enough due to the invariance consideration. For the main purpose of focusing on statistical inference in this note, we adopt the same structure as Stein did. However, this will scarifice the developement of distribution theory of arbitrary covariance . For this purpose, the elegant zonal polynomials have been incorporated, we may refer the book of Muirhead (1982) for the details.
The invariance nature of geodesic distance loss function suggests that it should be the way to deal with state-of-the-art covariance matrix estimators for the interesting and timely large dimensional case. For the large dimensional case, the sample size is required to be the same order of dimension. The emperical density of eigenvalues of Wishart matrix converges to the Marchenko-Paster law in the limit when such that is fixed The geometric mean of this distribution is , where .
References
Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd ed.. Wiley, New York. 2. 2.
Askey, R. (1980). Some basic hypergeometric extensions of integrals of Selberg and Andrews. SIAM, J. Math. Anal., 11, 938-951. 3. 3.
Edelman, A. (1989). Eigenvalues and condition numbers of random matrices. Ph.D thesis, Massachusetts institute of technology. 4. 4.
James, W. and Stein, C. (1961). Estimation with quadratic loss. Proc. Fourth Berleley Symp. Math. Statist. Probab., 1, 361-379. California Press, Berkeley, CA. 5. 5.
Kiefer, J. (1957). Invariance, minimax sequential estimation, and continuous time processes. Ann. Math. Statist., 28, 573-601. 6. 6.
Muirhead, R. J. (1982). Aspects of Multivariate Statistical Theory. Wiley, New York. 7. 7.
Terras, A. (1988). Harmonic Analysis on Symmetric Spaces and Applications. II. Springer, Berlin.
