Unconstrained representation of orthogonal matrices with application to   common principle components

Luca Bagnato; Antonio Punzo

arXiv:1906.00587·stat.ME·June 4, 2019

Unconstrained representation of orthogonal matrices with application to common principle components

Luca Bagnato, Antonio Punzo

PDF

Open Access

TL;DR

This paper introduces the PLR decomposition, a simple method to unconstrain orthogonal matrices, enabling easier optimization, and demonstrates its application to robust common principal components analysis with leptokurtic-normal distributions.

Contribution

The paper proposes the PLR decomposition, allowing unconstrained optimization of orthogonal matrices, and applies it to robust CPCA with leptokurtic-normal distributions.

Findings

01

PLR decomposition simplifies orthogonal matrix estimation.

02

Application to leptokurtic-normal CPCA improves robustness.

03

Demonstrated effectiveness on biometric data analyses.

Abstract

Many statistical problems involve the estimation of a $(d \times d)$ orthogonal matrix $Q$ . Such an estimation is often challenging due to the orthonormality constraints on $Q$ . To cope with this problem, we propose a very simple decomposition for orthogonal matrices which we abbreviate as PLR decomposition. It produces a one-to-one correspondence between $Q$ and a $(d \times d)$ unit lower triangular matrix $L$ whose $d (d - 1) /2$ entries below the diagonal are unconstrained real values. Once the decomposition is applied, regardless of the objective function under consideration, we can use any classical unconstrained optimization method to find the minimum (or maximum) of the objective function with respect to $L$ . For illustrative purposes, we apply the PLR decomposition in common principle components analysis…

Tables8

Table 1. Table 1: ML estimated parameters of the competing models for the microtus data. Eigenvalues are multiplied by 100.

N-CPC	LN-CPC	N-PC	LN-PC
$𝝁_{1} = [\begin{matrix} 8.566 \\ 6.149 \end{matrix}]$	$𝝁_{1} = [\begin{matrix} 8.563 \\ 6.148 \end{matrix}]$	$𝝁_{1} = [\begin{matrix} 8.566 \\ 6.149 \end{matrix}]$	$𝝁_{1} = [\begin{matrix} 8.563 \\ 6.148 \end{matrix}]$
$𝝁_{2} = [\begin{matrix} 8.476 \\ 6.056 \end{matrix}]$	$𝝁_{2} = [\begin{matrix} 8.478 \\ 6.056 \end{matrix}]$	$𝝁_{2} = [\begin{matrix} 8.476 \\ 6.056 \end{matrix}]$	$𝝁_{2} = [\begin{matrix} 8.478 \\ 6.056 \end{matrix}]$
$𝑸 = [\begin{matrix} - 0.850 & - 0.527 \\ - 0.527 & 0.850 \end{matrix}]$	$𝑸 = [\begin{matrix} - 0.886 & - 0.463 \\ - 0.463 & 0.886 \end{matrix}]$	$𝑸_{1} = [\begin{matrix} - 0.766 & - 0.643 \\ - 0.643 & 0.766 \end{matrix}]$	$𝑸_{1} = [\begin{matrix} - 0.851 & - 0.525 \\ - 0.525 & 0.851 \end{matrix}]$
		$𝑸_{2} = [\begin{matrix} - 0.908 & - 0.420 \\ - 0.420 & 0.908 \end{matrix}]$	$𝑸_{2} = [\begin{matrix} - 0.918 & - 0.395 \\ - 0.395 & 0.918 \end{matrix}]$
$𝚲_{1} = [\begin{matrix} 0.408 & 0 \\ 0 & 0.140 \end{matrix}]$	$𝚲_{1} = [\begin{matrix} 0.437 & 0 \\ 0 & 0.131 \end{matrix}]$	$𝚲_{1} = [\begin{matrix} 0.413 & 0 \\ 0 & 0.134 \end{matrix}]$	$𝚲_{1} = [\begin{matrix} 0.432 & 0 \\ 0 & 0.130 \end{matrix}]$
$𝚲_{2} = [\begin{matrix} 0.378 & 0 \\ 0 & 0.123 \end{matrix}]$	$𝚲_{2} = [\begin{matrix} 0.376 & 0 \\ 0 & 0.121 \end{matrix}]$	$𝚲_{2} = [\begin{matrix} 0.382 & 0 \\ 0 & 0.119 \end{matrix}]$	$𝚲_{2} = [\begin{matrix} 0.378 & 0 \\ 0 & 0.122 \end{matrix}]$
	$β_{1} = 4.753$		$β_{1} = 4.260$
	$β_{2} = 1.778$		$β_{2} = 1.870$

Table 2. Table 2: Model comparison, for the microtus data, in terms of number of parameters ( m 𝑚 m ), log-likelihood, AIC, and BIC (on the left) and p 𝑝 p -values from the LR tests (on the right).

	$m$	Log-likelihood	AIC	BIC
N-CPC	9	289.319	560.639	538.241
LN-CPC	11	293.946	565.891	538.516
N-PC	10	290.416	560.831	535.945
LN-PC	12	294.220	564.439	534.575

Table 3. (a) Number of parameters, log-likelihood, AIC, and BIC

	$m$	Log-likelihood	AIC	BIC
N-CPC	9	289.319	560.639	538.241
LN-CPC	11	293.946	565.891	538.516
N-PC	10	290.416	560.831	535.945
LN-PC	12	294.220	564.439	534.575

Table 4. (b) p 𝑝 p -values from the LR tests

	LN-CPC	N-PC	LN-PC
N-CPC	0.010	0.139	0.020
LN-CPC	–	–	0.459
N-PC	–	–	0.022

Table 5. Table 3: ML estimated parameters of the competing models for the swiss.soldiers data. Eigenvalues are multiplied by 100.

N-CPC	LN-CPC	N-PC	LN-PC
$𝝁_{1} = [\begin{matrix} 4.721 \\ 4.810 \\ 4.931 \end{matrix}]$	$𝝁_{1} = [\begin{matrix} 4.718 \\ 4.807 \\ 4.929 \end{matrix}]$	$𝝁_{1} = [\begin{matrix} 4.721 \\ 4.810 \\ 4.931 \end{matrix}]$	$𝝁_{1} = [\begin{matrix} 4.719 \\ 4.808 \\ 4.931 \end{matrix}]$
$𝝁_{2} = [\begin{matrix} 4.714 \\ 4.792 \\ 4.913 \end{matrix}]$	$𝝁_{2} = [\begin{matrix} 4.722 \\ 4.796 \\ 4.917 \end{matrix}]$	$𝝁_{2} = [\begin{matrix} 4.714 \\ 4.792 \\ 4.913 \end{matrix}]$	$𝝁_{2} = [\begin{matrix} 4.721 \\ 4.795 \\ 4.916 \end{matrix}]$
$𝑸 = [\begin{matrix} - 0.783 & 0.546 & - 0.298 \\ - 0.464 & - 0.832 & - 0.304 \\ - 0.414 & - 0.100 & 0.905 \end{matrix}]$	$𝑸 = [\begin{matrix} - 0.750 & 0.559 & - 0.353 \\ - 0.501 & - 0.829 & - 0.247 \\ - 0.431 & - 0.009 & 0.902 \end{matrix}]$	$𝑸_{1} = [\begin{matrix} - 0.647 & 0.230 & - 0.727 \\ - 0.574 & - 0.775 & 0.266 \\ - 0.502 & 0.589 & 0.633 \end{matrix}]$	$𝑸_{1} = [\begin{matrix} - 0.634 & 0.290 & - 0.717 \\ - 0.566 & - 0.806 & 0.174 \\ - 0.527 & 0.516 & 0.675 \end{matrix}]$
		$𝑸_{2} = [\begin{matrix} - 0.796 & 0.531 & - 0.291 \\ - 0.448 & - 0.840 & - 0.305 \\ - 0.407 & - 0.113 & 0.907 \end{matrix}]$	$𝑸_{2} = [\begin{matrix} - 0.765 & 0.554 & - 0.327 \\ - 0.486 & - 0.831 & - 0.271 \\ - 0.422 & - 0.049 & 0.905 \end{matrix}]$
$𝚲_{1} = [\begin{matrix} 0.282 & 0 & 0 \\ 0 & 0.135 & 0 \\ 0 & 0 & 0.130 \end{matrix}]$	$𝚲_{1} = [\begin{matrix} 0.303 & 0 & 0 \\ 0 & 0.149 & 0 \\ 0 & 0 & 0.128 \end{matrix}]$	$𝚲_{1} = [\begin{matrix} 0.289 & 0 & 0 \\ 0 & 0.153 & 0 \\ 0 & 0 & 0.105 \end{matrix}]$	$𝚲_{1} = [\begin{matrix} 0.304 & 0 & 0 \\ 0 & 0.161 & 0 \\ 0 & 0 & 0.110 \end{matrix}]$
$𝚲_{2} = [\begin{matrix} 0.867 & 0 & 0 \\ 0 & 0.269 & 0 \\ 0 & 0 & 0.153 \end{matrix}]$	$𝚲_{2} = [\begin{matrix} 0.925 & 0 & 0 \\ 0 & 0.303 & 0 \\ 0 & 0 & 0.166 \end{matrix}]$	$𝚲_{2} = [\begin{matrix} 0.868 & 0 & 0 \\ 0 & 0.269 & 0 \\ 0 & 0 & 0.153 \end{matrix}]$	$𝚲_{2} = [\begin{matrix} 0.916 & 0 & 0 \\ 0 & 0.295 & 0 \\ 0 & 0 & 0.165 \end{matrix}]$
	$β_{1} = 1.813$		$β_{1} = 1.778$
	$β_{2} = 5.601$		$β_{2} = 5.331$

Table 6. Table 4: Model comparison, for the swiss.soldiers data, in terms of number of parameters ( m 𝑚 m ), log-likelihood, AIC, and BIC (on the left) and p 𝑝 p -values from the LR tests (on the right).

	$m$ par.	Log-likelihood	AIC	BIC
N-CPC	15	1176.166	2322.333	2268.980
LN-CPC	17	1190.986	2347.973	2287.506
N-PC	19	1178.091	2320.182	2256.159
LN-PC	20	1192.435	2344.869	2273.733

Table 7. (a) Number of parameters, log-likelihood, AIC, and BIC

	$m$ par.	Log-likelihood	AIC	BIC
N-CPC	15	1176.166	2322.333	2268.980
LN-CPC	17	1190.986	2347.973	2287.506
N-PC	19	1178.091	2320.182	2256.159
LN-PC	20	1192.435	2344.869	2273.733

Table 8. (b) p 𝑝 p -values from the LR tests

	LN-CPC	N-PC	LN-PC
N-CPC	0.000	0.278	0.000
LN-CPC	–	–	0.408
N-PC	–	–	0.000

Equations27

Q = P L R^{- 1},

Q = P L R^{- 1},

Q = P L U .

Q = P L U .

Σ_{j} = Q Λ_{j} Q^{'}, j = 1, \dots, k .

Σ_{j} = Q Λ_{j} Q^{'}, j = 1, \dots, k .

l_{N-CPC} (Ψ) = C - \frac{1}{2} j = 1 \sum k n_{j} [h = 1 \sum d ln λ_{j h} + \mbox t r (Q Λ_{j}^{- 1} Q^{'} S_{j})],

l_{N-CPC} (Ψ) = C - \frac{1}{2} j = 1 \sum k n_{j} [h = 1 \sum d ln λ_{j h} + \mbox t r (Q Λ_{j}^{- 1} Q^{'} S_{j})],

f_{LN} (x; μ, Σ, β) = q [(x - μ)^{'} Σ^{- 1} (x - μ); β] f_{N} (x; μ, Σ), x \in I R^{d},

f_{LN} (x; μ, Σ, β) = q [(x - μ)^{'} Σ^{- 1} (x - μ); β] f_{N} (x; μ, Σ), x \in I R^{d},

q (y; β) = 1 + \frac{β}{8 d ( d + 2 )} [y^{2} - 2 (d + 2) y + d (d + 2)], y \geq 0.

q (y; β) = 1 + \frac{β}{8 d ( d + 2 )} [y^{2} - 2 (d + 2) y + d (d + 2)], y \geq 0.

l_{LN-CPC} (Ψ) =

l_{LN-CPC} (Ψ) =

+ j = 1 \sum k i = 1 \sum n_{j} ln {1 + \frac{β _{j}}{8 d ( d + 2 )} [\mbox t r (Q Λ_{j}^{- 1} Q^{'} S_{ij})^{2} - 2 (d + 2) \mbox t r (Q Λ_{j}^{- 1} Q^{'} S_{ij}) + d (d + 2)]},

Q = P L R^{- 1},

Q = P L R^{- 1},

L = P^{'} Q R,

L = P^{'} Q R,

λ_{j h} = exp (λ_{j h}),

λ_{j h} = exp (λ_{j h}),

λ_{j h} = lo g (λ_{j h}) .

λ_{j h} = lo g (λ_{j h}) .

β_{j} = β_{m a x} \frac{exp ( β _{j} )}{1 + exp ( β _{j} )},

β_{j} = β_{m a x} \frac{exp ( β _{j} )}{1 + exp ( β _{j} )},

β_{j} = lo g (\frac{β _{j}}{β _{m a x} - β _{j}}) .

β_{j} = lo g (\frac{β _{j}}{β _{m a x} - β _{j}}) .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSpectroscopy and Chemometric Analyses · Advanced Statistical Methods and Models · Sensory Analysis and Statistical Methods

Full text

Unconstrained representation of orthogonal matrices with application to common principle components

Luca Bagnato Antonio Punzo

Università Cattolica del Sacro Cuore, Dipartimento di Scienze Economiche e Sociali, Via Emilia Parmense, 29122 Piacenza, Italy, e.mail: [email protected] Università di Catania, Dipartimento di Economia e Impresa, Corso Italia 55, 95129, Catania, Italy, e.mail: [email protected]

Abstract

Many statistical problems involve the estimation of a $\left(d\times d\right)$ orthogonal matrix $\boldsymbol{Q}$ . Such an estimation is often challenging due to the orthonormality constraints on $\boldsymbol{Q}$ . To cope with this problem, we propose a very simple decomposition for orthogonal matrices which we abbreviate as PLR decomposition. It produces a one-to-one correspondence between $\boldsymbol{Q}$ and a $\left(d\times d\right)$ unit lower triangular matrix $\boldsymbol{L}$ whose $d\left(d-1\right)/2$ entries below the diagonal are unconstrained real values. Once the decomposition is applied, regardless of the objective function under consideration, we can use any classical unconstrained optimization method to find the minimum (or maximum) of the objective function with respect to $\boldsymbol{L}$ . For illustrative purposes, we apply the PLR decomposition in common principle components analysis (CPCA) for the maximum likelihood estimation of the common orthogonal matrix when a multivariate leptokurtic-normal distribution is assumed in each group. Compared to the commonly used normal distribution, the leptokurtic-normal has an additional parameter governing the excess kurtosis; this makes the estimation of $\boldsymbol{Q}$ in CPCA more robust against mild outliers. The usefulness of the PLR decomposition in leptokurtic-normal CPCA is illustrated by two biometric data analyses.

Key words: Orthogonal matrix, LU decomposition, QR decomposition, Common principal components, FG algorithm, Leptokurtic-normal distribution.

1 Introduction

With the term orthogonal matrix we refer to a $\left(d\times d\right)$ matrix $\boldsymbol{Q}$ whose columns are mutually orthogonal unit vectors (i.e., orthonormal vectors). As highlighted by Banerjee and Roy (2014, p. 209), one might, perhaps more properly, call $\boldsymbol{Q}$ an “orthonormal” matrix, but the more conventional name is an “orthogonal” matrix, and we will adopt it hereafter. For further characterizations, properties, and details about orthogonal matrices see, e.g., Lütkepohl (1996, Chapter 9.10), Healy (2000, Chapter 3.5), Schott (2016, Chapter 1.10), and Searle and Khuri (2017, Chapter 5.4). Orthogonal matrices are used extensively in statistics, especially in linear models and multivariate analysis (see, e.g., Graybill, 1976, Chapter 11 and James, 1954).

The $d^{2}$ elements of $\boldsymbol{Q}$ are subject to $d\left(d+1\right)/2$ (orthonormality) constraints. It is therefore not surprising that they can be represented by only $d^{2}-d\left(d+1\right)/2=d\left(d-1\right)/2$ independent parameters. A representation is convenient if $\boldsymbol{Q}$ can be quickly computed from these $d\left(d-1\right)/2$ independent parameters. Such a representation should facilitate the search for an orthogonal matrix that satisfies a certain optimality criterion induced by the chosen estimation method, especially if these independent parameters were real-valued (Khuri, 2003). Methods to parameterize an orthogonal matrix are reviewed in Khuri and Good (1989). A similar problem is bumped into when a $\left(d\times d\right)$ positive-definite matrix $\boldsymbol{\Sigma}$ , often encountered in statistics in the form of a covariance matrix, needs to be estimated. Luckily, in this case, the Cholesky decomposition allows to map the $d\left(d+1\right)/2$ independent parameters of $\boldsymbol{\Sigma}$ with the $d\left(d+1\right)/2$ real-valued elements of a $\left(d\times d\right)$ lower triangular matrix (Pourahmadi, 1999, 2000 and Pourahmadi et al., 2007).

Unfortunately, an analogous of the Cholesky decomposition does not exist for $\boldsymbol{Q}$ . In this paper, we fill the gap by providing the PLR decomposition: similarly to the Cholesky decomposition, it maps $\boldsymbol{Q}$ to a $\left(d\times d\right)$ unit lower triangular matrix $\boldsymbol{L}$ whose $d\left(d-1\right)/2$ entries below the diagonal are real-valued. The PLR decomposition is based on the famous QR and PLU factorizations which are used, in our context, for squared and invertible matrices.

For illustrative purposes, we apply the PLR decomposition in common principal component analysis (CPCA), where the space spanned by the $d$ vectors (principal components) of $\boldsymbol{Q}$ is assumed to be identical across several known groups, whereas the variances associated with the common principal components may vary. When the groups are assumed to be normally distributed, as typically happens, we can use the FG algorithm developed by Flury and Gautschi (1986) for the estimation of $\boldsymbol{Q}$ . Although the FG algorithm is distribution-free (Flury, 1988, p. 71), under non-normal distributions it may provide an orthogonal matrix which does not maximize the likelihood. Motivated by this consideration, we assume groups having a leptokurtic-normal distribution and use the PLR decomposition to allow $\boldsymbol{Q}$ to be estimated by any standard unconstrained maximization routine. The leptokurtic-normal is an heavy-tailed generalization of the normal distribution that can be preferred, for example, in the presence of mild outliers.

The paper is organized as follow. In Section 2 we introduce the PLR decomposition. In Section 3 we first define the CPCA based on leptokurtic-normal groups, and then consider the PLR decomposition in the estimation of the common orthogonal matrix. In Section 4 we illustrate the leptokurtic-normal CPCA in allometric studies by using two well-known biometric data sets, where the method shows its better performance with respect to the consolidated normal CPCA. Nevertheless, we want to stress that our goal is not to propose a new robust method for estimating the common principal components. Instead, we simply want to illustrate how, regardless of the way the orthogonal matrix enters in the considered model, we can use our PLR decomposition, along with any unconstrained optimization routine, to estimate $\boldsymbol{Q}$ , without the need to define ad-hoc estimating algorithms. Finally, we give conclusions and avenues for further research in Section 5.

2 PLR decomposition of orthogonal matrices

Before to present the PLR decomposition for orthogonal matrices, Definitions 2.1 and 2.2 recall the well-known QR and PLU decompositions.

Definition 2.1 (QR decomposition).

If $\boldsymbol{A}$ is a $\left(d\times d\right)$ invertible matrix, then there is a unique $\left(d\times d\right)$ orthogonal matrix $\boldsymbol{Q}$ , and a unique $\left(d\times d\right)$ upper triangular matrix $\boldsymbol{R}$ , such that $\boldsymbol{A}=\boldsymbol{Q}\boldsymbol{R}$ .

Definition 2.2 (PLU decomposition).

If $\boldsymbol{A}$ is a $(d\times d)$ invertible matrix, then there is a $(d\times d)$ permutation matrix $\boldsymbol{P}$ , a $\left(d\times d\right)$ unit lower triangular matrix $\boldsymbol{L}$ , and a $\left(d\times d\right)$ upper triangular matrix $\boldsymbol{U}$ , such that $\boldsymbol{A}=\boldsymbol{P}\boldsymbol{L}\boldsymbol{U}$ .

The following theorem presents our PLR decomposition.

Theorem 2.1 (PLR decomposition).

If $\boldsymbol{Q}$ is a $\left(d\times d\right)$ orthogonal matrix, then it can be factorized as

[TABLE]

where $\boldsymbol{P}$ and $\boldsymbol{L}$ are defined as in Definition 2.2, and $\boldsymbol{R}$ is the upper triangular matrix coming from the QR decomposition of $\boldsymbol{A}=\boldsymbol{P}\boldsymbol{L}$ (see Definition 2.1).

Proof.

Because $\boldsymbol{Q}$ is an orthogonal matrix, it is invertible and, according to Definition 2.2, admits the PLU decomposition

[TABLE]

Because any unit lower triangular matrix is invertible, $\boldsymbol{L}$ , as well as $\boldsymbol{A}=\boldsymbol{P}\boldsymbol{L}$ , are invertible. Then, according to Definition 2.1, $\boldsymbol{A}$ admits the QR decomposition $\boldsymbol{A}=\boldsymbol{Q}\boldsymbol{R}$ , which we recall to be unique. Then, it is easy to verify that $\boldsymbol{U}$ must be equal to $\boldsymbol{R}^{-1}$ , and the theorem is proved. ∎

$\boldsymbol{L}$ is the key matrix of the PLR decomposition in (1); indeed, $\boldsymbol{P}$ can be thought to as a sort of nuisance matrix only affecting the ordering of the columns of $\boldsymbol{Q}$ – and we know that such an ordering is often not of interest – and $\boldsymbol{R}$ is a function of $\boldsymbol{L}$ . In particular, $\boldsymbol{R}$ is the upper triangular matrix coming from the QR decomposition of $\boldsymbol{A}=\boldsymbol{P}\boldsymbol{L}$ . Note that, the number of free elements in $\boldsymbol{L}$ (which are those below the main diagonal) is $d\left(d-1\right)/2$ , as the number of free elements of the orthogonal matrix $\boldsymbol{Q}$ . This means that: 1) any orthogonal matrix admits the PLR decomposition in (1), and 2) any pair $\left\{\boldsymbol{P},\boldsymbol{L}\right\}$ is associated to an orthogonal matrix.

3 Leptokurtic-normal common principal components

The advantages of our PLR decomposition can be appreciated in many statistical fields. Among them there is the common principal component analysis. Below, we give an example by considering groups being distributed according to a multivariate leptokurtic-normal.

3.1 Preliminaries

Let $\left\{\boldsymbol{x}_{ij};\text{$ i=1,\ldots,n_{j} $,$ j=1,\ldots,k $}\right\}$ , with $\boldsymbol{x}_{ij}\in IR^{d}$ , be a set of $n=\sum_{j=1}^{k}n_{j}$ independent observations from $k$ independent groups (or subpopulations) having mean $\boldsymbol{\mu}_{j}$ and covariance matrix $\boldsymbol{\Sigma}_{j}$ . If the inferential interest is on $\boldsymbol{\Sigma}_{1},\ldots,\boldsymbol{\Sigma}_{k}$ , then there is the need to estimate $kd\left(d+1\right)/2$ parameters. Such a number may be excessive when $k$ , but especially $d$ , are large, causing problems in the estimation phase. These problems can often be avoided if $\boldsymbol{\Sigma}_{1},\ldots,\boldsymbol{\Sigma}_{k}$ exhibit some common structure, and several models have been proposed in this direction (see, e.g., Flury, 1984, 1986a, 1987, Boik, 2002, and Greselin et al., 2011). The assessment of a common covariance structure, in addition to allow for parsimony, can provide more information about the group conditional distributions (Greselin and Punzo, 2013) and it is of intrinsic interest in several fields such as biometry (refer to Section 4).

Most of the existing common covariance structures are based on the eigen-decomposition $\boldsymbol{\Sigma}_{j}=\boldsymbol{Q}_{j}\boldsymbol{\Lambda}_{j}\boldsymbol{Q}_{j}^{\prime}$ , $j=1,\ldots,k$ , where $\boldsymbol{\Lambda}_{j}=\mathop{\mathrm{diag}}\left(\lambda_{j1},\ldots,\lambda_{jd}\right)$ and $\boldsymbol{Q}_{j}$ denote the eigenvalues and eigenvectors matrices, respectively. A famous common structure assumes that the $k$ covariance matrices have possibly different eigenvalues but identical eigenvectors, i.e.,

[TABLE]

Model (3) was proposed by Flury (1984) and is known as the common principal components (CPC) model. Flury (1984) also derived the maximum likelihood (ML) estimators of $\boldsymbol{Q}$ and $\boldsymbol{\Lambda}_{j}$ , assuming $d$ -variate normality in each group. The asymptotic distribution of these estimators was studied by Flury (1986b). The corresponding likelihood function can be written as

[TABLE]

where $\boldsymbol{\Psi}=\left\{\boldsymbol{\mu}_{j},\boldsymbol{Q},\boldsymbol{\Lambda}_{j};\text{$ j=1,\ldots,k $}\right\}$ is the whole set of parameters of cardinality $m_{\text{N-CPC}}=kd+d\left(d-1\right)/2+kd$ , $C$ is a constant that does not depend on the parameters, and $\boldsymbol{S}_{j}=n_{j}^{-1}\sum_{i=1}^{n_{j}}\boldsymbol{S}_{ij}$ , with $\boldsymbol{S}_{ij}=\left(\boldsymbol{x}_{ij}-\boldsymbol{\mu}_{j}\right)\left(\boldsymbol{x}_{ij}-\boldsymbol{\mu}_{j}\right)^{\prime}$ . A closed-form solution for the ML estimate of $\boldsymbol{Q}$ does not exist, but the Flury-Gautschi (FG) algorithm of Flury and Gautschi (1986), which is based on pairwise rotations of orthogonal vectors (Flury and Constantine, 1985), can be used to obtain such a solution. The more appropriate the CPC model is, the more able the ML-estimated common orthogonal matrix $\boldsymbol{Q}$ is to simultaneously rotate the sample covariance matrices to nearly diagonal form. Flury (1984) also proposed a likelihood-ratio test having the CPC model under the null and the unconstrained model under the alternative. The monograph by Flury (1988) provides a rigorous treatise of the CPC and related models, detailing their properties and offering several examples, with a special focus on biometric data.

3.2 The model

However, as confirmed by the simulations of Hallin et al. (2010), the CPC model discussed above is quite sensitive to the violation of group-specific multivariate normality (see, e.g., Boente and Orellana, 2001 and Boente et al., 2002, 2006, 2009 for examples of robust CPC approaches). These violations are often due to the presence of mild outliers. Contextualizing in CPCA the definition given by Ritter (2015, p. 4 and pp. 79–80; see also , ), we can define as mild outlier in group $j$ a point that does not deviate from the reference multivariate normal distribution of that group and is not strongly outlying but, rather, it produces an overall group-specific distribution with heavier tails. Therefore, to reduce the influence of these points, more-flexible elliptically symmetric heavy-tailed distributions can be considered (Ritter, 2015, p. 4 and pp. 79). Following this idea, we consider the multivariate leptokurtic-normal distribution (Bagnato et al., 2017) in CPCA. Compared to the normal distribution, the leptokurtic-normal has an additional parameter $\beta$ governing the excess kurtosis and, advantageously with respect to other heavy-tailed elliptical distributions, its parameters correspond to quantities of direct interest (mean, covariance matrix, and excess kurtosis). Such a distribution was successfully applied in the modelling of biometric and financial data (Bagnato et al., 2017 and Maruotti et al., 2019).

The probability density function (pdf) of a $d$ -variate leptokurtic-normal (LN) distribution with mean vector $\boldsymbol{\mu}$ , covariance matrix $\boldsymbol{\Sigma}$ , and excess kurtosis $\beta\in\left[0,\beta_{\max}\right]$ , where $\beta_{\max}=\min\left[4d,4d\left(d+2\right)/5\right]$ , is given by

[TABLE]

where $f_{\text{N}}\left(\cdot;\boldsymbol{\mu},\boldsymbol{\Sigma}\right)$ is the pdf of a $d$ -variate normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ , and

[TABLE]

The constraint $\beta\in\left[0,\beta_{\max}\right]$ is the intersection of: i) $\beta\in\left[0,4d\right]$ , which assures that the pdf is positive elliptical, and ii) $\beta\in\left[0,4d\left(d+2\right)/5\right]$ , which guarantees that the pdf is unimodal. As a special case, $f_{\text{LN}}\left(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{\Sigma},\beta\right)$ coincides with $f_{\text{N}}\left(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{\Sigma}\right)$ for $\beta=0$ .

The log-likelihood function of the CPC model based on leptokurtic-normal groups can be written as

[TABLE]

where $\boldsymbol{\Psi}=\left\{\boldsymbol{\mu}_{j},\boldsymbol{Q},\boldsymbol{\Lambda}_{j},\beta_{j};\text{$ j=1,\ldots,k $}\right\}$ is the whole set of parameters of cardinality $m_{\text{LN-CPC}}=m_{\text{N-CPC}}+k$ .

3.3 Computational details

The maximization of $l_{\text{LN-CPC}}\left(\boldsymbol{\Psi}\right)$ with respect to $\boldsymbol{\Psi}$ does not admit a closed-form solution (see Bagnato et al., 2017, for the case $k=1$ ). Furthermore, the maximization problem is constrained due to $\boldsymbol{Q}$ , $\boldsymbol{\Lambda}_{j}$ , and $\beta_{j}$ , $j=1,\ldots,k$ . Even if we marginalize the objective function with respect to $\beta_{1},\ldots,\beta_{k}$ , the FG-algorithm, which does not depend on the assumption of multivariate normality in the $k$ subpopulations (Flury, 1988, p. 71 and Section 9.3), can not be used to find the ML estimate of $\boldsymbol{Q}$ and $\boldsymbol{\Lambda}_{1},\ldots,\boldsymbol{\Lambda}_{k}$ . All these arguments give us the opportunity to appreciate the advantages of the proposed PLR decomposition.

To make the maximization of $l_{\text{LN-CPC}}\left(\boldsymbol{\Psi}\right)$ unconstrained, we follow a transformation/back-transformation approach from the original constrained parameters to unconstrained real values. The constrained orthogonal matrix $\boldsymbol{Q}$ is mapped to a $\left(d\times d\right)$ unit lower triangular matrix $\boldsymbol{L}$ , having $d\left(d-1\right)/2$ unconstrained real valued entries, via the PLR decomposition

[TABLE]

where we recall that $\boldsymbol{P}$ is a $\left(d\times d\right)$ permutation matrix and $\boldsymbol{R}$ is a $\left(d\times d\right)$ upper triangular matrix depending on $\boldsymbol{L}$ (see Definition 2.1). The back-transformation is

[TABLE]

which can be easily obtained by starting from the QR decomposition of $\boldsymbol{P}\boldsymbol{L}$ . The simple R code (R Core Team, 2018) to compute (7) and (8) is given in Appendix A.1. As concerns the generic diagonal element $\lambda_{jh}$ of $\boldsymbol{\Lambda}_{j}$ , $j=1,\ldots,k$ and $h=1,\ldots,d$ , the transformation is

[TABLE]

with $\widetilde{\lambda}_{jh}\in IR$ , while the back-transformation is

[TABLE]

Finally, regarding $\beta_{j}$ , $j=1,\ldots,k$ , the transformation is

[TABLE]

with $\widetilde{\beta}_{j}\in IR$ , while the back-transformation is

[TABLE]

Based on (7), (9) and (11), in the transformation step of our procedure we maximize the log-likelihood function $l_{\text{LN-CPC}}$ with respect to $\boldsymbol{\mu}$ (which does not require any transformation), $\widetilde{\lambda}_{jh}$ , $\widetilde{\beta}_{j}$ , and to the elements below the diagonal of $\boldsymbol{L}$ , $j=1,\ldots,k$ and $h=1,\ldots,d$ . Operationally, we perform this unconstrained maximization via the general-purpose optimizer optim() for R, included in the stats package. We try two different algorithms (Nelder-Mead and BFGS) for maximization. They can be passed to optim() via the argument method. In the back-transformation step of our procedure, the values of $\boldsymbol{L}$ , $\widetilde{\lambda}_{jh}$ , and $\widetilde{\beta}_{j}$ maximizing $l_{\text{LN-CPC}}$ can be simply inserted in (7), (9) and (11), respectively, in order to obtain the ML estimates of $\boldsymbol{Q}$ , $\lambda_{jh}$ , and $\beta_{j}$ , $j=1,\ldots,k$ and $h=1,\ldots,d$ .

Initial (real) values are required by optim() for maximization. We use the group-specific sample mean vectors for $\boldsymbol{\mu}_{1},\ldots,\boldsymbol{\mu}_{k}$ . For $\boldsymbol{Q}$ and $\boldsymbol{\Lambda}_{1},\ldots,\boldsymbol{\Lambda}_{k}$ we adopt the the simple intuitive procedure proposed by Krzanowski (1984), which is based on the PCA of the pooled sample covariance matrix and the total sample covariance matrix, followed by comparison of their eigenvectors. Finally, to initialize $\beta_{j}$ , $j=1,\ldots,k$ , we use the empirical excess kurtosis if it falls in $\left[0,\beta_{\max}\right]$ (cf. Section 3.2); instead, we put $\beta_{j}=0$ , or $\beta_{j}=\beta_{\max}$ , if the empirical excess kurtosis is lower than 0, or greater than $\beta_{\max}$ , respectively. By means of (8), (10) and (12), the obtained initial values are transformed so to be passed to optim(). From the transformation (7) related to the initial orthogonal matrix $\boldsymbol{Q}$ , we also obtain the permutation matrix $\boldsymbol{P}$ that will be used by optim(). Note that, fixing the permutation matrix to the initial one does not reduce the space of orthogonal matrices considered by optim(), but simply affects the order of the eigenvalues on the diagonal of $\boldsymbol{\Lambda}_{j}$ , $j=1,\ldots,k$ . This means that the ML estimated eigenvalues may not be simultaneously ordered in decreasing order in all groups. However, having eigenvalues in an arbitrary order is not an issue in CPCA (Trendafilov, 2010).

4 Application to allometric studies

Allometric studies are a natural area of application of the leptokurtic-normal CPCA proposed in Section 3. Allometry can be roughly devised as a tool to study the relation between parts (morphometric variables) in various organisms (Huxley, 1993). According to Jolicoeur (1963), allometry can be summarized by the first principal component (PC) of the log-transformed measurements. For the practical and theoretical reasons why it is often useful to transform data to logarithms see, e.g., Pimentel (1979), Reyment (1991), and Bookstein (1997).

When the study involves several groups of specimens, e.g. different sexes or species, the interest is comparing the group-specific allometric patterns. This aim can be handled by comparing the group-specific PCs (see, e.g., Klingenberg, 1996). Comparisons of allometry within several groups often show that the PCs differ only minimally. In these cases, it may be feasible to use CPCA, where the groups are assumed to share a common allometric pattern, i.e., that the major axes of their scatters are parallel (Klingenberg et al., 1996). The amount of variation associated with each PC can, instead, vary between groups. However, classical CPCA implies groups having a multivariate normal distribution, and this could be rather restrictive in some cases (cf. Section 3.2).

Motivated by the above considerations, and using classical real biometric data, we compare: the CPC model based on normal groups (N-CPC), the CPC model based on leptokurtic-normal groups (LN-CPC), the model with unconstrained covariance matrices and normal groups (N-PC), and the model with unconstrained covariance matrices and leptokurtic-normal groups (LN-PC). The whole analysis is conducted in R. Parameters of the competing models are estimated by the ML approach. For uniformity sake, the PLR decomposition is adopted for both the CPC approaches. For the N-CPC model, the transformation/back-transformation approach based on the PLR decomposition provided the same estimates of $\boldsymbol{Q}$ of the FG-algorithm in all our analyses (also those not reported in this paper). Having the competing models a differing number of parameters, their comparison is accomplished, as usual, via the Akaike information criterion (AIC; Akaike, 1974) and the Bayesian information criterion (BIC; Schwarz, 1978) that, in our formulation, need to be maximized. Moreover, likelihood-ratio (LR) tests are used to compare nested models, and this gives rise to new testing procedures. Just as an example, the LR test having the N-CPC model under the null and the LN-PC model as alternative represents a more omnibus version of the LR test proposed by Flury (1984) where a more restrictive N-PC model is considered under the alternative. Using Wilks’ theorem, the LR statistic, under the null, is distributed approximately as a $\chi^{2}$ with degrees of freedom given by the difference in the number of estimated parameters between the alternative and the null model; this allows us to compute a $p$ -value that, in the sequel, will be always compared to the classical 5% significance level.

4.1 Skull dimensions of voles

The first analysis considers the microtus data set accompanying the Flury package (Flury, 2012) for R. The data set contains morphological measurements, for eight variables, on the skulls of 288 specimens of voles found at various places in central Europe. For 89 of the skulls, the chromosomes were analyzed to identify their membership to one of $k=2$ species; $n_{1}=43$ specimens were from microtus multiplex, and $n_{2}=46$ from microtus subterraneus. Species was not determined for the remaining 199 specimens. Airoldi et al. (1995) report a discriminant analysis and finite mixture analysis of this data set; see also Flury (2013, Examples 5.4.4 and 9.5.1).

We analyze the sample of $n=89$ labeled skulls – because we need to know the group of membership of the observations for the application of the competing models – and focus the attention on the logarithm of $d=2$ of the available measurements: the length of palatal bone (in mm/1000) and the skull width across rostrum (in mm/100). The scatter plot of the observations, with symbols diversified with respect to the species, is displayed in Figure 1.

For microtus multiplex voles, the empirical excess kurtosis is 2.341, and the Mardia test of mesokurtosis, as implemented by the mvn() function of the MVN package (Korkmaz et al., 2019), provides a $p$ -value of 0.019; this leads to a rejection of the null hypothesis of mesokurtosis. This also implies a rejection of bivariate normality. Instead, for microtus subterraneus voles, the empirical bivariate excess kurtosis is 1.370, corresponding to a $p$ -value of 0.170 for the Mardia test. So, the microtus multiplex voles motivate the need of a distribution accounting for heavier than normal tails. Moreover, Figure 1 seems to suggest that the two scatters for microtus subterraneus and microtus multiplex have approximately the same orientation (i.e., the same principal components).

To evaluate if a leptokurtic-normal distribution fits the data in each group better, and to assess our conjecture about the similarity between orientations, we proceed with the fitting of the competing models. We could have used the LN distribution only for the microtus multiplex voles, and the fitting procedure outlined in Section 3.3 would have been easily adapted to the case. However, this would have gone beyond the scope of this real data application, which is to show the versatility of our PLR decomposition, jointly with any unconstrained optimization routine, in the estimation of an orthogonal matrix, regardless of the model structure. Table 1 reports the ML-estimated parameters.

As we can note, the models are very similar in terms of estimated mean vectors. Instead, they differ in terms of estimated eigenvectors and eigenvalues matrices (compare N-CPC with LN-CPC and N-PC with LN-PC). This shows how the leptokurtik-normal distributional assumption robustifies the estimation of these matrices with respect to normally distributed groups. The robustifying effect can be particularly noted comparing N-PC with LN-PC models; here, the estimates of the orthogonal matrices differ mainly in group 1 (microtus multiplex), where a larger empirical/estimated excess kurtosis is present. Finally, as concerns the models based on the leptokurtic-normal distribution, the estimates of the excess kurtoses $\beta_{1}$ and $\beta_{2}$ are in line with the empirical excess kurtoses (2.341 and 1.370).

Table 2 presents a model comparison in terms of: number of parameters ( $m$ ), log-likelihood, AIC, and BIC values (Table 3(a)) and with respect to the $p$ -values from the LR tests (Table 3(b)).

AIC and BIC in Table 3(a) indicate LN-CPC as the best model. This means that, with respect to a model with unconstrained covariance matrices, a more parsimonious model allowing for common principal components is to be preferred, but without giving up to groups having a heavier-tailed distribution (the leptokurtic-normal in this case). By looking at Table 3(b), the null N-CPC model of the LR test by Flury (1984) is not rejected ( $p$ -value=0.139). This happens because of the alternative N-PC model used by the test. If we make the test more omnibus by considering the LN-PC model as alternative, then the conclusion changes ( $p$ -value=0.020); interestingly, if we change the null hypothesis of this test using a less constrained LN-CPC model, then the conclusion is in favor of the null ( $p$ -value=0.459). The null N-CPC model is also rejected, with a greater extent, if we use LN-CPC as alternative ( $p$ -value=0.010). Finally, it is also interesting to note that, even if we do not consider a CPC approach, the need for leptokurtic-normal groups arises: the $p$ -value of the test of N-PC versus LN-PC is, indeed, 0.022.

4.2 Head dimensions of young Swiss soldiers

The second analysis considers the swiss.soldiers data set considered by Flury (1988, Example 2.3). The data were collected by a group of anthropologists on approximately 900 Swiss soldiers, most of them recruits, subdivided in $k=2$ groups according to the gender. All the soldiers were 20 years old at the time of investigation and 25 variables were measured on their heads. The purpose of the study, promoted by the the Swiss government in the mid-1980s, was to provide sufficient data to construct new protection masks for the members of the Swiss army.

We start by the subset of the swiss.soldiers data which can be obtained by merging the swiss.head (referred to men) and f.swiss.head (referred to women) data sets included in the Flury package. The merged data contain the values of 6 head measurements for a subsample of $n=259$ soldiers, with $n_{1}=59$ women and $n_{2}=200$ men. A detailed analysis for the men has been given by Flury (2011, Example 10.2) and Flury (2013, Example 1.2). In particular, we focus on the logarithm of $d=3$ of the available head measurements: the minimal frontal breadth (MFB), the true facial height (TFH), and the length from tragion to gnathion (LTG). All measurements are in millimeters. The matrix of pairwise scatter plots, with symbols diversified with respect to the gender, is displayed in Figure 2.

For women, the empirical excess kurtosis is 1.130, and the Mardia test of mesokurtosis provides a $p$ -value of 0.259. As concerns men, the empirical excess kurtosis is 7.621 and the Mardia test of mesokurtosis provides a practically null $p$ -value indicating that a leptokurtic distribution should be better than the normal for that group. So, in this example, the group of men justifies the use of the leptokurtic-normal distribution.

Table 3 shows the ML estimated parameters for the competing models.

As for the example on the microtus data of Section 4.1, the models behave in a similar way in terms of estimated mean vectors, but differ in terms of estimated eigenvectors and eigenvalues matrices if N-CPC is compared with LN-CPC and N-PC with LN-PC. Such a difference is due to the leptokurtic-normal distributional assumption. For the LN-based models, the estimates of the excess kurtoses $\beta_{1}$ and $\beta_{2}$ are in line enough with the empirical excess kurtoses (1.130 and 7.621).

Table 4 presents the model comparison.

AIC and BIC in Table 5(a) select LN-CPC as the best model, with a stronger extent with respect to the previous example. Such a greater extent has analogous implications for the $p$ -values from the LR tests (see Table 5(b)). The null N-CPC model is not rejected if N-PC is considered as alternative model ( $p$ -value=0.278), but it is strongly rejected (with an approximately null $p$ -value) if a LN-based model (LN-CPC or LN-PC) is considered under the alternative. If LN-CPC is considered as the null model, and LN-PC as alternative, then the conclusion is in favor of the null ( $p$ -value=0.408). Even if we do not consider a CPC approach, the need for leptokurtic-normal groups is even more evident in this example: the $p$ -value of the test of N-PC versus LN-PC is, indeed, practically null.

5 Conclusions

Estimating and modelling a $\left(d\times d\right)$ covariance matrix $\boldsymbol{\Sigma}$ is often difficult because of the constraint that $\boldsymbol{\Sigma}$ must be positive definite. The Cholesky decomposition provides a remedy to this problem by mapping the $d\left(d+1\right)/2$ constrained parameters of $\boldsymbol{\Sigma}$ with the $d\left(d+1\right)/2$ unconstrained real-valued elements of a $\left(d\times d\right)$ lower triangular matrix (Pourahmadi, 1999, 2000 and Pourahmadi et al., 2007). Analogously, estimating and modelling a $\left(d\times d\right)$ orthogonal matrix $\boldsymbol{Q}$ is often cumbersome because of its orthonormality constraints. Unfortunately, in this case, an analogous of the Cholesky decomposition does not exist. In this paper we have filled this gap by providing a decomposition for orthogonal matrices, that we have called PLR decomposition, mapping $\boldsymbol{Q}$ to a $\left(d\times d\right)$ unit lower triangular matrix with $d\left(d-1\right)/2$ unconstrained real entries below the diagonal.

For illustrative purposes, we have applied our PLR decomposition in common principal component analysis (CPCA), based on groups having a heavier-tails leptokurtic-normal distribution, for maximum likelihood estimation of the common orthogonal matrix. We have chosen allometry as a natural area of application of the resulting leptokurtic-normal CPCA and the real data analyses have effectively confirmed its good behavior when compared to the classical normal CPCA.

However, the use of the PLR decomposition of an orthogonal matrix is not limited to CPCA, and other statistical models may benefit from its use. Indeed, the PLR decomposition may be used to simplify the ML estimation of the orthogonal matrix related, only to mention a few, to: CPCA based on further non-normal distributions for the groups, other multiple group models allowing for common covariance structures (Flury, 1986a and Greselin and Punzo, 2013), parsimonious model-based clustering, classification and discriminant analysis (Banfield and Raftery, 1993, Flury et al., 1994, Celeux and Govaert, 1995, Fraley and Raftery, 2002, Andrews and McNicholas, 2012, Bagnato et al., 2014, Lin, 2014, Vrbik and McNicholas, 2014, Dang et al., 2015, Punzo et al., 2016, Punzo and McNicholas, 2016, Punzo et al., 2018, Dotto and Farcomeni, 2019), extensions of hidden Markov models (Punzo and Maruotti, 2016 and Maruotti and Punzo, 2017), and sophisticated multivariate distributions (Forbes and Wraith, 2014 and Punzo and Tortora, 2018). We pursue to handle these possibilities in future works.

Appendix A

A.1 PLR decomposition in R

Bibliography65

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Airoldi et al. (1995) Airoldi, J.-P., Flury, B., and Salvioni, M. (1995). Discrimination between two species of microtususing both classified and unclassified observations. Journal of Theoretical Biology , 177 (3), 247–262.
2Akaike (1974) Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control , 19 (6), 716–723.
3Andrews and Mc Nicholas (2012) Andrews, J. L. and Mc Nicholas, P. D. (2012). Model-based clustering, classification, and discriminant analysis with the multivariate t 𝑡 t -distribution: The t 𝑡 t EIGEN family. Statistics and Computing , 22 (5), 1021–1029.
4Bagnato et al. (2014) Bagnato, L., Greselin, F., and Punzo, A. (2014). On the spectral decomposition in normal discriminant analysis. Communications in Statistics - Simulation and Computation , 43 (6), 1471–1489.
5Bagnato et al. (2017) Bagnato, L., Punzo, A., and Zoia, M. G. (2017). The multivariate leptokurtic-normal distribution and its application in model-based clustering. Canadian Journal of Statistics , 45 (1), 95–119.
6Banerjee and Roy (2014) Banerjee, S. and Roy, A. (2014). Linear Algebra and Matrix Analysis for Statistics . Chapman & Hall/CRC Texts in Statistical Science. CRC Press.
7Banfield and Raftery (1993) Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics , 49 (3), 803–821.
8Boente and Orellana (2001) Boente, G. and Orellana, L. (2001). A robust approach to common principal components. In Statistics in Genetics and in the Environmental Sciences , Trends in Mathematics, pages 117–145. Springer, Birkhäuser, Basel.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Unconstrained representation of orthogonal matrices with application to common principle components

Abstract

1 Introduction

2 PLR decomposition of orthogonal matrices

Definition 2.1** (QR decomposition).**

Definition 2.2** (PLU decomposition).**

Theorem 2.1** (PLR decomposition).**

Proof.

3 Leptokurtic-normal common principal components

3.1 Preliminaries

3.2 The model

3.3 Computational details

4 Application to allometric studies

4.1 Skull dimensions of voles

4.2 Head dimensions of young Swiss soldiers

5 Conclusions

Appendix A

A.1 PLR decomposition in R

Definition 2.1 (QR decomposition).

Definition 2.2 (PLU decomposition).

Theorem 2.1 (PLR decomposition).