Mixtures of Generalized Hyperbolic Distributions and Mixtures of Skew-t Distributions for Model-Based Clustering with Incomplete Data

Yuhong Wei; Yang Tang; Paul D. McNicholas

arXiv:1703.02177·stat.ME·November 13, 2018·Comput. Stat. Data Anal.

TL;DR

This paper introduces flexible mixture models based on generalized hyperbolic and skew-t distributions for robust clustering of incomplete, heavy-tailed, and asymmetric data, with an EM algorithm for parameter estimation and missing data imputation.

Contribution

It develops an analytically feasible EM algorithm for mixture models with missing data, extending robust clustering methods to handle arbitrary missing patterns.

Findings

1. The proposed methods outperform mean imputation in clustering accuracy.

2. Simulation studies demonstrate robustness across various missing data proportions.

3. Real data application confirms practical effectiveness.

Abstract

Robust clustering from incomplete data is an important topic because, in many practical situations, real data sets are heavy-tailed, asymmetric, and/or have arbitrary patterns of missing observations. Flexible methods and algorithms for model-based clustering are presented via mixture of the generalized hyperbolic distributions and its limiting case, the mixture of multivariate skew-t distributions. An analytically feasible EM algorithm is formulated for parameter estimation and imputation of missing values for mixture models employing missing at random mechanisms. The proposed methodologies are investigated through a simulation study with varying proportions of synthetic missing values and illustrated using a real dataset. Comparisons are made with those obtained from the traditional mixture of generalized hyperbolic distribution counterparts by filling in the missing data using the…

Figures (3)

Tables (10)

Table 1: Summary of simulated datasets.

Dataset	Distribution	Covariance structure $(𝚺_{g})$	Separation between components
Sim1	MGHD	VEE	Well-separated
Sim2	MGHD	VEE	Overlapping
Sim3	MST	VEI	Well-separated
Sim4	MST	VEI	Overlapping
Sim5	GMM	VEE	Well-separated
Sim6	GMM	VEE	Overlapping

Table 2: Number of missing observations for each pattern.

$r$	Pattern 1	Pattern 2
$5\%$	(10,3,6,1)	(1,6,3,10)
$15\%$	(30,9,18,3)	(3,18,9,30)
$30\%$	(60,18,36,6)	(6,36,18,60)

Table 3: A comparison of averaged BIC, ARI and the number of times (nt) when $G = 2$ is chosen among MGHD, MST, and Mt models on the tumour dataset with $G = 1, \ldots, 4$.

	$r = 0.05$			$r = 0.15$
	Avg.BIC	Avg.ARI	nt	Avg.BIC	Avg.ARI	nt
MGHD	$12145$	$0.65$	$18$	$9654$	$0.58$	$16$
MST	$12661$	$0.55$	$15$	$10574$	$0.56$	$16$
Mt	$13605$	$0.47$	$10$	$11605$	$0.36$	$10$

Table 4: A description of the Pima Indian diabetes dataset.

	No. missing values	Sample mean	Sample std. dev.
Number of times pregnant	0	3.85	3.37
Plasma glucose concentration	5	120.89	31.97
Diastolic blood pressure (mm Hg)	35	69.11	19.36
Triceps skin fold thickness (mm)	227	20.54	15.95
2-hour serum insulin (mu U/mL)	374	79.80	115.24
Body mass index	11	31.99	7.88
Diabetes pedigree function	0	0.47	0.33
Age (years)	0	33.24	11.76

Table 5: The BIC, ICL, selected $𝚺_{g}$ and the correct classification rate for our proposed approaches for clustering on the Pima Indian diabetes dataset.

	$𝚺_{g}$	BIC	ICL	Accuracy
MGHD	EVE	$- 14016.95$	$- 14053.61$	69.11%
MST	VVI	$- 14109.1$	$- 14186.1$	62.37%

Table 6: Summary of key model parameter estimates (standard errors) for the best chosen model (i.e., MGHD with $𝚺_{g} =$ EVE) for the Pima Indian diabetes dataset.

Parameter	$g = 1$	$g = 2$
$μ_{1 g}$	$- 0.80$ (0.11)	2.98 (1.78)
$μ_{2 g}$	$- 0.97$ (0.22)	1.35 (4.01)
$μ_{3 g}$	$- 0.69$ (0.14)	1.10 (2.65)
$μ_{4 g}$	0.15 (0.08)	$- 0.50$ (4.59)
$μ_{5 g}$	$- 1.26$ (1.73)	0.18 (0.25)
$μ_{6 g}$	$- 0.66$ (0.07)	0.57 (0.78)
$μ_{7 g}$	$- 0.74$ (0.12)	$- 2.67$ (8.41)
$μ_{8 g}$	$- 1.20$ (0.31)	$- 2.01$ (2.04)
$β_{1 g}$	0.57 (0.05)	$- 2.18$ (1.79)
$β_{2 g}$	0.77 (0.47)	$- 0.92$ (0.25)
$β_{3 g}$	0.54 (0.40)	$- 0.78$ (1.18)
$β_{4 g}$	0.53 (0.31)	0.58 (0.38)
$β_{5 g}$	0.11 (0.13)	0.11 (0.32)
$β_{6 g}$	0.57 (0.16)	$- 0.37$ (0.51)
$β_{7 g}$	0.63 (0.18)	1.27 (0.46)
$β_{8 g}$	0.87 (0.16)	2.91 (1.85)
$ω_{g}$	2.39 (1.81)	14.18 (6.83)
$λ_{g}$	0.02 (0.34)	$- 3.18$ (4.60)
$π_{g}$	0.71 (0.09)	0.29 (0.10)

Table 7: The nomenclature and scale matrix structure for each member of the GPCM family.

Nomenclature	Volume	Shape	Orientation	$𝚺_{g}$
EII	Equal	Spherical		$λ 𝐈$
VII	Variable	Spherical		$λ_{g} 𝐈$
EEI	Equal	Equal	Axis-Aligned	$λ 𝚫$
VEI	Variable	Equal	Axis-Aligned	$λ_{g} 𝚫$
EVI	Equal	Variable	Axis-Aligned	$λ 𝚫_{g}$
VVI	Variable	Variable	Axis-Aligned	$λ_{g} 𝚫_{g}$
EEE	Equal	Equal	Equal	$λ 𝚪 𝚫 𝚪^{'}$
VEE	Variable	Equal	Equal	$λ_{g} 𝚪 𝚫 𝚪^{'}$
EVE	Equal	Variable	Equal	$λ 𝚪 𝚫_{g} 𝚪^{'}$
EEV	Equal	Equal	Variable	$λ 𝚪_{g} 𝚫 𝚪_{g}^{'}$
VVE	Variable	Variable	Equal	$λ_{g} 𝚪 𝚫_{g} 𝚪^{'}$
VEV	Variable	Equal	Variable	$λ_{g} 𝚪_{g} 𝚫 𝚪_{g}^{'}$
EVV	Equal	Variable	Variable	$λ 𝚪_{g} 𝚫_{g} 𝚪_{g}^{'}$
VVV	Variable	Variable	Variable	$λ_{g} 𝚪_{g} 𝚫_{g} 𝚪_{g}^{'}$
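
The decomposition in Table 7 can be made concrete with a small sketch. The code below builds two scale matrices of the VEE type in two dimensions, that is, a shared orientation matrix and a shared shape matrix with a component-specific volume. The function names, the rotation angle, and the numerical values are illustrative assumptions, not taken from the paper.

```python
import math

def rotation(theta):
    """2x2 orthogonal matrix (the orientation Gamma) for an angle theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def gpcm_scale_matrix(lam, delta_diag, gamma):
    """Return Sigma = lam * Gamma * Delta * Gamma' with Delta diagonal.

    The diagonal of Delta is rescaled so that det(Delta) = 1, which makes
    lam the only term controlling the volume (a convenience of this sketch).
    """
    p = len(delta_diag)
    scale = math.prod(delta_diag) ** (1.0 / p)
    diag = [d / scale for d in delta_diag]
    delta = [[diag[i] if i == j else 0.0 for j in range(p)] for i in range(p)]
    core = matmul(matmul(gamma, delta), transpose(gamma))
    return [[lam * v for v in row] for row in core]

# VEE: volume varies by component, shape and orientation are shared.
gamma = rotation(math.pi / 4)          # hypothetical shared orientation
shape = [4.0, 0.25]                    # hypothetical shared shape, product 1
sigma_1 = gpcm_scale_matrix(1.0, shape, gamma)
sigma_2 = gpcm_scale_matrix(2.0, shape, gamma)
print(sigma_1)
print(sigma_2)
```

Because the shape matrix has unit determinant, the determinant of each scale matrix equals the volume parameter raised to the power p, which is a quick check that the three factors play the roles named in the table.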

Table 8: Key model parameters as well as means, standard deviations and bias of the associated parameter estimations from the 100 runs for the first simulation experiment.

Sim1 using MGHD
$r = 0.05$
	Pattern 1			Pattern 2			MCAR
	Mean	Std. dev.	Bias	Mean	Std. dev.	Bias	Mean	Std. dev.	Bias
$𝝁_{1}$	${(0.70, - 3.30)}^{'}$	${(1.06, 0.94)}^{'}$	${(- 0.30, - 0.30)}^{'}$	${(0.45, - 3.40)}^{'}$	${(1.23, 1.08)}^{'}$	${(- 0.55, - 0.40)}^{'}$	${(0.64, - 3.22)}^{'}$	${(0.86, 0.79)}^{'}$	${(- 0.36, - 0.22)}^{'}$
$𝝁_{2}$	${(- 0.74, 3.47)}^{'}$	${(2.11, 2.49)}^{'}$	${(0.26, 0.47)}^{'}$	${(- 0.57, 3.41)}^{'}$	${(2.62, 2.17)}^{'}$	${(0.43, 0.41)}^{'}$	${(- 0.64, 3.42)}^{'}$	${(2.54, 2.20)}^{'}$	${(0.36, 0.42)}^{'}$
$𝜷_{1}$	${(1.59, 1.59)}^{'}$	${(1.30, 1.16)}^{'}$	${(0.59, 0.59)}^{'}$	${(1.91, 1.74)}^{'}$	${(1.49, 1.35)}^{'}$	${(0.91, 0.74)}^{'}$	${(1.67, 1.50)}^{'}$	${(1.06, 0.98)}^{'}$	${(0.67, 0.50)}^{'}$
$𝜷_{2}$	${(- 1.73, - 1.98)}^{'}$	${(2.50, 2.92)}^{'}$	${(- 0.73, - 0.98)}^{'}$	${(- 1.96, - 1.91)}^{'}$	${(3.10, 2.56)}^{'}$	${(- 0.96, - 0.91)}^{'}$	${(- 1.86, - 1.94)}^{'}$	${(2.32, 2.09)}^{'}$	${(- 0.86, - 0.94)}^{'}$
$𝝁_{1} + 𝜷_{1}$	${(2.29, - 1.71)}^{'}$	${(0.26, 0.25)}^{'}$	${(0.29, 0.29)}^{'}$	${(2.36, - 1.66)}^{'}$	${(0.32, 0.32)}^{'}$	${(0.36, 0.34)}^{'}$	${(2.31, - 1.71)}^{'}$	${(0.25, 0.26)}^{'}$	${(0.31, 0.29)}^{'}$
$𝝁_{2} + 𝜷_{2}$	${(- 2.47, 1.48)}^{'}$	${(0.44, 0.47)}^{'}$	${(- 0.47, - 0.52)}^{'}$	${(- 2.54, 1.50)}^{'}$	${(0.51, 0.43)}^{'}$	${(- 0.54, - 0.50)}^{'}$	${(- 2.50, 1.48)}^{'}$	${(0.50, 0.54)}^{'}$	${(- 0.50, - 0.52)}^{'}$
$𝚺_{1}$	$[\begin{matrix} 1.88 & 1.51 \\ 1.51 & 1.90 \end{matrix}]$	$[\begin{matrix} 0.32 & 0.27 \\ 0.27 & 0.30 \end{matrix}]$	$[\begin{matrix} 0.21 & 0.18 \\ 0.18 & 0.23 \end{matrix}]$	$[\begin{matrix} 1.96 & 1.57 \\ 1.57 & 1.98 \end{matrix}]$	$[\begin{matrix} 0.34 & 0.28 \\ 0.28 & 0.33 \end{matrix}]$	$[\begin{matrix} 0.29 & 0.24 \\ 0.24 & 0.31 \end{matrix}]$	$[\begin{matrix} 1.95 & 1.57 \\ 1.57 & 1.97 \end{matrix}]$	$[\begin{matrix} 0.36 & 0.30 \\ 0.30 & 0.34 \end{matrix}]$	$[\begin{matrix} 0.28 & 0.25 \\ 0.25 & 0.30 \end{matrix}]$
$𝚺_{2}$	$[\begin{matrix} 4.42 & 3.55 \\ 3.55 & 4.48 \end{matrix}]$	$[\begin{matrix} 0.66 & 0.58 \\ 0.58 & 0.68 \end{matrix}]$	$[\begin{matrix} 1.09 & 0.88 \\ 0.88 & 1.15 \end{matrix}]$	$[\begin{matrix} 4.38 & 3.53 \\ 3.53 & 4.43 \end{matrix}]$	$[\begin{matrix} 0.76 & 0.66 \\ 0.66 & 0.76 \end{matrix}]$	$[\begin{matrix} 1.05 & 0.86 \\ 0.86 & 1.10 \end{matrix}]$	$[\begin{matrix} 4.43 & 3.56 \\ 3.56 & 4.50 \end{matrix}]$	$[\begin{matrix} 0.68 & 0.58 \\ 0.58 & 0.68 \end{matrix}]$	$[\begin{matrix} 1.10 & 0.89 \\ 0.89 & 1.17 \end{matrix}]$
$λ_{1}$	$- 2.26$	$1.27$	$- 1.76$	$- 2.70$	$1.70$	$- 2.20$	$- 2.51$	$1.26$	$- 2.01$
$λ_{2}$	$2.93$	$1.40$	$1.93$	$2.79$	$1.40$	$1.79$	$2.88$	$1.40$	$1.88$
$π_{1}$	$0.50$	$0.00$	$0.00$	$0.50$	$0.01$	$0.00$	$0.50$	$0.01$	$0.00$
$π_{2}$	$0.50$	$0.00$	$0.00$	$0.50$	$0.01$	$0.00$	$0.50$	$0.01$	$0.00$
ARI	$0.96$	$0.02$		$0.96$	$0.02$		$0.95$	$0.09$
$r = 0.15$
	Pattern 1			Pattern 2			MCAR
	Mean	Std. dev.	Bias	Mean	Std. dev.	Bias	Mean	Std. dev.	Bias
$𝝁_{1}$	${(0.81, - 3.14)}^{'}$	${(1.14, 1.14)}^{'}$	${(- 0.19, - 0.14)}^{'}$	${(0.72, - 3.24)}^{'}$	${(1.10, 0.89)}^{'}$	${(- 0.28, - 0.24)}^{'}$	${(0.75, - 3.23)}^{'}$	${(1.24, 1.11)}^{'}$	${(- 0.25, - 0.23)}^{'}$
$𝝁_{2}$	${(- 0.69, 3.45)}^{'}$	${(2.58, 2.25)}^{'}$	${(0.31, 0.45)}^{'}$	${(- 0.49, 3.47)}^{'}$	${(2.90, 3.00)}^{'}$	${(0.51, 0.47)}^{'}$	${(- 0.67, 3.34)}^{'}$	${(1.95, 1.78)}^{'}$	${(0.33, 0.34)}^{'}$
$𝜷_{1}$	${(1.46, 1.41)}^{'}$	${(1.40, 1.41)}^{'}$	${(0.46, 0.41)}^{'}$	${(1.38, 1.11)}^{'}$	${(1.49, 1.35)}^{'}$	${(0.38, 0.11)}^{'}$	${(1.59, 1.50)}^{'}$	${(1.59, 1.50)}^{'}$	${(0.59, 0.50)}^{'}$
$𝜷_{2}$	${(- 1.77, - 1.95)}^{'}$	${(3.01, 2.61)}^{'}$	${(- 0.77, - 0.95)}^{'}$	${(- 2.06, - 1.99)}^{'}$	${(3.40, 3.50)}^{'}$	${(- 1.06, - 0.99)}^{'}$	${(- 1.72, - 1.80)}^{'}$	${(3.00, 2.62)}^{'}$	${(- 0.72, - 0.80)}^{'}$
$𝝁_{1} + 𝜷_{1}$	${(2.28, - 1.72)}^{'}$	${(0.31, 0.32)}^{'}$	${(0.28, 0.28)}^{'}$	${(2.29, - 1.69)}^{'}$	${(0.31, 0.27)}^{'}$	${(0.29, 0.31)}^{'}$	${(2.33, - 1.61)}^{'}$	${(0.45, 0.39)}^{'}$	${(0.33, 0.39)}^{'}$
$𝝁_{2} + 𝜷_{2}$	${(- 2.47, 1.51)}^{'}$	${(0.48, 0.40)}^{'}$	${(- 0.47, - 0.49)}^{'}$	${(- 2.55, 1.48)}^{'}$	${(0.53, 0.58)}^{'}$	${(- 0.55, - 0.52)}^{'}$	${(- 2.39, 1.54)}^{'}$	${(0.45, 0.50)}^{'}$	${(- 0.39, - 0.48)}^{'}$
$𝚺_{1}$	$[\begin{matrix} 1.94 & 1.55 \\ 1.55 & 1.95 \end{matrix}]$	$[\begin{matrix} 0.38 & 0.34 \\ 0.34 & 0.38 \end{matrix}]$	$[\begin{matrix} 0.27 & 0.22 \\ 0.22 & 0.28 \end{matrix}]$	$[\begin{matrix} 1.90 & 1.51 \\ 1.51 & 1.90 \end{matrix}]$	$[\begin{matrix} 0.36 & 0.30 \\ 0.30 & 0.33 \end{matrix}]$	$[\begin{matrix} 0.23 & 0.18 \\ 0.18 & 0.23 \end{matrix}]$	$[\begin{matrix} 1.91 & 1.51 \\ 1.51 & 1.93 \end{matrix}]$	$[\begin{matrix} 0.44 & 0.33 \\ 0.33 & 0.43 \end{matrix}]$	$[\begin{matrix} 0.24 & 0.18 \\ 0.18 & 0.26 \end{matrix}]$
$𝚺_{2}$	$[\begin{matrix} 4.28 & 3.41 \\ 3.41 & 4.28 \end{matrix}]$	$[\begin{matrix} 0.72 & 0.62 \\ 0.62 & 0.69 \end{matrix}]$	$[\begin{matrix} 0.95 & 0.74 \\ 0.74 & 0.95 \end{matrix}]$	$[\begin{matrix} 4.30 & 3.43 \\ 3.43 & 4.33 \end{matrix}]$	$[\begin{matrix} 0.67 & 0.60 \\ 0.60 & 0.70 \end{matrix}]$	$[\begin{matrix} 0.74 & 0.76 \\ 0.76 & 0.77 \end{matrix}]$	$[\begin{matrix} 4.42 & 3.55 \\ 3.55 & 4.50 \end{matrix}]$	$[\begin{matrix} 0.73 & 0.68 \\ 0.68 & 0.75 \end{matrix}]$	$[\begin{matrix} 1.09 & 0.88 \\ 0.88 & 1.17 \end{matrix}]$
$λ_{1}$	$- 2.51$	$1.38$	$- 2.01$	$- 2.47$	$1.67$	$- 1.98$	$- 3.14$	$1.77$	$- 2.64$
$λ_{2}$	$3.12$	$1.76$	$2.12$	$3.24$	$1.43$	$2.24$	$2.52$	$1.44$	$1.52$
$π_{1}$	$0.50$	$0.01$	$0.00$	$0.50$	$0.01$	$0.00$	$0.50$	$0.01$	$0.00$
$π_{2}$	$0.50$	$0.01$	$0.00$	$0.50$	$0.01$	$0.00$	$0.50$	$0.01$	$0.00$
ARI	$0.93$	$0.02$		$0.93$	$0.02$		$0.93$	$0.09$
$r = 0.30$
	Pattern 1			Pattern 2			MCAR
	Mean	Std. dev.	Bias	Mean	Std. dev.	Bias	Mean	Std. dev.	Bias
$𝝁_{1}$	${(0.65, - 3.30)}^{'}$	${(1.20, 1.22)}^{'}$	${(- 0.35, - 0.30)}^{'}$	${(0.53, - 3.44)}^{'}$	${(1.23, 1.08)}^{'}$	${(- 0.47, - 0.44)}^{'}$	${(0.39, - 3.49)}^{'}$	${(1.48, 1.53)}^{'}$	${(- 0.61, - 0.49)}^{'}$
$𝝁_{2}$	${(- 0.58, 3.27)}^{'}$	${(1.96, 2.17)}^{'}$	${(0.42, 0.27)}^{'}$	${(- 0.00, 4.09)}^{'}$	${(2.62, 2.17)}^{'}$	${(1.00, 1.09)}^{'}$	${(- 0.81, 3.25)}^{'}$	${(1.89, 1.95)}^{'}$	${(0.19, 0.25)}^{'}$
$𝜷_{1}$	${(1.72, 1.65)}^{'}$	${(1.52, 1.55)}^{'}$	${(0.72, 0.65)}^{'}$	${(1.82, 1.78)}^{'}$	${(1.49, 1.35)}^{'}$	${(0.82, 0.78)}^{'}$	${(1.95, 1.91)}^{'}$	${(1.81, 1.79)}^{'}$	${(0.95, 0.91)}^{'}$
$𝜷_{2}$	${(- 1.89, - 1.72)}^{'}$	${(2.35, 2.57)}^{'}$	${(- 0.89, - 0.72)}^{'}$	${(- 2.63, - 2.75)}^{'}$	${(3.10, 2.56)}^{'}$	${(- 1.63, - 1.75)}^{'}$	${(- 1.62, - 1.78)}^{'}$	${(2.26, 2.22)}^{'}$	${(- 0.62, - 0.78)}^{'}$
$𝝁_{1} + 𝜷_{1}$	${(2.37, - 1.65)}^{'}$	${(0.41, 0.42)}^{'}$	${(0.37, 0.35)}^{'}$	${(2.36, - 1.65)}^{'}$	${(0.32, 0.32)}^{'}$	${(0.36, 0.35)}^{'}$	${(2.35, - 1.58)}^{'}$	${(0.52, 0.47)}^{'}$	${(0.35, 0.42)}^{'}$
$𝝁_{2} + 𝜷_{2}$	${(- 2.47, 1.55)}^{'}$	${(0.45, 0.45)}^{'}$	${(- 0.47, - 0.45)}^{'}$	${(- 2.63, 1.34)}^{'}$	${(0.51, 0.43)}^{'}$	${(- 0.65, - 0.66)}^{'}$	${(- 2.42, 1.47)}^{'}$	${(0.59, 0.50)}^{'}$	${(- 0.42, - 0.53)}^{'}$
$𝚺_{1}$	$[\begin{matrix} 2.00 & 1.60 \\ 1.60 & 2.00 \end{matrix}]$	$[\begin{matrix} 0.41 & 0.35 \\ 0.35 & 0.38 \end{matrix}]$	$[\begin{matrix} 0.33 & 0.27 \\ 0.27 & 0.33 \end{matrix}]$	$[\begin{matrix} 1.90 & 1.51 \\ 1.51 & 1.90 \end{matrix}]$	$[\begin{matrix} 0.34 & 0.28 \\ 0.28 & 0.33 \end{matrix}]$	$[\begin{matrix} 0.23 & 0.18 \\ 0.18 & 0.23 \end{matrix}]$	$[\begin{matrix} 2.00 & 1.60 \\ 1.60 & 1.98 \end{matrix}]$	$[\begin{matrix} 0.56 & 0.48 \\ 0.48 & 0.53 \end{matrix}]$	$[\begin{matrix} 0.33 & 0.27 \\ 0.27 & 0.31 \end{matrix}]$
$𝚺_{2}$	$[\begin{matrix} 4.44 & 3.56 \\ 3.56 & 4.45 \end{matrix}]$	$[\begin{matrix} 0.79 & 0.70 \\ 0.70 & 0.76 \end{matrix}]$	$[\begin{matrix} 1.11 & 0.89 \\ 0.89 & 1.12 \end{matrix}]$	$[\begin{matrix} 4.33 & 3.45 \\ 3.45 & 4.33 \end{matrix}]$	$[\begin{matrix} 0.76 & 0.66 \\ 0.66 & 0.76 \end{matrix}]$	$[\begin{matrix} 1.00 & 0.78 \\ 0.78 & 1.00 \end{matrix}]$	$[\begin{matrix} 4.37 & 3.49 \\ 3.49 & 4.32 \end{matrix}]$	$[\begin{matrix} 0.96 & 0.85 \\ 0.85 & 0.88 \end{matrix}]$	$[\begin{matrix} 1.04 & 0.82 \\ 0.82 & 0.99 \end{matrix}]$
$λ_{1}$	$- 2.26$	$1.34$	$- 1.76$	$- 2.73$	$1.70$	$- 2.23$	$- 2.61$	$1.90$	$- 2.11$
$λ_{2}$	$2.83$	$1.57$	$1.83$	$2.74$	$1.40$	$1.74$	$2.37$	$1.72$	$1.37$
$π_{1}$	$0.50$	$0.01$	$0.00$	$0.50$	$0.01$	$0.00$	$0.50$	$0.01$	$0.00$
$π_{2}$	$0.50$	$0.01$	$0.00$	$0.50$	$0.01$	$0.00$	$0.50$	$0.01$	$0.00$
ARI	$0.90$	$0.03$		$0.90$	$0.03$		$0.90$	$0.09$

Table 9: Key model parameters as well as means, standard deviations and bias of the associated parameter estimations from the 100 runs for the first simulation experiment.

Sim3 using MST
$r = 0.05$
	Pattern 1			Pattern 2			MCAR
	Mean	Std. dev.	Bias	Mean	Std. dev.	Bias	Mean	Std. dev.	Bias
$𝝁_{1}$	${(1.05, - 3.18)}^{'}$	${(0.47, 0.36)}^{'}$	${(0.05, - 0.18)}^{'}$	${(0.98, - 3.12)}^{'}$	${(0.48, 0.35)}^{'}$	${(- 0.02, - 0.12)}^{'}$	${(1.00, - 3.14)}^{'}$	${(0.47, 0.36)}^{'}$	${(0.00, - 0.14)}^{'}$
$𝝁_{2}$	${(- 0.80, 3.13)}^{'}$	${(0.58, 0.38)}^{'}$	${(0.20, 0.13)}^{'}$	${(- 0.78, 3.22)}^{'}$	${(0.58, 0.40)}^{'}$	${(0.22, 0.22)}^{'}$	${(- 0.77, 3.21)}^{'}$	${(0.58, 0.43)}^{'}$	${(0.23, 0.21)}^{'}$
$𝜷_{1}$	${(0.93, 1.25)}^{'}$	${(0.48, 0.39)}^{'}$	${(- 0.07, 0.25)}^{'}$	${(0.95, 1.23)}^{'}$	${(0.45, 0.35)}^{'}$	${(- 0.05, 0.23)}^{'}$	${(0.96, 1.20)}^{'}$	${(0.45, 0.35)}^{'}$	${(- 0.04, 0.20)}^{'}$
$𝜷_{2}$	${(- 1.06, - 1.18)}^{'}$	${(0.56, 0.38)}^{'}$	${(- 0.06, - 0.18)}^{'}$	${(- 1.16, - 1.19)}^{'}$	${(0.55, 0.42)}^{'}$	${(- 0.16, - 0.19)}^{'}$	${(- 1.13, - 1.26)}^{'}$	${(0.60, 0.47)}^{'}$	${(- 0.13, - 0.26)}^{'}$
$𝝁_{1} + 𝜷_{1}$	${(1.98, - 1.93)}^{'}$	${(0.16, 0.08)}^{'}$	${(- 0.02, 0.07)}^{'}$	${(1.93, - 1.89)}^{'}$	${(0.12, 0.07)}^{'}$	${(- 0.07, 0.11)}^{'}$	${(1.96, - 1.93)}^{'}$	${(0.12, 0.08)}^{'}$	${(- 0.04, 0.07)}^{'}$
$𝝁_{2} + 𝜷_{2}$	${(- 1.87, 1.94)}^{'}$	${(0.23, 0.10)}^{'}$	${(0.13, - 0.06)}^{'}$	${(- 1.94, 2.03)}^{'}$	${(0.22, 0.10)}^{'}$	${(0.06, 0.03)}^{'}$	${(- 1.90, 1.95)}^{'}$	${(0.26, 0.11)}^{'}$	${(0.10, - 0.05)}^{'}$
$𝚺_{1}$	$[\begin{matrix} 3.36 & 0 \\ 0 & 0.34 \end{matrix}]$	$[\begin{matrix} 0.46 & 0 \\ 0 & 0.08 \end{matrix}]$	$[\begin{matrix} 0.36 & 0 \\ 0 & 0.01 \end{matrix}]$	$[\begin{matrix} 3.33 & 0 \\ 0 & 0.34 \end{matrix}]$	$[\begin{matrix} 0.42 & 0 \\ 0 & 0.08 \end{matrix}]$	$[\begin{matrix} 0.33 & 0 \\ 0 & 0.01 \end{matrix}]$	$[\begin{matrix} 3.32 & 0 \\ 0 & 0.35 \end{matrix}]$	$[\begin{matrix} 0.52 & 0 \\ 0 & 0.09 \end{matrix}]$	$[\begin{matrix} 0.32 & 0 \\ 0 & 0.02 \end{matrix}]$
$𝚺_{2}$	$[\begin{matrix} 6.57 & 0 \\ 0 & 0.66 \end{matrix}]$	$[\begin{matrix} 1.02 & 0 \\ 0 & 0.14 \end{matrix}]$	$[\begin{matrix} 0.57 & 0 \\ 0 & - 0.01 \end{matrix}]$	$[\begin{matrix} 6.58 & 0 \\ 0 & 0.67 \end{matrix}]$	$[\begin{matrix} 1.16 & 0 \\ 0 & 0.15 \end{matrix}]$	$[\begin{matrix} 0.58 & 0 \\ 0 & 0.00 \end{matrix}]$	$[\begin{matrix} 6.51 & 0 \\ 0 & 0.67 \end{matrix}]$	$[\begin{matrix} 1.14 & 0 \\ 0 & 0.15 \end{matrix}]$	$[\begin{matrix} 0.51 & 0 \\ 0 & 0.00 \end{matrix}]$
$ν_{1}$	$8.25$	$3.18$	$1.25$	$8.14$	$3.01$	$1.14$	$8.00$	$2.77$	$1.00$
$ν_{2}$	$5.89$	$1.98$	$0.89$	$5.79$	$1.40$	$0.79$	$6.35$	$2.48$	$1.35$
$π_{1}$	$0.50$	$0.02$	$0.00$	$0.50$	$0.01$	$0.00$	$0.50$	$0.02$	$0.00$
$π_{2}$	$0.50$	$0.02$	$0.00$	$0.50$	$0.01$	$0.00$	$0.50$	$0.02$	$0.00$
ARI	$0.81$	$0.03$		$0.96$	$0.02$		$0.81$	$0.09$
$r = 0.15$
	Pattern 1			Pattern 2			MCAR
	Mean	Std. dev.	Bias	Mean	Std. dev.	Bias	Mean	Std. dev.	Bias
$𝝁_{1}$	${(0.98, - 3.26)}^{'}$	${(0.65, 0.43)}^{'}$	${(- 0.02, - 0.26)}^{'}$	${(0.96, - 3.26)}^{'}$	${(0.59, 0.43)}^{'}$	${(- 0.04, - 0.26)}^{'}$	${(0.84, - 3.26)}^{'}$	${(0.90, 0.52)}^{'}$	${(- 0.16, - 0.26)}^{'}$
$𝝁_{2}$	${(- 0.76, 3.22)}^{'}$	${(0.50, 0.42)}^{'}$	${(0.24, 0.22)}^{'}$	${(- 0.89, 3.21)}^{'}$	${(0.47, 0.37)}^{'}$	${(0.11, 0.21)}^{'}$	${(- 0.73, 3.25)}^{'}$	${(0.64, 0.60)}^{'}$	${(0.27, 0.25)}^{'}$
$𝜷_{1}$	${(0.99, 1.33)}^{'}$	${(0.66, 0.45)}^{'}$	${(- 0.01, 0.33)}^{'}$	${(0.98, 1.32)}^{'}$	${(0.55, 0.44)}^{'}$	${(- 0.02, 0.32)}^{'}$	${(1.08, 1.34)}^{'}$	${(0.89, 0.60)}^{'}$	${(0.08, 0.34)}^{'}$
$𝜷_{2}$	${(- 1.09, - 1.28)}^{'}$	${(0.49, 0.44)}^{'}$	${(- 0.09, - 0.28)}^{'}$	${(- 1.02, - 1.27)}^{'}$	${(0.48, 0.38)}^{'}$	${(- 0.02, - 0.27)}^{'}$	${(- 1.12, - 1.29)}^{'}$	${(0.46, 0.52)}^{'}$	${(- 0.12, - 0.29)}^{'}$
$𝝁_{1} + 𝜷_{1}$	${(1.98, - 1.93)}^{'}$	${(0.14, 0.08)}^{'}$	${(- 0.02, 0.07)}^{'}$	${(1.94, - 1.94)}^{'}$	${(0.15, 0.08)}^{'}$	${(- 0.06, 0.06)}^{'}$	${(1.92, - 1.92)}^{'}$	${(0.25, 0.20)}^{'}$	${(- 0.08, 0.08)}^{'}$
$𝝁_{2} + 𝜷_{2}$	${(- 1.86, 1.94)}^{'}$	${(0.22, 0.10)}^{'}$	${(0.14, - 0.06)}^{'}$	${(- 1.91, 1.93)}^{'}$	${(0.22, 0.11)}^{'}$	${(0.09, 0.07)}^{'}$	${(- 1.86, 1.95)}^{'}$	${(0.32, 0.22)}^{'}$	${(0.14, - 0.05)}^{'}$
$𝚺_{1}$	$[\begin{matrix} 3.35 & 0 \\ 0 & 0.32 \end{matrix}]$	$[\begin{matrix} 0.51 & 0 \\ 0 & 0.08 \end{matrix}]$	$[\begin{matrix} 0.35 & 0 \\ 0 & - 0.01 \end{matrix}]$	$[\begin{matrix} 3.38 & 0 \\ 0 & 0.33 \end{matrix}]$	$[\begin{matrix} 0.55 & 0 \\ 0 & 0.01 \end{matrix}]$	$[\begin{matrix} 0.38 & 0 \\ 0 & 0.00 \end{matrix}]$	$[\begin{matrix} 3.36 & 0 \\ 0 & 0.33 \end{matrix}]$	$[\begin{matrix} 0.61 & 0 \\ 0 & 0.09 \end{matrix}]$	$[\begin{matrix} 0.36 & 0 \\ 0 & 0.00 \end{matrix}]$
$𝚺_{2}$	$[\begin{matrix} 6.84 & 0 \\ 0 & 0.65 \end{matrix}]$	$[\begin{matrix} 1.26 & 0 \\ 0 & 0.15 \end{matrix}]$	$[\begin{matrix} 0.84 & 0 \\ 0 & - 0.02 \end{matrix}]$	$[\begin{matrix} 6.79 & 0 \\ 0 & 0.65 \end{matrix}]$	$[\begin{matrix} 1.13 & 0 \\ 0 & 0.17 \end{matrix}]$	$[\begin{matrix} 0.79 & 0 \\ 0 & - 0.02 \end{matrix}]$	$[\begin{matrix} 6.55 & 0 \\ 0 & 0.63 \end{matrix}]$	$[\begin{matrix} 1.44 & 0 \\ 0 & 0.17 \end{matrix}]$	$[\begin{matrix} 0.55 & 0 \\ 0 & - 0.04 \end{matrix}]$
$ν_{1}$	$8.65$	$4.21$	$1.65$	$9.27$	$5.90$	$2.27$	$9.37$	$6.46$	$2.37$
$ν_{2}$	$6.35$	$1.91$	$1.35$	$6.13$	$1.69$	$1.13$	$6.57$	$2.88$	$1.57$
$π_{1}$	$0.50$	$0.01$	$0.00$	$0.50$	$0.01$	$0.00$	$0.50$	$0.02$	$0.00$
$π_{2}$	$0.50$	$0.01$	$0.00$	$0.50$	$0.01$	$0.00$	$0.50$	$0.02$	$0.00$
ARI	$0.78$	$0.04$		$0.78$	$0.04$		$0.74$	$0.04$
$r = 0.30$
	Pattern 1			Pattern 2			MCAR
	Mean	Std. dev.	Bias	Mean	Std. dev.	Bias	Mean	Std. dev.	Bias
$𝝁_{1}$	${(0.95, - 3.30)}^{'}$	${(0.53, 0.47)}^{'}$	${(- 0.05, - 0.30)}^{'}$	${(0.96, - 3.20)}^{'}$	${(0.52, 0.42)}^{'}$	${(- 0.04, - 0.20)}^{'}$	${(0.74, - 3.21)}^{'}$	${(0.90, 0.47)}^{'}$	${(- 0.26, - 0.21)}^{'}$
$𝝁_{2}$	${(- 0.78, 3.27)}^{'}$	${(0.58, 0.41)}^{'}$	${(0.22, 0.27)}^{'}$	${(- 0.72, 3.22)}^{'}$	${(0.60, 0.38)}^{'}$	${(0.28, 0.22)}^{'}$	${(- 0.45, 3.24)}^{'}$	${(1.18, 0.51)}^{'}$	${(0.55, 0.24)}^{'}$
$𝜷_{1}$	${(1.05, 1.27)}^{'}$	${(0.51, 0.47)}^{'}$	${(0.05, 0.27)}^{'}$	${(0.96, 1.25)}^{'}$	${(0.50, 0.43)}^{'}$	${(- 0.04, 0.25)}^{'}$	${(1.10, 1.29)}^{'}$	${(0.93, 0.49)}^{'}$	${(0.10, 0.29)}^{'}$
$𝜷_{2}$	${(- 1.19, - 1.27)}^{'}$	${(1.35, 0.57)}^{'}$	${(- 0.19, - 0.27)}^{'}$	${(- 1.16, - 1.30)}^{'}$	${(0.55, 0.43)}^{'}$	${(- 0.16, - 0.30)}^{'}$	${(- 1.31, - 1.32)}^{'}$	${(1.18, 0.54)}^{'}$	${(- 0.31, - 0.32)}^{'}$
$𝝁_{1} + 𝜷_{1}$	${(2.00, - 2.03)}^{'}$	${(0.12, 0.09)}^{'}$	${(0.00, - 0.03)}^{'}$	${(1.92, - 1.94)}^{'}$	${(0.17, 0.09)}^{'}$	${(- 0.08, 0.06)}^{'}$	${(1.84, - 1.92)}^{'}$	${(0.19, 0.10)}^{'}$	${(- 0.16, 0.08)}^{'}$
$𝝁_{2} + 𝜷_{2}$	${(- 1.97, 2.00)}^{'}$	${(0.25, 0.11)}^{'}$	${(0.03, 0.00)}^{'}$	${(- 1.87, 1.92)}^{'}$	${(0.25, 0.13)}^{'}$	${(0.13, - 0.08)}^{'}$	${(- 1.76, 1.92)}^{'}$	${(0.29, 0.14)}^{'}$	${(0.24, - 0.08)}^{'}$
$𝚺_{1}$	$[\begin{matrix} 3.26 & 0 \\ 0 & 0.33 \end{matrix}]$	$[\begin{matrix} 0.54 & 0 \\ 0 & 0.13 \end{matrix}]$	$[\begin{matrix} 0.26 & 0 \\ 0 & 0.00 \end{matrix}]$	$[\begin{matrix} 3.26 & 0 \\ 0 & 0.33 \end{matrix}]$	$[\begin{matrix} 0.52 & 0 \\ 0 & 0.10 \end{matrix}]$	$[\begin{matrix} 0.26 & 0 \\ 0 & 0.00 \end{matrix}]$	$[\begin{matrix} 3.24 & 0 \\ 0 & 0.36 \end{matrix}]$	$[\begin{matrix} 0.61 & 0 \\ 0 & 0.14 \end{matrix}]$	$[\begin{matrix} 0.24 & 0 \\ 0 & 0.03 \end{matrix}]$
$𝚺_{2}$	$[\begin{matrix} 6.53 & 0 \\ 0 & 0.67 \end{matrix}]$	$[\begin{matrix} 1.53 & 0 \\ 0 & 0.20 \end{matrix}]$	$[\begin{matrix} 0.53 & 0 \\ 0 & 0.00 \end{matrix}]$	$[\begin{matrix} 6.65 & 0 \\ 0 & 0.67 \end{matrix}]$	$[\begin{matrix} 1.23 & 0 \\ 0 & 0.18 \end{matrix}]$	$[\begin{matrix} 0.65 & 0 \\ 0 & 0.00 \end{matrix}]$	$[\begin{matrix} 6.35 & 0 \\ 0 & 0.67 \end{matrix}]$	$[\begin{matrix} 1.63 & 0 \\ 0 & 0.20 \end{matrix}]$	$[\begin{matrix} 0.35 & 0 \\ 0 & 0.00 \end{matrix}]$
$ν_{1}$	$8.56$	$4.12$	$1.56$	$8.42$	$3.42$	$1.42$	$9.19$	$4.71$	$2.19$
$ν_{2}$	$6.83$	$2.57$	$1.83$	$6.34$	$2.02$	$1.34$	$7.26$	$4.37$	$2.26$
$π_{1}$	$0.50$	$0.01$	$0.00$	$0.50$	$0.02$	$0.00$	$0.50$	$0.02$	$0.00$
$π_{2}$	$0.50$	$0.01$	$0.00$	$0.50$	$0.02$	$0.00$	$0.50$	$0.02$	$0.00$
ARI	$0.75$	$0.05$		$0.76$	$0.05$		$0.75$	$0.05$

Table 10: A comparison of average BIC and ARI between MGHD, MST, and Mt models (replications = 100) with $G = 1, \ldots, 4$.

		MGHD		MST		Mt
		BIC	ARI	BIC	ARI	BIC	ARI
Sim1	$r = 0.05$	$- 1534$	0.95	$- 1644$	0.88	$- 1663$	0.75
	$r = 0.15$	$- 1412$	0.87	$- 1517$	0.82	$- 1559$	0.69
	$r = 0.30$	$- 1230$	0.74	$- 1301$	0.69	$- 1396$	0.60
Sim2	$r = 0.05$	$- 1647$	0.73	$- 1683$	0.64	$- 1823$	0.59
	$r = 0.15$	$- 1435$	0.62	$- 1538$	0.52	$- 1677$	0.48
	$r = 0.30$	$- 1201$	0.46	$- 1266$	0.36	$- 1463$	0.36
Sim3	$r = 0.05$	$- 1667$	0.82	$- 1689$	0.76	$- 1789$	0.64
	$r = 0.15$	$- 1517$	0.76	$- 1502$	0.66	$- 1622$	0.63
	$r = 0.30$	$- 1203$	0.70	$- 1264$	0.60	$- 1410$	0.48
Sim4	$r = 0.05$	$- 1546$	0.72	$- 1608$	0.41	$- 1849$	0.33
	$r = 0.15$	$- 1333$	0.60	$- 1440$	0.37	$- 1727$	0.27
	$r = 0.30$	$- 1142$	0.12	$- 1171$	0.23	$- 1385$	0.20
Sim5	$r = 0.05$	$- 1507$	0.94	$- 1613$	0.74	$- 1619$	0.88
	$r = 0.15$	$- 1366$	0.85	$- 1507$	0.66	$- 1450$	0.78
	$r = 0.30$	$- 1193$	0.71	$- 1340$	0.59	$- 1247$	0.64
Sim6	$r = 0.05$	$- 1356$	0.68	$- 1445$	0.40	$- 1614$	0.38
	$r = 0.15$	$- 1262$	0.58	$- 1389$	0.38	$- 1522$	0.35
	$r = 0.30$	$- 1130$	0.40	$- 1263$	0.28	$- 1385$	0.29
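
The ARI values reported in the tables compare an estimated partition with the true labels. As a reminder of what the index measures, here is a minimal sketch of the adjusted Rand index in its usual pair-counting form; the function name and the toy label vectors are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same observations."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one observation: the partitions trivially agree

    # Pair counts within contingency-table cells, row margins, column margins.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / total_pairs   # agreement expected by chance
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one group
    return (sum_cells - expected) / (maximum - expected)

# Label names do not matter, only the grouping does.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

A value of 1 indicates identical partitions, values near 0 indicate agreement no better than chance, and negative values are possible.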

Equations (197)

f_{\text{GIG}}(w \mid \lambda, \chi, \psi) = \frac{(\psi/\chi)^{\lambda/2} w^{\lambda-1}}{2 K_{\lambda}(\sqrt{\psi\chi})} \exp\left\{-\frac{\psi w + \chi/w}{2}\right\},

\mathbb{E}[W^{\alpha}] = \left(\frac{\chi}{\psi}\right)^{\alpha/2} \frac{K_{\lambda+\alpha}(\sqrt{\psi\chi})}{K_{\lambda}(\sqrt{\psi\chi})},

\mathbb{E}[\log W] = \log\sqrt{\frac{\chi}{\psi}} + \frac{\partial}{\partial\lambda} \log K_{\lambda}(\sqrt{\psi\chi}).
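
The GIG moment formula above can be evaluated numerically once the modified Bessel function of the third kind is available. The sketch below is a pure-Python illustration: it approximates the Bessel function through its standard integral representation and takes the Bessel argument to be sqrt(chi * psi), the convention that appears in the density f(x | ϑ) later in this list. The function names and the test values are assumptions made for this example; a production implementation would more likely call a numerical library.

```python
import math

def bessel_k(order, x, steps=4000):
    """Approximate K_order(x) for x > 0.

    Uses K_v(x) = integral over t in [0, inf) of exp(-x cosh t) cosh(v t),
    evaluated with the trapezoidal rule on a truncated range.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    # At t_max the factor exp(-x cosh t) has dropped to exp(-x - 60).
    t_max = math.acosh(1.0 + 60.0 / x)
    h = t_max / steps
    total = 0.5 * math.exp(-x)  # endpoint t = 0, where cosh(0) = 1
    for k in range(1, steps + 1):
        t = k * h
        weight = 0.5 if k == steps else 1.0
        total += weight * math.exp(-x * math.cosh(t)) * math.cosh(order * t)
    return h * total

def gig_moment(alpha, lam, chi, psi):
    """E[W^alpha] for W ~ GIG(lam, chi, psi), following the formula above."""
    root = math.sqrt(chi * psi)
    ratio = bessel_k(lam + alpha, root) / bessel_k(lam, root)
    return (chi / psi) ** (alpha / 2.0) * ratio

# With lam = -1/2 the GIG reduces to an inverse Gaussian, whose mean is
# sqrt(chi / psi) because K_{1/2} and K_{-1/2} coincide.
print(gig_moment(1.0, -0.5, 2.0, 8.0))    # E[W]
print(gig_moment(-1.0, -0.5, 2.0, 8.0))   # E[1/W]
```

The two printed quantities are the expectations E[W] and E[1/W] that an EM algorithm for this family needs in its E-step.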

\mathbb{E}[W]

\mathbb{E}[1/W]

f_{I}(w \mid \lambda, \eta, \omega) = \frac{(w/\eta)^{\lambda-1}}{2\eta K_{\lambda}(\omega)} \exp\left\{-\frac{\omega}{2}\left(\frac{w}{\eta} + \frac{\eta}{w}\right)\right\}

\mathbf{X} = \boldsymbol{\mu} + W\boldsymbol{\alpha} + \sqrt{W}\,\mathbf{U},

f(\mathbf{x} \mid \boldsymbol{\vartheta}) = \left[\frac{\chi + \delta(\mathbf{x}, \boldsymbol{\mu} \mid \boldsymbol{\Sigma})}{\psi + \boldsymbol{\alpha}^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\alpha}}\right]^{\frac{\lambda - p/2}{2}} \frac{(\psi/\chi)^{\lambda/2} K_{\lambda - p/2}\left(\sqrt{(\chi + \delta(\mathbf{x}, \boldsymbol{\mu} \mid \boldsymbol{\Sigma}))(\psi + \boldsymbol{\alpha}^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\alpha})}\right)}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2} K_{\lambda}(\sqrt{\chi\psi}) \exp\{-(\mathbf{x} - \boldsymbol{\mu})^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\alpha}\}},

\mathbb{E}(\mathbf{X}) = \boldsymbol{\mu} + \mathbb{E}(W)\boldsymbol{\alpha} \quad\text{and}\quad \mathbb{V}\text{ar}(\mathbf{X}) = \mathbb{E}(W)\boldsymbol{\Sigma} + \mathbb{V}\text{ar}(W)\boldsymbol{\alpha}\boldsymbol{\alpha}^{\intercal},
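
The mean and covariance identities above translate directly into code once the first two moments of W are known. The following sketch takes those moments as inputs rather than deriving them; the function name and all numerical values are hypothetical and serve only to illustrate the arithmetic.

```python
def variance_mean_mixture_moments(mu, alpha, sigma, mean_w, var_w):
    """Mean vector and covariance matrix of X = mu + W*alpha + sqrt(W)*U.

    mean_w and var_w are E(W) and Var(W), supplied by the caller.
    """
    p = len(mu)
    mean_x = [mu[j] + mean_w * alpha[j] for j in range(p)]
    cov_x = [[mean_w * sigma[j][k] + var_w * alpha[j] * alpha[k]
              for k in range(p)] for j in range(p)]
    return mean_x, cov_x

# Hypothetical inputs: identity scale matrix, equal skewness in both coordinates.
mean_x, cov_x = variance_mean_mixture_moments(
    mu=[1.0, -3.0],
    alpha=[1.0, 1.0],
    sigma=[[1.0, 0.0], [0.0, 1.0]],
    mean_w=2.0,
    var_w=0.5,
)
print(mean_x)
print(cov_x)
```

Note that the skewness vector contributes to the off-diagonal covariance even when the scale matrix is diagonal, which is why the scale matrix should not be read as the covariance of X.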

\mathbf{X} = \boldsymbol{\mu} + W\boldsymbol{\beta} + \sqrt{W}\,\mathbf{U},

f(\mathbf{x} \mid \boldsymbol{\vartheta}) = \left[\frac{\omega + \delta(\mathbf{x}, \boldsymbol{\mu} \mid \boldsymbol{\Sigma})}{\omega + \boldsymbol{\beta}^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}}\right]^{\frac{\lambda - p/2}{2}} \frac{K_{\lambda - p/2}\left(\sqrt{(\omega + \delta(\mathbf{x}, \boldsymbol{\mu} \mid \boldsymbol{\Sigma}))(\omega + \boldsymbol{\beta}^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta})}\right)}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2} K_{\lambda}(\omega) \exp\{-(\mathbf{x} - \boldsymbol{\mu})^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}\}},

\boldsymbol{\mu} = \begin{pmatrix}\boldsymbol{\mu}_{1}\\ \boldsymbol{\mu}_{2}\end{pmatrix},

\mathbf{Y} \sim \text{GHD}_{k}(\lambda, \omega, \mathbf{B}\boldsymbol{\mu} + \mathbf{b}, \mathbf{B}\boldsymbol{\Sigma}\mathbf{B}^{\intercal}, \mathbf{B}\boldsymbol{\beta}),

\lambda_{2 \mid 1}

\psi_{2 \mid 1}

\boldsymbol{\Sigma}_{2 \mid 1}

\mathbf{X} = \boldsymbol{\mu} + W\boldsymbol{\beta} + \sqrt{W}\,\mathbf{U},
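
The stochastic representation above doubles as a recipe for simulation. The sketch below draws from a skew-t distribution under the assumption, standard for this limiting case but not stated in this list, that W follows an inverse-gamma distribution with both parameters equal to v/2. The function name, the lower-triangular factor passed as `chol`, and the parameter values are illustrative choices.

```python
import math
import random

def sample_skew_t(n, nu, mu, beta, chol, rng):
    """Draw n vectors X = mu + W*beta + sqrt(W)*U with U ~ N(0, L L').

    chol is the lower-triangular factor L of the scale matrix.
    Assumes W ~ inverse-gamma(nu/2, nu/2).
    """
    p = len(mu)
    draws = []
    for _ in range(n):
        # Reciprocal of a Gamma(shape = nu/2, scale = 2/nu) variate.
        w = 1.0 / rng.gammavariate(nu / 2.0, 2.0 / nu)
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        u = [sum(chol[j][k] * z[k] for k in range(j + 1)) for j in range(p)]
        draws.append([mu[j] + w * beta[j] + math.sqrt(w) * u[j] for j in range(p)])
    return draws

rng = random.Random(2018)                 # fixed seed for reproducibility
identity = [[1.0, 0.0], [0.0, 1.0]]       # hypothetical scale factor
sample = sample_skew_t(5, 7.0, [1.0, -3.0], [1.0, 1.0], identity, rng)
for point in sample:
    print(point)
```

Under this assumption E(W) = v / (v - 2) for v > 2, so the sample mean of a large draw should approach the location vector plus that multiple of the skewness vector.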

f(\mathbf{x} \mid \boldsymbol{\vartheta}) = \left[\frac{v + \delta(\mathbf{x}, \boldsymbol{\mu} \mid \boldsymbol{\Sigma})}{\boldsymbol{\beta}^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}}\right]^{\frac{-v-p}{4}} \frac{v^{v/2} K_{(-v-p)/2}\left(\sqrt{(v + \delta(\mathbf{x}, \boldsymbol{\mu} \mid \boldsymbol{\Sigma}))(\boldsymbol{\beta}^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta})}\right)}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2} \Gamma(v/2) 2^{v/2-1} \exp\{-(\mathbf{x} - \boldsymbol{\mu})^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}\}}.

\boldsymbol{\mu} = \begin{pmatrix}\boldsymbol{\mu}_{1}\\ \boldsymbol{\mu}_{2}\end{pmatrix}

\mathbf{Y} \sim \text{ST}_{k}(v, \mathbf{B}\boldsymbol{\mu} + \mathbf{b}, \mathbf{B}\boldsymbol{\Sigma}\mathbf{B}^{\intercal}, \mathbf{B}\boldsymbol{\beta}).

\lambda_{2 \mid 1}

\psi_{2 \mid 1}

\boldsymbol{\Sigma}_{2 \mid 1}

f_{\text{MGHD}}(\mathbf{x}_{i} \mid \boldsymbol{\Theta}) = \sum_{g=1}^{G} \pi_{g} f_{\text{GHD}}(\mathbf{x}_{i} \mid \lambda_{g}, \omega_{g}, \boldsymbol{\mu}_{g}, \boldsymbol{\Sigma}_{g}, \boldsymbol{\beta}_{g}),
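
A finite mixture density such as the one above is usually evaluated on the log scale to avoid underflow. The sketch below shows that pattern with a log-sum-exp step and also returns the posterior component probabilities that an E-step would use. To stay short and runnable it swaps in univariate normal components as a stand-in for the GHD component density; all names and values are illustrative.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_log_density(x, weights, component_logpdfs):
    """Return (log f(x), posterior probabilities) for a finite mixture.

    weights are the mixing proportions pi_g; component_logpdfs are
    callables returning the log density of each component at x.
    """
    terms = [math.log(w) + logpdf(x) for w, logpdf in zip(weights, component_logpdfs)]
    total = log_sum_exp(terms)
    posteriors = [math.exp(t - total) for t in terms]
    return total, posteriors

def normal_logpdf(mean, sd):
    """Stand-in component: univariate normal log density (not the GHD)."""
    def logpdf(x):
        z = (x - mean) / sd
        return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)
    return logpdf

components = [normal_logpdf(-2.0, 1.0), normal_logpdf(2.0, 1.0)]
log_f, post = mixture_log_density(0.0, [0.5, 0.5], components)
print(log_f, post)
```

Replacing `normal_logpdf` with a log-density for the GHD or skew-t component would leave the rest of the sketch unchanged.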

X_{i} ∣ w_{i g}, z_{i g} = 1

X_{i} ∣ w_{i g}, z_{i g} = 1

W_{i g} ∣ z_{i g} = 1

Z_{i}

l_{\text{c}}(\mbox{\boldmath$\Theta$})=\sum_{i=1}^{n}\sum_{g=1}^{G}z_{ig}\left[\log\pi_{g}+\log\phi(\mathbf{x}_{i}\mid\mbox{\boldmath$\mu$}_{g}+w_{ig}\mbox{\boldmath$\beta$}_{g},w_{ig}\mbox{\boldmath$\Sigma$}_{g})+\log h(w_{ig}\mid\lambda_{g},\omega_{g})\right].

\mbox{\boldmath$\Sigma$}_{g}=\begin{pmatrix}\mbox{\boldmath$\Sigma$}_{g,i}^{\text{oo}}&\mbox{\boldmath$\Sigma$}_{g,i}^{\text{om}}\\ \mbox{\boldmath$\Sigma$}_{g,i}^{\text{mo}}&\mbox{\boldmath$\Sigma$}_{g,i}^{mm}\end{pmatrix}\text{and}\quad\mbox{\boldmath$\Sigma$}_{g}^{\raisebox{0.60275pt}{$\scriptscriptstyle-1$}}=\begin{pmatrix}(\mbox{\boldmath$\Sigma$}_{g,i}^{\text{oo}})^{-1}&(\mbox{\boldmath$\Sigma$}_{g,i}^{\text{om}})^{-1}\\ (\mbox{\boldmath$\Sigma$}_{g,i}^{\text{mo}})^{-1}&(\mbox{\boldmath$\Sigma$}_{g,i}^{\text{mm}})^{-1}\end{pmatrix},

\displaystyle l_{\text{c}}(\mbox{\boldmath$\Theta$})=\sum_{i=1}^{n}\sum_{g=1}^{G}z_{ig}\big{[}\log\pi_{g}+

\mathbf{X}_{i}^{\text{o}}\sim\sum_{g=1}^{G}\pi_{g}f_{\text{GHD},p_{i}^{\text{o}}}(\lambda_{g},\omega_{g},\mbox{\boldmath$\mu$}_{g,i}^{\text{o}},\mbox{\boldmath$\Sigma$}_{g,i}^{\text{oo}},\mbox{\boldmath$\beta$}_{g,i}^{\text{o}}),

\mathbf{X}_{i}^{\text{m}}\mid\mathbf{x}_{i}^{\text{o}},z_{ig}=1\sim\text{GH}_{p-p_{i}^{\text{o}}}\left(\lambda_{g,i}^{\text{m}\mid\text{o}},\chi_{g,i}^{\text{m}\mid\text{o}},\psi_{g,i}^{\text{m}\mid\text{o}},\mbox{\boldmath$\mu$}_{g,i}^{\text{m}\mid\text{o}},\mbox{\boldmath$\Sigma$}_{g,i}^{\text{m}\mid\text{o}},\mbox{\boldmath$\beta$}_{g,i}^{\text{m}\mid\text{o}}\right),

λ_{g, i}^{m ∣ o}

ψ_{g}^{m ∣ o}

Mixtures of Generalized Hyperbolic Distributions and Mixtures of Skew-t Distributions for Model-Based Clustering with Incomplete Data

Yuhong Wei, Yang Tang and Paul D. McNicholas

(Dept. of Mathematics & Statistics, McMaster University, Hamilton, Ontario, Canada.)

Abstract

Robust clustering from incomplete data is an important topic because, in many practical situations, real data sets are heavy-tailed, asymmetric, and/or have arbitrary patterns of missing observations. Flexible methods and algorithms for model-based clustering are presented via mixture of the generalized hyperbolic distributions and its limiting case, the mixture of multivariate skew-t distributions. An analytically feasible EM algorithm is formulated for parameter estimation and imputation of missing values for mixture models employing missing at random mechanisms. The proposed methodologies are investigated through a simulation study with varying proportions of synthetic missing values and illustrated using a real dataset. Comparisons are made with those obtained from the traditional mixture of generalized hyperbolic distribution counterparts by filling in the missing data using the mean imputation method.

Keywords: Clustering; generalized hyperbolic; missing data; mixture models; skew-t.

1 Introduction

Finite mixture models are powerful and flexible tools for discovering unobserved heterogeneity in multivariate datasets. Assuming no prior knowledge of class labels, the application of finite mixture models in this way is known as model-based clustering. As McNicholas (2016a) points out, the association between mixture models and clustering goes back at least as far as Tiedeman (1955), who uses the former as a means of defining the latter. Gaussian mixture models are historically the most popular tool for model-based clustering and dominated the literature for quite some time (e.g., Celeux and Govaert, 1995; Fraley and Raftery, 1998; McLachlan et al., 2003; Bouveyron et al., 2007; McNicholas and Murphy, 2008, 2010). The multivariate $t$-distribution, being a heavy-tailed alternative to the multivariate Gaussian distribution, made (robust) mixture modelling based on mixtures of multivariate $t$-distributions the most natural extension (e.g., Peel and McLachlan, 2000; Andrews and McNicholas, 2011, 2012; Steane et al., 2012; Lin et al., 2014). In many practical situations, however, real-world datasets exhibit clusters that are not just heavy tailed but also asymmetric; furthermore, clusters can also be asymmetric yet not heavy tailed. Over the past few years, much attention has been paid to non-Gaussian approaches to model-based clustering and classification, including work on multivariate skew-$t$ distributions (e.g., Lin, 2010; Vrbik and McNicholas, 2012; Lee and McLachlan, 2014; Murray et al., 2014a, b, 2017b), shifted asymmetric Laplace distributions (Franczak et al., 2014), multivariate power exponential distributions (Dang et al., 2015), multivariate normal inverse Gaussian distributions (Karlis and Santourian, 2009; O’Hagan et al., 2016), generalized hyperbolic distributions (Browne and McNicholas, 2015; Morris and McNicholas, 2016; Tortora et al., 2016), and hidden truncation hyperbolic distributions (Murray et al., 2017a). 
A comprehensive review of model-based clustering work, up to and including some recent work on non-Gaussian mixtures, is given by McNicholas (2016b).

Unobserved or missing observations are frequently a hindrance in multivariate datasets and so developing mixture models that can accommodate incomplete data is an important issue in model-based clustering. The maximum likelihood and Bayesian approaches are two common imputation paradigms for analyzing data with incomplete observations. Herein, the missing data mechanism is assumed to be missing at random (MAR), as per Rubin (1976) and Little and Rubin (1987), meaning that the probability that a variable is missing for a particular individual depends only on the observed data and not on the value of the missing variable. Note that missing completely at random (MCAR) is a special case of MAR. Under MAR, the missing data mechanisms are ignorable for methods using the maximum likelihood approach.

The maximum likelihood approach to clustering incomplete data has been well studied and is often used, particularly for Gaussian mixture models (e.g., Ghahramani and Jordan, 1994; Lin et al., 2006; Browne et al., 2013). Wang et al. (2004) present a framework for maximum likelihood estimation using an expectation-maximization (EM) algorithm (Dempster et al., 1977) to fit a mixture of multivariate $t$-distributions with arbitrary missing data patterns, which was generalized by Lin et al. (2009) to efficient supervised learning via the parameter expanded (PX-EM) algorithm (Liu et al., 1998) through two auxiliary indicator matrices. Lin (2014) further develops a family of multivariate-$t$ mixture models with 14 eigen-decomposed scale matrices in the presence of missing data through a computationally flexible EM algorithm by incorporating two auxiliary indicator matrices. Wang and Lin (2015) use a formulation of the mixture of skew-t distributions for model-based clustering with missing data.

We consider fitting mixtures of generalized hyperbolic distributions (MGHD) and mixtures of multivariate skew-t distributions (MST) with missing information. In each case, an EM algorithm is used for parameter estimation. The chosen formulation of the (multivariate) generalized hyperbolic distribution (GHD) is that used by Browne and McNicholas (2015) and has formulations of several well-known distributions as special cases, such as the multivariate skew-$t$, normal inverse Gaussian, variance-gamma, Laplace, and Gaussian distributions (cf. McNeil et al., 2005). In addition to considering missing data, we develop families of MGHD and MST mixture models, each with 14 parsimonious eigen-decomposed scale matrices corresponding to the famous Gaussian parsimonious clustering models (GPCMs; Banfield and Raftery, 1993; Celeux and Govaert, 1995); see Table 7 (Appendix A).

2 Background

2.1 Generalized Inverse Gaussian Distribution

A random variable $W\in\operatorname{\mathbb{R}}^{+}$ is said to have a generalized inverse Gaussian (GIG) distribution, introduced by Good (1953), with parameters $\lambda$ , $\chi$ , and $\psi$ if its probability density function is given by

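$$f(w\mid\lambda,\chi,\psi)=\frac{(\psi/\chi)^{\lambda/2}\,w^{\lambda-1}}{2K_{\lambda}(\sqrt{\psi\chi})}\exp\left\{-\frac{\psi w+\chi/w}{2}\right\},\qquad w>0, \tag{1}$$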

where $\psi,\chi\in\operatorname{\mathbb{R}}^{+},\lambda\in\operatorname{\mathbb{R}}$ , and $K_{\lambda}$ is the modified Bessel function of the third kind with index $\lambda$ . Herein, we write $W\sim\text{GIG}(\lambda,\chi,\psi)$ to indicate that a random variable $W$ has the GIG density as parameterized in (1). The GIG distribution has some attractive properties (Barndorff-Nielsen and Halgreen, 1977; Blæsild, 1978; Halgreen, 1979; Jørgensen, 1982), including the tractability of the expectations:

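$$\operatorname{\mathbb{E}}(W^{\alpha})=\left(\frac{\chi}{\psi}\right)^{\alpha/2}\frac{K_{\lambda+\alpha}(\sqrt{\psi\chi})}{K_{\lambda}(\sqrt{\psi\chi})}, \tag{2}$$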

for $\alpha\in\operatorname{\mathbb{R}}$ , and

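$$\operatorname{\mathbb{E}}(\log W)=\log\sqrt{\frac{\chi}{\psi}}+\frac{1}{K_{\lambda}(\sqrt{\psi\chi})}\frac{\partial}{\partial\lambda}K_{\lambda}(\sqrt{\psi\chi}). \tag{3}$$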

Specifically, for $\alpha=1$ and $\alpha=-1$ , we have

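$$\operatorname{\mathbb{E}}(W)=\sqrt{\frac{\chi}{\psi}}\,\frac{K_{\lambda+1}(\sqrt{\psi\chi})}{K_{\lambda}(\sqrt{\psi\chi})}\qquad\text{and}\qquad\operatorname{\mathbb{E}}(1/W)=\sqrt{\frac{\psi}{\chi}}\,\frac{K_{\lambda-1}(\sqrt{\psi\chi})}{K_{\lambda}(\sqrt{\psi\chi})}.$$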

Browne and McNicholas (2015) introduce another parameterization of the GIG distribution by setting $\omega=\sqrt{\psi\chi}$ and $\eta=\sqrt{\chi/\psi}$ . Write $W\sim\mathcal{I}(\lambda,\eta,\omega)$ ; its density is given by

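$$h(w\mid\lambda,\eta,\omega)=\frac{(w/\eta)^{\lambda-1}}{2\eta K_{\lambda}(\omega)}\exp\left\{-\frac{\omega}{2}\left(\frac{w}{\eta}+\frac{\eta}{w}\right)\right\},$$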

for $w>0$ , where $\eta\in\operatorname{\mathbb{R}}^{+}$ is a scale parameter and $\omega\in\operatorname{\mathbb{R}}^{+}$ is a concentration parameter. These two parameterizations of the GIG distribution are important ingredients for building the generalized hyperbolic distribution presented later.
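The moment formula (2) and the reparameterization $\omega=\sqrt{\psi\chi}$, $\eta=\sqrt{\chi/\psi}$ can be checked numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and $K_{\lambda}$ is evaluated with a simple trapezoidal rule on the integral representation $K_{\lambda}(x)=\int_{0}^{\infty}e^{-x\cosh t}\cosh(\lambda t)\,dt$ so that only the Python standard library is needed. In practice a library routine such as `scipy.special.kv` would normally be preferred.

```python
import math


def bessel_k(lam, x, steps=4000, t_max=20.0):
    """Approximate K_lam(x) for x > 0 by the trapezoidal rule applied to
    K_lam(x) = int_0^inf exp(-x cosh t) cosh(lam t) dt (illustrative only)."""
    h = t_max / steps
    total = 0.5 * math.exp(-x)  # endpoint t = 0 carries weight 1/2
    for i in range(1, steps + 1):
        t = i * h
        expo = -x * math.cosh(t)
        if expo < -700.0:  # integrand has underflowed; the tail is negligible
            break
        total += math.exp(expo) * math.cosh(lam * t)
    return total * h


def gig_moment(alpha, lam, chi, psi):
    """E(W^alpha) for W ~ GIG(lam, chi, psi), following the moment formula (2)."""
    s = math.sqrt(chi * psi)
    return (chi / psi) ** (alpha / 2.0) * bessel_k(lam + alpha, s) / bessel_k(lam, s)


def to_omega_eta(chi, psi):
    """Map (chi, psi) to the concentration omega and scale eta."""
    return math.sqrt(psi * chi), math.sqrt(chi / psi)
```

For $\lambda=-1/2$ the GIG reduces to an inverse Gaussian distribution, for which $\operatorname{\mathbb{E}}(W)=\sqrt{\chi/\psi}=\eta$; this gives a convenient sanity check on the sketch.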

2.2 Generalized Hyperbolic Distribution

Barndorff-Nielsen (1977) introduced the generalized hyperbolic distribution (GHD) to model the distribution of sand grain sizes, and subsequent reports described its statistical properties (e.g., Barndorff-Nielsen, 1978; Barndorff-Nielsen and Blæsild, 1981). Several alternative parameterizations of the GHD have since appeared in the literature, e.g., Barndorff-Nielsen and Blæsild (1981), McNeil et al. (2005), and Browne and McNicholas (2015). Under the standard parameterization, the parameters of the mixing distribution are not invariant under affine transformations. An important innovation was made by McNeil et al. (2005), who gave a new parameterization under which a linear transformation of a GHD random vector remains in the same sub-family characterized by the parameters of the mixing distribution. However, an identifiability issue arises under this parameterization. To solve this problem, Browne and McNicholas (2015) give an alternative parameterization.

Following McNeil et al. (2005), a $p\times 1$ random vector $\mathbf{X}$ is said to follow a generalized hyperbolic distribution with index parameter $\lambda$ , concentration parameters $\chi$ and $\psi$ , location vector $\mu$ , dispersion matrix $\Sigma$ , and skewness vector $\alpha$ , denoted by $\mathbf{X}\sim\text{GH}_{p}(\lambda,\chi,\psi,\mbox{\boldmath$ \mu $},\mbox{\boldmath$ \Sigma $},\mbox{\boldmath$ \alpha $})$ , if it can be represented by

$$\mathbf{X}=\boldsymbol{\mu}+W\boldsymbol{\alpha}+\sqrt{W}\,\mathbf{U},$$

where $\mathbf{U}\bot W$ , $W\sim\text{GIG}(\lambda,\chi,\psi)$ , $\mathbf{U}\sim\mathcal{N}(\mathbf{0},\mbox{\boldmath$ \Sigma $})$ , and the symbol $\bot$ indicates independence. It follows that $\mathbf{X}\mid w\sim\mathcal{N}(\mbox{\boldmath$ \mu $}+w\mbox{\boldmath$ \alpha $},w\mbox{\boldmath$ \Sigma $})$ . So, the density of the generalized hyperbolic random vector $\mathbf{X}$ is given by

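$$f(\mathbf{x}\mid\boldsymbol{\vartheta})=\left[\frac{\chi+\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma})}{\psi+\boldsymbol{\alpha}^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\alpha}}\right]^{\frac{\lambda-p/2}{2}}\frac{(\psi/\chi)^{\lambda/2}K_{\lambda-p/2}\left(\sqrt{(\psi+\boldsymbol{\alpha}^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\alpha})(\chi+\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma}))}\right)}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}K_{\lambda}(\sqrt{\chi\psi})\exp\{-(\mathbf{x}-\boldsymbol{\mu})^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\alpha}\}}, \tag{6}$$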

where $\delta(\mathbf{x},\mbox{\boldmath$ \mu $}\mid\mbox{\boldmath$ \Sigma $})=(\mathbf{x}-\mbox{\boldmath$ \mu $})^{\intercal}\mbox{\boldmath$ \Sigma $}^{\raisebox{0.60275pt}{$ \scriptscriptstyle-1 $}}(\mathbf{x}-\mbox{\boldmath$ \mu $})$ is the squared Mahalanobis distance between $\mathbf{x}$ and $\mu$ , $K_{\lambda}$ is the modified Bessel function of the third kind with index $\lambda$ , and $\mbox{\boldmath$ \vartheta $}=(\lambda,\chi,\psi,\mbox{\boldmath$ \mu $},\mbox{\boldmath$ \Sigma $},\mbox{\boldmath$ \alpha $})$ denotes the model parameters. The mean and covariance matrix of $\mathbf{X}$ are

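$$\operatorname{\mathbb{E}}(\mathbf{X})=\boldsymbol{\mu}+\operatorname{\mathbb{E}}(W)\boldsymbol{\alpha}\qquad\text{and}\qquad\mathbb{C}\text{ov}(\mathbf{X})=\operatorname{\mathbb{E}}(W)\boldsymbol{\Sigma}+\mathbb{V}\text{ar}(W)\boldsymbol{\alpha}\boldsymbol{\alpha}^{\intercal},$$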

respectively, where $\operatorname{\mathbb{E}}(W)$ and $\mathbb{V}\text{ar}(W)$ are the mean and variance of the random variable $W$ , respectively.

Note that, in this parameterization, we need to hold $|\mbox{\boldmath$ \Sigma $}|=1$ to ensure identifiability. Using $|\mbox{\boldmath$ \Sigma $}|=1$ solves the identifiability problem but would be prohibitively restrictive for model-based clustering and classification applications. Hence, Browne and McNicholas (2015) develop a new parameterization of the GHD with index parameter $\lambda$ , concentration parameter $\omega$ , location vector $\mu$ , dispersion matrix $\Sigma$ , and skewness vector $\mbox{\boldmath$ \beta $}=\eta\mbox{\boldmath$ \alpha $}$ , denoted by $\mathbf{X}\sim\text{GHD}_{p}(\lambda,\omega,\mbox{\boldmath$ \mu $},\mbox{\boldmath$ \Sigma $},\mbox{\boldmath$ \beta $})$ . Note that $\eta=1$ . This formulation is given by

$$\mathbf{X}=\boldsymbol{\mu}+W\boldsymbol{\beta}+\sqrt{W}\,\mathbf{U}, \tag{8}$$

where $\mathbf{U}\bot W$ , $W\sim\text{GIG}(\lambda,\omega\eta,\omega/\eta)$ in the notation of (1), with $\eta=1$ , i.e., $W\sim\mathcal{I}(\lambda,\eta=1,\omega)$ , and $\mathbf{U}\sim\mathcal{N}(\mathbf{0},\mbox{\boldmath$ \Sigma $})$ . Under this parameterization, the density of the generalized hyperbolic random vector $\mathbf{X}$ is

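$$f(\mathbf{x}\mid\boldsymbol{\vartheta})=\left[\frac{\omega+\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma})}{\omega+\boldsymbol{\beta}^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}}\right]^{\frac{\lambda-p/2}{2}}\frac{K_{\lambda-p/2}\left(\sqrt{(\omega+\boldsymbol{\beta}^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta})(\omega+\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma}))}\right)}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}K_{\lambda}(\omega)\exp\{-(\mathbf{x}-\boldsymbol{\mu})^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}\}}, \tag{9}$$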

where $\delta(\mathbf{x},\mbox{\boldmath$ \mu $}\mid\mbox{\boldmath$ \Sigma $})$ and $K_{\lambda-p/2}$ are as described earlier. We use this parameterization when we describe parameter estimation (cf. Section 3).

The following results show that the generalized hyperbolic distribution is closed under affine transformation, conditioning, and the formation of marginal distributions, which is useful for developing the methods presented later. Suppose that $\mathbf{X}$ is a $p$-dimensional random vector having a generalized hyperbolic distribution as in (9), i.e., $\mathbf{X}\sim\text{GHD}_{p}(\lambda,\omega,\mbox{\boldmath$ \mu $},\mbox{\boldmath$ \Sigma $},\mbox{\boldmath$ \beta $})$ . Assume that $\mathbf{X}$ is partitioned as $\mathbf{X}=(\mathbf{X}_{1}^{\intercal},\mathbf{X}_{2}^{\intercal})^{\intercal}$ , where $\mathbf{X}_{1}$ takes values in $\operatorname{\mathbb{R}}^{d_{1}}$ and $\mathbf{X}_{2}$ in $\operatorname{\mathbb{R}}^{d_{2}}=\operatorname{\mathbb{R}}^{p-d_{1}}$ , with

[TABLE]

where $\mathbf{X}$ , $\mu$ , and $\beta$ have similar partitions. Furthermore, $\mbox{\boldmath$ \Sigma $}_{11}$ is $d_{1}\times d_{1}$ and $\mbox{\boldmath$ \Sigma $}_{22}$ is $d_{2}\times d_{2}$ .

Proposition 1.

Affine transformation of the generalized hyperbolic distribution. If $\mathbf{X}\sim\text{GHD}_{p}(\lambda,\omega,\mbox{\boldmath$ \mu $},\mbox{\boldmath$ \Sigma $},\mbox{\boldmath$ \beta $})$ and $\mathbf{Y}=\mathbf{B}\mathbf{X}+\mathbf{b}$ , where $\mathbf{B}\in\operatorname{\mathbb{R}}^{k\times p}$ and $\mathbf{b}\in\operatorname{\mathbb{R}}^{k}$ , then

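$$\mathbf{Y}\sim\text{GHD}_{k}(\lambda,\omega,\mathbf{B}\boldsymbol{\mu}+\mathbf{b},\mathbf{B}\boldsymbol{\Sigma}\mathbf{B}^{\intercal},\mathbf{B}\boldsymbol{\beta}).$$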

Proof.

The result follows by substituting (8) into $\mathbf{Y}=\mathbf{B}\mathbf{X}+\mathbf{b}$ . ∎

Proposition 2.

The marginal distribution of $\mathbf{X}_{1}$ is a generalized hyperbolic distribution as in (9) with index parameter $\lambda$ , concentration parameter $\omega$ , location vector $\mbox{\boldmath$ \mu $}_{1}$ , dispersion matrix $\mbox{\boldmath$ \Sigma $}_{11}$ , and skewness vector $\mbox{\boldmath$ \beta $}_{1}$ , i.e., $\mathbf{X}_{1}\sim\text{GHD}_{d_{1}}(\lambda,\omega,\mbox{\boldmath$ \mu $}_{1},\mbox{\boldmath$ \Sigma $}_{11},\mbox{\boldmath$ \beta $}_{1})$ .

Proof.

The result follows by applying Proposition 1 and choosing $\mathbf{B}=[\mathbf{I}_{d_{1}},\mathbf{0}]$ and $\mathbf{b}=\mathbf{0}$ . The parameters $\lambda,\omega$ inherited from the mixing distribution $W\sim\mathcal{I}(\lambda,\eta=1,\omega)$ remain the same under the affine transformation and marginal distribution. ∎

Proposition 3.

The conditional distribution of $\mathbf{X}_{2}$ given $\mathbf{X}_{1}=\mathbf{x}_{1}$ is a generalized hyperbolic distribution as in (6), i.e., $\mathbf{X}_{2}\mid\mathbf{X}_{1}=\mathbf{x}_{1}\sim\text{GH}_{d_{2}}(\lambda_{2\mid 1},\chi_{2\mid 1},\psi_{2\mid 1},\mbox{\boldmath$ \mu $}_{2\mid 1},\mbox{\boldmath$ \Sigma $}_{2\mid 1},\mbox{\boldmath$ \beta $}_{2\mid 1})$ , where

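$$\begin{aligned}\lambda_{2\mid 1}&=\lambda-d_{1}/2, & \chi_{2\mid 1}&=\omega+(\mathbf{x}_{1}-\boldsymbol{\mu}_{1})^{\intercal}\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_{1}-\boldsymbol{\mu}_{1}), & \psi_{2\mid 1}&=\omega+\boldsymbol{\beta}_{1}^{\intercal}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\beta}_{1},\\ \boldsymbol{\mu}_{2\mid 1}&=\boldsymbol{\mu}_{2}+\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_{1}-\boldsymbol{\mu}_{1}), & \boldsymbol{\Sigma}_{2\mid 1}&=\boldsymbol{\Sigma}_{22}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}, & \boldsymbol{\beta}_{2\mid 1}&=\boldsymbol{\beta}_{2}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\beta}_{1}.\end{aligned}$$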

The proof of Proposition 3 is given in Appendix B.
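To make Proposition 3 concrete, the conditional parameters can be computed directly from the partitioned quantities. The sketch below is illustrative only. It assumes the parameters take the form obtained by applying Gaussian conditioning to $\mathbf{X}\mid w\sim\mathcal{N}(\mbox{\boldmath$ \mu $}+w\mbox{\boldmath$ \beta $},w\mbox{\boldmath$ \Sigma $})$ and then collecting the terms in $w$: $\lambda_{2\mid 1}=\lambda-d_{1}/2$, $\chi_{2\mid 1}=\omega+\delta(\mathbf{x}_{1},\mbox{\boldmath$ \mu $}_{1}\mid\mbox{\boldmath$ \Sigma $}_{11})$, $\psi_{2\mid 1}=\omega+\mbox{\boldmath$ \beta $}_{1}^{\intercal}\mbox{\boldmath$ \Sigma $}_{11}^{-1}\mbox{\boldmath$ \beta $}_{1}$, together with the usual regression-type location, scale, and skewness terms. The function name and return layout are hypothetical.

```python
import numpy as np


def ghd_conditional_params(lam, omega, mu, Sigma, beta, x1):
    """Parameters of X2 | X1 = x1 when X ~ GHD_p(lam, omega, mu, Sigma, beta).

    The first d1 = len(x1) coordinates are treated as X1. The formulas are
    an assumption stated in the accompanying text, derived by conditioning
    the normal variance-mean mixture representation.
    """
    mu = np.asarray(mu, dtype=float)
    beta = np.asarray(beta, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    x1 = np.atleast_1d(np.asarray(x1, dtype=float))
    d1 = x1.size

    mu1, mu2 = mu[:d1], mu[d1:]
    b1, b2 = beta[:d1], beta[d1:]
    S11, S12 = Sigma[:d1, :d1], Sigma[:d1, d1:]
    S22 = Sigma[d1:, d1:]

    # A = Sigma_21 Sigma_11^{-1}, using the symmetry of Sigma
    A = np.linalg.solve(S11, S12).T
    r = x1 - mu1

    return {
        "lam": lam - d1 / 2.0,
        "chi": omega + r @ np.linalg.solve(S11, r),
        "psi": omega + b1 @ np.linalg.solve(S11, b1),
        "mu": mu2 + A @ r,
        "Sigma": S22 - A @ S12,
        "beta": b2 - A @ b1,
    }
```

For example, with $p=2$, $\mbox{\boldmath$ \mu $}=\mathbf{0}$, unit variances, correlation $0.5$, $\mbox{\boldmath$ \beta $}=(1,1)^{\intercal}$, $\lambda=1$, $\omega=2$, and $x_{1}=2$, the sketch returns $\lambda_{2\mid 1}=0.5$, $\chi_{2\mid 1}=6$, $\psi_{2\mid 1}=3$, conditional location $1$, conditional scale $0.75$, and conditional skewness $0.5$.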

2.3 The Multivariate Skew- $t$ Distribution

There are several alternative formulations of multivariate skew-t distributions appearing in the literature (e.g., Branco and Dey, 2001; Sahu, Dey, and Branco, 2003; Murray, Browne, and McNicholas, 2014a; Lee and McLachlan, 2014). Lin and Lin (2011) develop a mixture of multivariate skew-t distributions for incomplete data using the formulation of Sahu et al. (2003). Herein, the formulation of the multivariate skew-$t$ distribution arising from the generalized hyperbolic distribution is used. This formulation of the multivariate skew-$t$ distribution has been used by Murray et al. (2014a) to develop a mixture of skew-$t$ factor analyzers model.

Following McNeil et al. (2005), a $p\times 1$ random vector $\mathbf{X}$ is said to follow a multivariate skew-t distribution with degrees of freedom parameter $v$ , location vector $\mu$ , dispersion matrix $\Sigma$ , and skewness vector $\beta$ , denoted by $\mathbf{X}\sim\text{ST}_{p}(v,\mbox{\boldmath$ \mu $},\mbox{\boldmath$ \Sigma $},\mbox{\boldmath$ \beta $})$ , if it can be represented by

$$\mathbf{X}=\boldsymbol{\mu}+W\boldsymbol{\beta}+\sqrt{W}\,\mathbf{U}, \tag{11}$$

where $\mathbf{U}\bot W$ , $W\sim\text{IG}(v/2,v/2)$ , $\mathbf{U}\sim\mathcal{N}(\mathbf{0},\mbox{\boldmath$ \Sigma $})$ , with $\text{IG}(\cdot)$ denoting the inverse Gamma distribution. It follows that $\mathbf{X}\mid w\sim\mathcal{N}(\mbox{\boldmath$ \mu $}+w\mbox{\boldmath$ \beta $},w\mbox{\boldmath$ \Sigma $})$ and the pdf of the multivariate skew-t random vector $\mathbf{X}$ is given by

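$$f(\mathbf{x}\mid\boldsymbol{\vartheta})=\left[\frac{v+\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma})}{\boldsymbol{\beta}^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}}\right]^{\frac{-v-p}{4}}\frac{v^{v/2}K_{(-v-p)/2}\left(\sqrt{(v+\delta(\mathbf{x},\boldsymbol{\mu}\mid\boldsymbol{\Sigma}))(\boldsymbol{\beta}^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta})}\right)}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}\Gamma(v/2)2^{v/2-1}\exp\{-(\mathbf{x}-\boldsymbol{\mu})^{\intercal}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}\}}. \tag{12}$$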
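The stochastic representation of the skew-t distribution, $\mathbf{X}=\mbox{\boldmath$ \mu $}+W\mbox{\boldmath$ \beta $}+\sqrt{W}\mathbf{U}$ with $W\sim\text{IG}(v/2,v/2)$ and $\mathbf{U}\sim\mathcal{N}(\mathbf{0},\mbox{\boldmath$ \Sigma $})$, translates directly into a sampler. The following is a minimal sketch in Python with NumPy rather than the R code used for the simulations in this paper; the function name and arguments are hypothetical. An inverse Gamma draw is obtained as the reciprocal of a Gamma draw with shape $v/2$ and rate $v/2$.

```python
import numpy as np


def sample_skew_t(n, v, mu, Sigma, beta, rng=None):
    """Draw n vectors via X = mu + W*beta + sqrt(W)*U (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    beta = np.asarray(beta, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)

    # W ~ IG(v/2, v/2): reciprocal of Gamma(shape = v/2, rate = v/2);
    # NumPy parameterizes the Gamma by scale = 1/rate = 2/v.
    w = 1.0 / rng.gamma(shape=v / 2.0, scale=2.0 / v, size=n)

    # U ~ N(0, Sigma), drawn independently of W
    u = rng.multivariate_normal(np.zeros(mu.size), Sigma, size=n)

    return mu + w[:, None] * beta + np.sqrt(w)[:, None] * u
```

Since $\operatorname{\mathbb{E}}(W)=v/(v-2)$ for $v>2$, the sample mean should be close to $\mbox{\boldmath$ \mu $}+\{v/(v-2)\}\mbox{\boldmath$ \beta $}$ for large $n$, which gives a quick check of the sampler.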

This formulation of the multivariate skew-t distribution can be obtained as a special case of the generalized hyperbolic distribution by setting $\lambda=-v/2$ and $\chi=v$ , and letting $\psi\to 0$ . Similarly, this formulation of the multivariate skew-t distribution is closed under affine transformation, conditioning, and the formation of marginal distributions, which is useful for developing the methods presented later. Suppose that $\mathbf{X}$ is a $p$-dimensional random vector having the multivariate skew-t distribution as in (12), i.e., $\mathbf{X}\sim\text{ST}_{p}(v,\mbox{\boldmath$ \mu $},\mbox{\boldmath$ \Sigma $},\mbox{\boldmath$ \beta $})$ . Assume that $\mathbf{X}$ is partitioned as $\mathbf{X}=(\mathbf{X}_{1}^{\intercal},\mathbf{X}_{2}^{\intercal})^{\intercal}$ , where $\mathbf{X}_{1}$ takes values in $\operatorname{\mathbb{R}}^{d_{1}}$ and $\mathbf{X}_{2}$ in $\operatorname{\mathbb{R}}^{d_{2}}=\operatorname{\mathbb{R}}^{p-d_{1}}$ , with

[TABLE]

where $\mathbf{X}$ , $\mu$ , and $\beta$ have similar partitions. Furthermore, $\mbox{\boldmath$ \Sigma $}_{11}$ is $d_{1}\times d_{1}$ and $\mbox{\boldmath$ \Sigma $}_{22}$ is $d_{2}\times d_{2}$ .

Proposition 4.

Affine transformation of the multivariate skew-t distribution. If $\mathbf{X}\sim\text{ST}_{p}(v,\mbox{\boldmath$ \mu $},\mbox{\boldmath$ \Sigma $},\mbox{\boldmath$ \beta $})$ and $\mathbf{Y}=\mathbf{B}\mathbf{X}+\mathbf{b}$ , where $\mathbf{B}\in\operatorname{\mathbb{R}}^{k\times p}$ and $\mathbf{b}\in\operatorname{\mathbb{R}}^{k}$ , then

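$$\mathbf{Y}\sim\text{ST}_{k}(v,\mathbf{B}\boldsymbol{\mu}+\mathbf{b},\mathbf{B}\boldsymbol{\Sigma}\mathbf{B}^{\intercal},\mathbf{B}\boldsymbol{\beta}).$$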

Proof.

The proof follows easily by substituting (11) into $\mathbf{Y}=\mathbf{B}\mathbf{X}+\mathbf{b}$ . ∎

Proposition 5.

The marginal distribution of $\mathbf{X}_{1}$ is a multivariate skew-t distribution as in (12) with degree of freedom parameter $v$ , location vector $\mbox{\boldmath$ \mu $}_{1}$ , dispersion matrix $\mbox{\boldmath$ \Sigma $}_{11}$ , and skewness vector $\mbox{\boldmath$ \beta $}_{1}$ , i.e., $\mathbf{X}_{1}\sim\text{ST}_{d_{1}}(v,\mbox{\boldmath$ \mu $}_{1},\mbox{\boldmath$ \Sigma $}_{11},\mbox{\boldmath$ \beta $}_{1})$ .

Proof.

The proof follows easily by applying Proposition 4 and choosing $\mathbf{B}=[\mathbf{I}_{d_{1}},\mathbf{0}]$ and $\mathbf{b}=\mathbf{0}$ . The degrees of freedom parameter $v$ inherited from the mixing distribution $W\sim\text{IG}(v/2,v/2)$ remains invariant under affine transformation and marginal distribution. ∎

Proposition 6.

The conditional distribution of $\mathbf{X}_{2}$ given $\mathbf{X}_{1}=\mathbf{x}_{1}$ is a generalized hyperbolic distribution as in (6), i.e., $\mathbf{X}_{2}\mid\mathbf{x}_{1}\sim\text{GH}_{d_{2}}(\lambda_{2\mid 1},\chi_{2\mid 1},\psi_{2\mid 1},\mbox{\boldmath$ \mu $}_{2\mid 1},\mbox{\boldmath$ \Sigma $}_{2\mid 1},\mbox{\boldmath$ \beta $}_{2\mid 1})$ , where

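$$\begin{aligned}\lambda_{2\mid 1}&=-(v+d_{1})/2, & \chi_{2\mid 1}&=v+(\mathbf{x}_{1}-\boldsymbol{\mu}_{1})^{\intercal}\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_{1}-\boldsymbol{\mu}_{1}), & \psi_{2\mid 1}&=\boldsymbol{\beta}_{1}^{\intercal}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\beta}_{1},\\ \boldsymbol{\mu}_{2\mid 1}&=\boldsymbol{\mu}_{2}+\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_{1}-\boldsymbol{\mu}_{1}), & \boldsymbol{\Sigma}_{2\mid 1}&=\boldsymbol{\Sigma}_{22}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}, & \boldsymbol{\beta}_{2\mid 1}&=\boldsymbol{\beta}_{2}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\beta}_{1}.\end{aligned}$$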

The proof of Proposition 6 is similar to that of Proposition 3 and hence is omitted. Similar results for Propositions 4, 5, and 6 have been obtained by Arellano-Valle and Genton (2010).

3 MGHD with Incomplete Data

Let $\mathbf{X}_{1},\ldots,\mathbf{X}_{n}$ be $p$ -dimensional random variables arising from a heterogeneous population with $G$ disjoint MGHD subpopulations. That is, each $\mathbf{X}_{i}$ has the density

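$$f_{\text{MGHD}}(\mathbf{x}_{i}\mid\boldsymbol{\Theta})=\sum_{g=1}^{G}\pi_{g}f_{\text{GHD}}(\mathbf{x}_{i}\mid\lambda_{g},\omega_{g},\boldsymbol{\mu}_{g},\boldsymbol{\Sigma}_{g},\boldsymbol{\beta}_{g}), \tag{14}$$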

where $\pi_{g}>0$ , such that $\sum_{g=1}^{G}\pi_{g}=1$ , are the mixing proportions, $\Theta$ denotes the model parameters, and $f_{\text{GHD}}(\mathbf{X}_{i}\mid\lambda_{g},\omega_{g},\mbox{\boldmath$ \mu $}_{g},\mbox{\boldmath$ \Sigma $}_{g},\mbox{\boldmath$ \beta $}_{g})$ is the GHD density defined in (9).

To apply the MGHD model (14) in the clustering paradigm, introduce $\mathbf{z}_{i}=(z_{i1},\ldots,z_{iG})^{\intercal}$ , where $z_{ig}=1$ if observation $i$ is in component $g$ and $z_{ig}=0$ otherwise. The corresponding random variable $\mathbf{Z}_{i}\sim\mathcal{M}(1;\pi_{1},\ldots,\pi_{G})$ , i.e., $\mathbf{Z}_{i}$ follows a multinomial distribution with one trial and cell probabilities $\pi_{1},\ldots,\pi_{G}$ .

A three-level hierarchical representation of the MGHD model (14) can be expressed by

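$$\begin{aligned}\mathbf{X}_{i}\mid w_{ig},z_{ig}=1&\sim\mathcal{N}(\boldsymbol{\mu}_{g}+w_{ig}\boldsymbol{\beta}_{g},w_{ig}\boldsymbol{\Sigma}_{g}),\\ W_{ig}\mid z_{ig}=1&\sim\mathcal{I}(\lambda_{g},\eta=1,\omega_{g}),\\ \mathbf{Z}_{i}&\sim\mathcal{M}(1;\pi_{1},\ldots,\pi_{G}).\end{aligned} \tag{15}$$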

The complete data consist of the observed $\mathbf{x}_{i}$ together with the missing group membership $z_{ig}$ and the latent $w_{ig}$ , for $i=1,\ldots,n$ and $g=1,\ldots,G$ , and the complete-data log-likelihood is given by

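$$l_{\text{c}}(\boldsymbol{\Theta})=\sum_{i=1}^{n}\sum_{g=1}^{G}z_{ig}\left[\log\pi_{g}+\log\phi(\mathbf{x}_{i}\mid\boldsymbol{\mu}_{g}+w_{ig}\boldsymbol{\beta}_{g},w_{ig}\boldsymbol{\Sigma}_{g})+\log h(w_{ig}\mid\lambda_{g},\omega_{g})\right]. \tag{16}$$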

Browne and McNicholas (2015) present an EM algorithm for parameter estimation with the MGHD when there are no missing data in $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ . We are interested in parameter estimation for the MGHD model (14) when $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$ are partially observed with arbitrary missing patterns. The missing data mechanism is assumed to be MAR. Assume now that we split $\mathbf{x}_{i}$ into two components, $\mathbf{x}_{i}^{\text{o}}$ and $\mathbf{x}_{i}^{\text{m}}$ , that denote the observed and missing components of $\mathbf{x}_{i}$ , respectively. In general, each data vector $\mathbf{x}_{i}$ may have a different pattern of missing features, i.e., $\mathbf{x}_{i}=(\mathbf{x}_{i}^{\text{o}_{i}\intercal},\mathbf{x}_{i}^{\text{m}_{i}\intercal})^{\intercal}$ , but the subscripts on the superscripts are dropped for the sake of clarity.

For each $\mathbf{x}_{i}=(\mathbf{x}_{i}^{\text{o}\intercal},\mathbf{x}_{i}^{\text{m}\intercal})^{\intercal}$ , partition the location vector as $\mbox{\boldmath$ \mu $}_{g}=(\mbox{\boldmath$ \mu $}_{g,i}^{\text{o}\intercal},\mbox{\boldmath$ \mu $}_{g,i}^{\text{m}\intercal})^{\intercal}$ , where $\mbox{\boldmath$ \mu $}_{g,i}^{\text{o}}$ and $\mbox{\boldmath$ \mu $}_{g,i}^{\text{m}}$ denote the sub-vectors of $\mbox{\boldmath$ \mu $}_{g}$ matching the observed and missing components of $\mathbf{x}_{i}$ , respectively. Similarly, the skewness vector is partitioned as $\mbox{\boldmath$ \beta $}_{g}=(\mbox{\boldmath$ \beta $}_{g,i}^{\text{o}\intercal},\mbox{\boldmath$ \beta $}_{g,i}^{\text{m}\intercal})^{\intercal}$ and the scale matrix $\mbox{\boldmath$ \Sigma $}_{g}$ is partitioned as

[TABLE]

corresponding to $\mathbf{x}_{i}=(\mathbf{x}_{i}^{\text{o}\intercal},\mathbf{x}_{i}^{\text{m}\intercal})^{\intercal}$ . As a result, in addition to the observed $\mathbf{x}_{i}^{\text{o}}$ , the missing group membership $z_{ig}$ , and the latent variable $w_{ig}$ , the complete data also include the missing data $\mathbf{x}_{i}^{\text{m}}$ . In the framework of the EM algorithm, the missing data $\mathbf{x}_{i}^{\text{m}}$ are considered to be random variables that are updated in each iteration. Hence, the complete-data log-likelihood (16) is rewritten as

[TABLE]

Given (15), we establish the following:

•

The marginal distribution of $\mathbf{X}_{i}^{\text{o}}$ is

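$$\mathbf{X}_{i}^{\text{o}}\sim\sum_{g=1}^{G}\pi_{g}f_{\text{GHD},p_{i}^{\text{o}}}(\lambda_{g},\omega_{g},\boldsymbol{\mu}_{g,i}^{\text{o}},\boldsymbol{\Sigma}_{g,i}^{\text{oo}},\boldsymbol{\beta}_{g,i}^{\text{o}}),$$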

where $p_{i}^{\text{o}}$ is the dimension of the observed component $\mathbf{x}_{i}^{\text{o}}$ ; strictly, this should be written $p_{i}^{\text{o}_{i}}$ , but the notation is simplified here.

•

The conditional distribution of $\mathbf{X}_{i}^{\text{m}}$ given $\mathbf{x}_{i}^{\text{o}}$ and $z_{ig}=1$ , according to Proposition 3, is

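$$\mathbf{X}_{i}^{\text{m}}\mid\mathbf{x}_{i}^{\text{o}},z_{ig}=1\sim\text{GH}_{p-p_{i}^{\text{o}}}\left(\lambda_{g,i}^{\text{m}\mid\text{o}},\chi_{g,i}^{\text{m}\mid\text{o}},\psi_{g,i}^{\text{m}\mid\text{o}},\boldsymbol{\mu}_{g,i}^{\text{m}\mid\text{o}},\boldsymbol{\Sigma}_{g,i}^{\text{m}\mid\text{o}},\boldsymbol{\beta}_{g,i}^{\text{m}\mid\text{o}}\right),$$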

where

[TABLE]

•

The conditional distribution of $\mathbf{X}_{i}^{\text{m}}$ given $\mathbf{x}_{i}^{\text{o}},w_{ig}$ , and $z_{ig}=1$ is

[TABLE]

•

The conditional distribution of $W_{ig}$ given $\mathbf{x}_{i}^{\text{o}}$ and $z_{ig}=1$ is

[TABLE]
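The role of the conditional distribution of $\mathbf{X}_{i}^{\text{m}}$ given $\mathbf{x}_{i}^{\text{o}}$, $w_{ig}$, and $z_{ig}=1$ in imputation can be illustrated with a short sketch. It assumes, from the hierarchical representation, that conditional on $w$ the full vector is Gaussian with mean $\mbox{\boldmath$ \mu $}_{g}+w\mbox{\boldmath$ \beta $}_{g}$ and covariance $w\mbox{\boldmath$ \Sigma $}_{g}$, so that the conditional mean of the missing block follows from standard Gaussian conditioning (the factor $w$ cancels in the regression coefficient). The function name is hypothetical, missing entries are assumed to be coded as NaN, and at least one entry is assumed to be observed. This is not the full E-step, which additionally averages over $W_{ig}$.

```python
import numpy as np


def conditional_mean_missing(x, w, mu, Sigma, beta):
    """E(X^m | x^o, w) for X | w ~ N(mu + w*beta, w*Sigma); NaN marks missing."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    beta = np.asarray(beta, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)

    m = np.isnan(x)  # missing coordinates
    o = ~m           # observed coordinates

    # Residual of the observed block about its conditional-on-w mean
    resid = x[o] - mu[o] - w * beta[o]

    # Sigma^{mo} (Sigma^{oo})^{-1} resid
    coef = np.linalg.solve(Sigma[np.ix_(o, o)], resid)
    return mu[m] + w * beta[m] + Sigma[np.ix_(m, o)] @ coef
```

With $\mbox{\boldmath$ \beta $}_{g}=\mathbf{0}$ and $w=1$ this reduces to the familiar Gaussian regression imputation, which is a useful check on the indexing.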

After a little algebra, the complete-data log-likelihood function is

[TABLE]

On the $k$th iteration of the E-step, the expected value of the complete-data log-likelihood is computed given the observed data $\mathbf{x}_{1}^{\text{o}},\ldots,\mathbf{x}_{n}^{\text{o}}$ and the current parameter updates $\mbox{\boldmath$ \Theta $}^{(k)}$ . That is, we need to compute $\operatorname{\mathbb{E}}(Z_{ig}\mid\mathbf{x}_{i}^{\text{o}};\mbox{\boldmath$ \Theta $}^{(k)})$ , $\operatorname{\mathbb{E}}(W_{ig}\mid\mathbf{x}_{i}^{\text{o}},z_{ig}=1;\mbox{\boldmath$ \Theta $}^{(k)})$ , $\operatorname{\mathbb{E}}(\log W_{ig}\mid\mathbf{x}_{i}^{\text{o}},z_{ig}=1;\mbox{\boldmath$ \Theta $}^{(k)})$ , $\operatorname{\mathbb{E}}({1}/{W_{ig}}\mid\mathbf{x}_{i}^{\text{o}},z_{ig}=1;\mbox{\boldmath$ \Theta $}^{(k)})$ , $\operatorname{\mathbb{E}}(\mathbf{X}_{i}^{\text{m}}\mid\mathbf{x}_{i}^{\text{o}},z_{ig}=1,w_{ig};\mbox{\boldmath$ \Theta $}^{(k)})$ , and $\operatorname{\mathbb{E}}(\mathbf{X}_{i}^{\text{m}}(\mathbf{X}_{i}^{\text{m}})^{\intercal}\mid\mathbf{x}_{i}^{\text{o}},z_{ig}=1,w_{ig};\mbox{\boldmath$ \Theta $}^{(k)})$ .

First, let $\hat{z}_{ig}^{(k)}$ denote the a posteriori probability that the $i$th observation belongs to the $g$th component of the mixture, based on the observed data:

[TABLE]

Given (2), (3), and (20), we have the following expectations for the latent variable $W_{ig}$ :

[TABLE]

For convenience, we use the following notation analogous to Browne and McNicholas (2015): $n_{g}^{(k)}=\sum_{i=1}^{n}\hat{z}_{ig}^{(k)}$ , $\bar{a}_{g}^{(k)}=1/n_{g}^{(k)}\sum_{i=1}^{n}\hat{z}_{ig}^{(k)}a_{ig}^{(k)}$ , $\bar{b}_{g}^{(k)}=1/n_{g}^{(k)}\sum_{i=1}^{n}\hat{z}_{ig}^{(k)}b_{ig}^{(k)}$ , and $\bar{c}_{g}^{(k)}=1/n_{g}^{(k)}\sum_{i=1}^{n}\hat{z}_{ig}^{(k)}c_{ig}^{(k)}$ . For the actual missing data $\mathbf{X}^{\text{m}}$ , we will also need the following expectations:

[TABLE]

On the $k$ -th iteration of the M-step, the expected value of the complete data log-likelihood is maximized to get the updates for the parameter estimates as follows:

[TABLE]

where

[TABLE]

and

[TABLE]

Finally, the updates $\lambda_{g}^{(k+1)}$ and $\omega_{g}^{(k+1)}$ are obtained by maximizing the function

[TABLE]

and the associated updates are

[TABLE]

A family of MGHD models with 14 parsimonious eigen-decomposed scale matrices, corresponding to the famous GPCM family of models, is proposed (see Appendix A for a brief discussion, including nomenclature). Details on the MST with incomplete data are analogous to those for the MGHD with incomplete data and are provided in Appendix D.

4 Notes on Implementation

4.1 Initial values

It is well known that the EM algorithm can be heavily dependent on the initial values; indeed, good initial values of parameter estimates may speed up convergence. In this study, the following procedure for automatically generating initial values is used, unless otherwise specified.

•

Fill in the missing values based on the mean imputation method.

•

Perform $k$ -means clustering and use the resulting clustering membership to initialize the a posteriori probability $\hat{z}_{ig}^{(0)}$ . Accordingly, the initial values for the model parameters are then given by:

[TABLE]

•

Set the skewness parameter $\mbox{\boldmath$ \beta $}_{g}^{(0)}$ to be close to zero for symmetric data.

•

When applicable, we set $\omega_{g}^{(0)}=1$ and $\lambda_{g}^{(0)}=-1/2$ for the concentration and index parameters, respectively, which corresponds to a special case of the GHD (the normal-inverse Gaussian distribution), or set $v_{g}^{(0)}=50$ to reflect a near-normality assumption.
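The first two initialization steps can be sketched as follows. This is an illustrative sketch in Python with NumPy, not the authors' implementation: the function names are hypothetical, missing values are assumed to be coded as NaN, and the hard labels are assumed to come from any $k$-means routine run on the mean-imputed data. The initial values computed here (cluster proportions, cluster means, and cluster covariances with divisor $n_{g}$) are an assumption about the natural moment-based choices and may differ in detail from the formulas used in the paper.

```python
import numpy as np


def mean_impute(X):
    """Replace each NaN by the mean of the observed values in its column."""
    X = np.array(X, dtype=float)  # np.array copies, so the input is untouched
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X


def init_from_labels(X_filled, labels, G):
    """Initial proportions, locations, and scale matrices from hard labels."""
    X_filled = np.asarray(X_filled, dtype=float)
    labels = np.asarray(labels)
    n = X_filled.shape[0]
    pis, mus, Sigmas = [], [], []
    for g in range(G):
        Xg = X_filled[labels == g]
        pis.append(Xg.shape[0] / n)
        mus.append(Xg.mean(axis=0))
        # bias=True uses divisor n_g rather than n_g - 1
        Sigmas.append(np.cov(Xg, rowvar=False, bias=True))
    return np.array(pis), np.array(mus), np.array(Sigmas)
```

Each cluster is assumed to contain at least two observations and the data to have at least two columns, so that the covariance estimates are well-defined matrices.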

To enhance the computational efficiency of the EM algorithm, we update the parameters per missing pattern instead of per individual. We suggest rearranging $\mathbf{X}$ according to unique patterns of the missing data. The procedure can be implemented as follows:

•

Build a binary $n$ by $p$ indicator matrix $\mathbf{R}=[r_{ij}]$ , with each entry $r_{ij}=1$ if $\mathbf{X}_{ij}$ is missing and $r_{ij}=0$ otherwise;

•

Find all unique missing patterns; and

•

Update parameters per missing pattern instead of per individual.
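The pattern-wise bookkeeping described above can be sketched in a few lines. This is a minimal illustration assuming that missing values are coded as NaN; the function name is hypothetical. It builds the indicator matrix $\mathbf{R}$, finds the unique missing patterns, and returns, for each pattern, the indices of the observations that share it, so that quantities such as $(\mbox{\boldmath$ \Sigma $}_{g,i}^{\text{oo}})^{-1}$ need only be computed once per pattern.

```python
import numpy as np


def group_by_missing_pattern(X):
    """Return (R, patterns, groups) for a data matrix with NaN-coded gaps.

    R[i, j] = 1 if X[i, j] is missing and 0 otherwise; patterns holds the
    unique rows of R; groups[k] lists the row indices sharing patterns[k].
    """
    R = np.isnan(np.asarray(X, dtype=float)).astype(int)
    patterns, inverse = np.unique(R, axis=0, return_inverse=True)
    inverse = np.ravel(inverse)  # guard against shape differences across NumPy versions
    groups = [np.flatnonzero(inverse == k) for k in range(len(patterns))]
    return R, patterns, groups
```

With $n$ observations and far fewer distinct patterns, looping over `groups` instead of over individuals avoids repeating the same matrix partitioning and inversion for every observation.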

4.2 Model Selection and Stopping Criterion

In general, the number of mixture components $G$ is not known a priori, and needs to be estimated from the data. Two widely used model selection techniques are the Bayesian information criterion (BIC; Schwarz, 1978) and the integrated completed likelihood (ICL; Biernacki et al., 2000), which are given respectively by

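$$\text{BIC}=2l(\hat{\boldsymbol{\Theta}})-\rho\log n\qquad\text{and}\qquad\text{ICL}\approx\text{BIC}+2\sum_{i=1}^{n}\sum_{g=1}^{G}\text{MAP}\left\{\hat{z}_{ig}\right\}\log\hat{z}_{ig},$$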

where $l(\hat{\mbox{\boldmath$ \Theta $}})$ is the maximized log-likelihood evaluated at the maximum likelihood estimate $\hat{\mbox{\boldmath$ \Theta $}}$ , $\rho$ is the number of free parameters, $n$ is the number of observations, $\hat{z}_{ig}$ represents the estimated a posteriori probability that $\mathbf{x}_{i}$ arises from the $g$th component, and MAP denotes the maximum a posteriori classification, i.e., $\text{MAP}\left\{\hat{z}_{ig}\right\}=1$ if $\text{max}_{h}\left\{\hat{z}_{ih}\right\}$ occurs at component $h=g$ and $\text{MAP}\left\{\hat{z}_{ig}\right\}=0$ otherwise. Under this sign convention, larger BIC or ICL values indicate a better fitted model.
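A short sketch of the two criteria follows. It assumes the "larger is better" forms $\text{BIC}=2l(\hat{\mbox{\boldmath$ \Theta $}})-\rho\log n$ and $\text{ICL}\approx\text{BIC}+2\sum_{i}\sum_{g}\text{MAP}\{\hat{z}_{ig}\}\log\hat{z}_{ig}$, which are consistent with the convention stated above; the function names are hypothetical and only the Python standard library is used.

```python
import math


def bic(loglik, n_params, n_obs):
    """BIC in the 'larger is better' form: 2*loglik - rho*log(n)."""
    return 2.0 * loglik - n_params * math.log(n_obs)


def icl(loglik, n_params, n_obs, z_hat):
    """ICL as BIC plus twice the log posterior of each MAP assignment.

    z_hat is a sequence of rows, one per observation, holding the estimated
    posterior probabilities of membership in each of the G components.
    """
    entropy_term = 0.0
    for row in z_hat:
        # MAP{z_ig} selects the largest posterior probability in the row
        entropy_term += math.log(max(row))
    return bic(loglik, n_params, n_obs) + 2.0 * entropy_term
```

When every observation is classified with certainty the extra term vanishes and the ICL coincides with the BIC; uncertain classifications can only lower the ICL.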

The EM algorithm can be stopped after a maximum number of iterations, or when the Aitken stopping criterion (Aitken, 1926) is satisfied. The Aitken acceleration at iteration $k$ is

$$a^{(k)}=\frac{l^{(k+1)}-l^{(k)}}{l^{(k)}-l^{(k-1)}},$$

where $l^{(k)}$ is the log-likelihood at iteration $k$ . This yields an asymptotic estimate of the log-likelihood at iteration $k+1$ :

$$l_{\infty}^{(k+1)}=l^{(k)}+\frac{l^{(k+1)}-l^{(k)}}{1-a^{(k)}}$$

(Böhning et al., 1994; Lindsay, 1995), and the EM algorithm is stopped when $l_{\infty}^{(k+1)}-l^{(k)}<\epsilon$ , provided this difference is positive (McNicholas et al., 2010).
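The stopping rule can be sketched as below. The sketch assumes the usual forms $a^{(k)}=(l^{(k+1)}-l^{(k)})/(l^{(k)}-l^{(k-1)})$ and $l_{\infty}^{(k+1)}=l^{(k)}+(l^{(k+1)}-l^{(k)})/(1-a^{(k)})$; the function names, the default tolerance, and the handling of degenerate cases (a flat log-likelihood or $a^{(k)}=1$) are illustrative choices rather than part of the paper.

```python
import math


def aitken_asymptotic_loglik(l_prev, l_curr, l_next):
    """Asymptotic log-likelihood estimate from three successive values
    l^(k-1), l^(k), l^(k+1)."""
    denom = l_curr - l_prev
    if denom == 0.0:
        return l_next  # log-likelihood is flat; treat l_next as the limit
    a = (l_next - l_curr) / denom  # Aitken acceleration a^(k)
    if a == 1.0:
        return math.inf  # no geometric convergence detected yet
    return l_curr + (l_next - l_curr) / (1.0 - a)


def aitken_should_stop(l_prev, l_curr, l_next, eps=1e-5):
    """Stop when 0 <= l_inf^(k+1) - l^(k) < eps."""
    diff = aitken_asymptotic_loglik(l_prev, l_curr, l_next) - l_curr
    return 0.0 <= diff < eps
```

If the log-likelihood converges geometrically, say $l^{(k)}=L-c\,r^{k}$ with $0<r<1$, then $a^{(k)}=r$ and the asymptotic estimate recovers $L$ exactly, which is why the criterion can signal convergence before successive log-likelihood values themselves become indistinguishable.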

5 Numerical Examples

Studies based on both simulated and real datasets are used to evaluate the clustering performance of the proposed approach. Our proposed family of models for incomplete data is compared to the multivariate $t$ mixture with ML estimation in the presence of missing values (Mt). BIC is used to select the model; models with higher values of BIC are preferable. The adjusted Rand index (ARI; Hubert and Arabie, 1985) is used to compare predicted classifications to true classes when applicable. The Rand index (Rand, 1971) is the ratio of pairwise agreements to total pairs, and the ARI corrects the Rand index to account for chance agreement. The ARI has expected value 0 under random classification and takes the value 1 for perfect class agreement. A detailed discussion of the ARI, and arguments in favour of its use, are given by Steinley (2004).
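For reference, the ARI can be computed from the contingency table of the two partitions. The following standard-library sketch implements the Hubert and Arabie (1985) form; the function name is hypothetical, and at least two observations are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same n >= 2 items."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table n_ij
    rows = Counter(labels_true)                     # row sums a_i
    cols = Counter(labels_pred)                     # column sums b_j

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (sum_cells - expected) / (maximum - expected)
```

The index is invariant to relabelling of the clusters, so a predicted partition that matches the true classes up to a permutation of labels scores 1, while partitions that agree less than expected by chance score below 0.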

5.1 Simulation Studies

The simulated datasets are each two-component mixtures: a mixture of Gaussian distributions (GMM) with a general VEE covariance structure, a mixture of skew-t distributions (MST) with a diagonal VEI covariance structure, and a mixture of generalized hyperbolic distributions (MGHD) with a general VEE covariance structure. The GMM datasets are generated via the R function rmvnorm from the mvtnorm package for R, and the MST and MGHD datasets are generated using R code based on the stochastic representations in (11) and (8), respectively.

For each mixture component, $n_{g}=200$ two-dimensional vectors $\mathbf{x}_{i}$ are generated. The presumed parameters of $\mbox{\boldmath$ \Sigma $}_{g}$ ( $g=1,2$ ) for the VEE and VEI models are the same as those considered in Celeux and Govaert (1995) and Lin (2014). Each mixture component is centred on a different point, giving both well-separated and overlapping mixtures. Where applicable, the skewness parameters are $\mbox{\boldmath$ \beta $}_{1}=(1,1)^{\intercal}$ and $\mbox{\boldmath$ \beta $}_{2}=(-1,-1)^{\intercal}$ , the degrees of freedom for the MST are $v_{1}=7$ and $v_{2}=5$ , and the other parameters for the MGHD are $\omega_{1}=\omega_{2}=6$ , $\lambda_{1}=-{1}/{2}$ , and $\lambda_{2}=1$ .
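As an illustration of how such data can be generated, the sketch below draws from a two-component skew-t mixture with diagonal scale matrices. It assumes the normal variance-mean representation $\mathbf{X}=\boldsymbol{\mu}+W\boldsymbol{\beta}+\sqrt{W}\mathbf{V}$ with $W\sim\text{IG}(v/2,v/2)$ and $\mathbf{V}\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})$, which is consistent with the hierarchy in Appendix D. The skewness and degrees of freedom values are those stated above, whereas the locations and scale values are placeholders, not the settings of Table 1.

```python
import math
import random

def sample_skew_t_diag(n, mu, beta, sigma_diag, v, rng):
    """Draw n points via X = mu + W*beta + sqrt(W)*V (assumed representation).

    W ~ inverse-gamma(v/2, v/2) and V ~ N(0, diag(sigma_diag)).
    """
    draws = []
    for _ in range(n):
        # gammavariate takes (shape, scale); a rate of v/2 is a scale of 2/v,
        # and the reciprocal of that gamma draw is inverse-gamma(v/2, v/2).
        w = 1.0 / rng.gammavariate(v / 2.0, 2.0 / v)
        x = [m + w * b + math.sqrt(w * s) * rng.gauss(0.0, 1.0)
             for m, b, s in zip(mu, beta, sigma_diag)]
        draws.append(x)
    return draws

rng = random.Random(1)
# Placeholder locations and scales (hypothetical values).
comp1 = sample_skew_t_diag(200, [0.0, 0.0], [1.0, 1.0], [1.0, 1.0], 7.0, rng)
comp2 = sample_skew_t_diag(200, [3.0, 3.0], [-1.0, -1.0], [2.0, 2.0], 5.0, rng)
data = comp1 + comp2
labels = [1] * 200 + [2] * 200
```

A non-diagonal scale matrix would additionally require a Cholesky factor to generate the correlated Gaussian vector.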

The datasets considered in the simulation studies are summarized in Table 1 and examples are plotted in Figure 1. The datasets are overlapping, making this a relatively difficult clustering scenario even when the datasets are complete.

Artificial missing datasets are simulated by removing $n\times r$ elements from each column of the simulated samples through two different MAR patterns and the MCAR mechanism under three missing rates — $r=0.05$ (low), $r=0.15$ (moderate), and $r=0.3$ (high) — while maintaining the condition that each observation has at least one observed attribute. For the MAR mechanism, data points in the first column are sorted in descending order. Column $2$ is then divided into four equal blocks and, for each block, a specified number of elements (see Table 2) are removed at random. When $p=1$ , the second column is used.
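The MCAR deletion step can be sketched as follows. This is one possible reading of the procedure, in which $n\times r$ entries (rounded to the nearest integer) are removed from each column, sampling only among rows that would still retain an observed attribute; the function name and the use of `None` as the missing-value marker are illustrative assumptions.

```python
import random

def delete_mcar(data, r, rng):
    """Remove round(n*r) entries per column, keeping >= 1 observed value per row."""
    n, p = len(data), len(data[0])
    k = round(n * r)
    out = [list(row) for row in data]  # copy so the input is left untouched
    for j in range(p):
        # A row is eligible if entry j is observed and at least one other
        # entry in that row would remain observed after the deletion.
        eligible = [i for i in range(n)
                    if out[i][j] is not None
                    and sum(v is not None for v in out[i]) > 1]
        for i in rng.sample(eligible, k):  # ValueError if too few eligible rows
            out[i][j] = None
    return out

rng = random.Random(0)
complete = [[float(i), float(-i)] for i in range(400)]
incomplete = delete_mcar(complete, 0.30, rng)
print(sum(row[0] is None for row in incomplete),
      sum(row[1] is None for row in incomplete))
```

Strictly speaking, the constraint that every row keeps an observed attribute makes the deletions in later columns depend on earlier ones, so the mechanism is MCAR only approximately.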

First, we examine the ability of our proposed model to recover underlying parameters when the number of components and the covariance structure are correctly specified. These experiments comprise 100 replications per combination of missing pattern and missingness rate. The means of the parameter estimates with their associated standard deviations and bias are summarized in Tables 8 and 9 (Appendix E). The means of most parameter estimates are close to the true values with small standard deviations when $r=0.05$ . The standard deviations increase as the missing rate increases, while at the same time, the average ARI slightly decreases. The means of estimated $\lambda_{1}$ and $\lambda_{2}$ in Sim1 are quite far from the true value because we obtain those estimates using an approximation to the Bessel function. In addition, there is no significant difference among the three missing patterns. Therefore, we use MCAR in the rest of the data examples.

As another illustration, we explore the flexibility of the MGHD model for incomplete data and study the performance of the BIC for model selection. As mentioned in the introduction, the GHD is a flexible distribution with skewness, concentration, and index parameters. We compute the average ARI for the parsimonious MGHD and MST models introduced here as well as Mt under the circumstances of unknown clusters ( $G=1,\ldots,4$ ). The detailed results are summarized in Table 10 (Appendix E). From Table 10, we observe the following:

• The average ARI decreases as the missing rate rises. As expected, overlapping components typically have lower ARI than the well-separated components. In addition, the average ARI considerably decreases when the missing rate reaches 30% $(r=0.30)$ for Sim2, Sim4 and Sim6.

• Our proposed parsimonious MGHD models for incomplete data perform significantly better than Mt. The family of MGHD models generally yields much higher ARI than its competitor parsimonious MST for incomplete data when the datasets are generated from a generalized hyperbolic distribution.

• The BIC always finds the true number of clusters when using the MGHD for incomplete data, but tends to overestimate the number of clusters when using the MST or Mt for incomplete data for datasets with overlapping mixtures.

• The BIC prefers MGHD over Mt in Sim5 and Sim6, where the data are generated from GMMs. We find that the samples are not necessarily symmetric, particularly with missing values. Figures 2 and 3 show exemplar scatter plots for data from Sim5 and Sim6 for $r=0.10$ . The Mt tends to overestimate the number of clusters and hence has a lower average BIC.

5.2 Breast Cancer Diagnostic Dataset

The breast cancer diagnostic data consists of ten real-valued features on 569 cases of breast tumours – 357 benign and 212 malignant. The mean, standard error, and “worst” or largest of these features were computed for each image, resulting in 30 attributes. This dataset is complete, so for illustration purposes we consider levels of missing data $r=0.05$ and $r=0.15$ by deleting observations through an MCAR mechanism while maintaining the condition that each observation has at least one observed attribute. The dataset is scaled prior to analysis.

The family of MGHD, MST and Mt models were fitted to these data for $G=1,\ldots,4$ . We randomly assign each observation to one of the G groups and start with 20 random initializations of the algorithm, selecting the model with the maximum likelihood values. The key statistics of the best models for MGHD, MST and Mt are shown in Table 3. The results of this analysis show that the parsimonious MGHD outperforms the other models for all levels of missing data.

5.3 Pima Indians Diabetes Data

Data on the diabetes status of 768 patients are obtained from the UCI Machine Learning data repository. The data include information on eight attributes; the number of times pregnant is treated as a continuous variable because it ranges from 0 to 14. These data are a popular benchmark for clustering with truly missing values, as 376 of the observations have at least one attribute missing. The data are overlapping and the numerous missing observations make clustering difficult. The detailed description of the attributes and their associated missing rates are summarized in Table 4. The dataset features 268 patients with a diabetes diagnosis and 500 without, and these are treated as two clusters. Again, this dataset is scaled prior to the analysis.

Because there are two known clusters, we fix $G=2$ and compare the BIC and ICL values for 14 covariance structures of our proposed parsimonious MGHD and MST models. The clustering results are summarized in Table 5. Lin (2014) fits the Mt model, which matches the true cluster labels with 66.7% accuracy. In comparison, our proposed parsimonious MGHD model for incomplete data gives a higher accuracy rate (69.11%).
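Because cluster labels are arbitrary, an accuracy rate of this kind requires matching predicted clusters to true classes. The sketch below assumes accuracy is the proportion of correctly labelled observations under the best one-to-one relabeling, which is a common convention but is not spelled out in the text; the function name is hypothetical, and the exhaustive search is only practical for a small number of clusters such as $G=2$.

```python
from itertools import permutations

def matched_accuracy(labels_true, labels_pred):
    """Accuracy under the best one-to-one mapping of clusters to classes.

    Assumes there are no more predicted clusters than true classes.
    """
    true_ids = sorted(set(labels_true))
    pred_ids = sorted(set(labels_pred))
    best_hits = 0
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))  # predicted cluster -> true class
        hits = sum(mapping[p] == t for t, p in zip(labels_true, labels_pred))
        best_hits = max(best_hits, hits)
    return best_hits / len(labels_true)

# Toy usage: three of four observations agree after swapping the labels.
print(matched_accuracy([0, 0, 0, 1], [1, 1, 0, 0]))
```

The ARI reported elsewhere in this section does not require this matching step, since it is computed from pair counts.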

The best model is the two-component MGHD model with the EVE structure for $\mbox{\boldmath$ \Sigma $}_{g}$ . Group 1 consists mainly of the non-diabetic patients and Group 2 consists mainly of the diabetic patients. We then fit the best model with 100 random initializations; Table 6 shows the key parameter estimates for this model as well as the corresponding standard errors. The standard errors of the model parameters have been calculated using the bootstrap method described in Efron and Tibshirani (1986). The estimates for $\mbox{\boldmath$ \mu $}_{g}+\mbox{\boldmath$ \beta $}_{g}$ are quite similar to the parameter estimates presented in Wang and Lin (2015). The estimates for the skewness parameters indicate the presence of skewness in most of the variables.
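The idea behind bootstrap standard errors can be sketched generically as follows. The example uses nonparametric resampling with a sample mean standing in for the estimator; in the present setting the estimator would be a full EM fit of the selected model to each resample, and whether a parametric or nonparametric bootstrap was used is not stated here. All names are hypothetical.

```python
import random
import statistics

def bootstrap_se(data, estimator, n_boot, rng):
    """Standard deviation of an estimator over bootstrap resamples."""
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Resample n observations with replacement from the original data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
se = bootstrap_se(sample, statistics.mean, 500, rng)
print(se)  # should be near 1/sqrt(200) for standard normal data
```

For mixture models, each bootstrap fit should also be checked for label switching before the component-wise estimates are pooled.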

6 Discussion

Approaches for clustering incomplete data where clusters may be heavy tailed and/or asymmetric are introduced, based on MGHD and MST. These approaches were further extended to parsimonious families of MGHD and MST models via eigen-decomposition of the component scale matrices. The BIC and ICL were used for model selection. It is well known that the BIC can tend to overestimate the number of clusters in practice; however, the results presented herein show that this overestimation can sometimes be mitigated via a more flexible component density such as the MGHD. An EM algorithm was developed to fit the MGHD and MST models to incomplete data, and later implemented in R. It is worth mentioning that our approaches are also applicable in situations with no missing data, and so we have MGHD and MST analogues of the models of Celeux and Govaert (1995). Our MGHD and MST models were applied to real and simulated heterogeneous datasets for clustering in the presence of missing values, and the PMGHD family performed favourably when compared to the PMST family as well as the MGHD and MST approaches with mean imputation.

In the present work, the missing data mechanism is assumed to be MAR. Future work will focus on a departure from this assumption. As a starting point, the behaviour of parameter estimates for models considered herein when we depart from the MAR assumption will be studied. Although we demonstrated the PMGHD and PMST approaches for clustering, they can also be applied to semi-supervised classification, discriminant analysis, and density estimation; furthermore, they could be used within the fractionally-supervised paradigm (Vrbik and McNicholas, 2015). In addition, Bayesian analysis via a Gibbs sampler is another popular approach to handle missing data in multivariate datasets (e.g., Lin et al., 2009), so a fully Bayesian treatment will be considered as an alternative to the EM algorithm for parameter estimation. Finally, it will also be interesting to generalize the existing approaches to develop mixtures of generalized hyperbolic factor analyzers (Tortora et al., 2016), mixtures with hypercube contours (Franczak et al., 2015), and mixtures of multiple scaled generalized hyperbolic distributions (Tortora et al., 2017) for incomplete data.

Acknowledgements

This work was supported by an Ontario Graduate Scholarship (Wei), an Early Researcher Award from the Government of Ontario (McNicholas), and the Canada Research Chairs program (McNicholas).

Appendix A GPCM Family

Banfield and Raftery (1993) consider an eigen-decomposition of the component scale matrices (which are equivalent to the component covariance matrices for Gaussian mixtures), i.e.,

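$$\mbox{\boldmath$ \Sigma $}_{g}=\lambda_{g}\mbox{\boldmath$ \Gamma $}_{g}\mbox{\boldmath$ \Delta $}_{g}\mbox{\boldmath$ \Gamma $}_{g}^{\intercal},\tag{22}$$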

where $\lambda_{g}=\left|\mbox{\boldmath$ \Sigma $}_{g}\right|^{1/p}$ , $\mbox{\boldmath$ \Gamma $}_{g}$ is the matrix of eigenvectors of $\mbox{\boldmath$ \Sigma $}_{g}$ , and $\mbox{\boldmath$ \Delta $}_{g}$ is a diagonal matrix, such that $\left|\mbox{\boldmath$ \Delta $}_{g}\right|=1$ , containing the normalized eigenvalues of $\mbox{\boldmath$ \Sigma $}_{g}$ in decreasing order. Note that the columns of $\mbox{\boldmath$ \Gamma $}_{g}$ are ordered to correspond to the elements of $\mbox{\boldmath$ \Delta $}_{g}$ . As Banfield and Raftery (1993) point out, the constituent elements of the decomposition in (22) can be viewed in the context of the geometry of the component, where $\lambda_{g}$ represents the volume in $p$ -space, $\mbox{\boldmath$ \Delta $}_{g}$ the shape, and $\mbox{\boldmath$ \Gamma $}_{g}$ the orientation. By imposing constraints on the elements of the decomposed covariance structure in (22), Celeux and Govaert (1995) introduce a family of GPCMs (Table 7).
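The geometric reading of the decomposition can be checked numerically in two dimensions. The sketch below builds $\boldsymbol{\Sigma}=\lambda\boldsymbol{\Gamma}\boldsymbol{\Delta}\boldsymbol{\Gamma}^{\intercal}$ from a volume, a shape ratio, and a rotation angle, and confirms that $|\boldsymbol{\Sigma}|^{1/p}$ recovers the volume. The parameterization of $\boldsymbol{\Delta}$ by a single ratio and of $\boldsymbol{\Gamma}$ by an angle is an illustrative assumption that applies only to $p=2$, and the numerical values are arbitrary.

```python
import math

def scale_matrix_2d(volume, shape_ratio, angle):
    """Build Sigma = lambda * Gamma * Delta * Gamma^T for p = 2.

    Delta = diag(d1, d2) with d1 * d2 = 1 and d1 / d2 = shape_ratio;
    Gamma is the rotation matrix [[cos, -sin], [sin, cos]].
    """
    d1 = math.sqrt(shape_ratio)
    d2 = 1.0 / d1
    c, s = math.cos(angle), math.sin(angle)
    s11 = volume * (d1 * c * c + d2 * s * s)
    s12 = volume * (d1 - d2) * c * s
    s22 = volume * (d1 * s * s + d2 * c * c)
    return [[s11, s12], [s12, s22]]

# VEE-type structure: volumes differ, shape and orientation are shared.
sigma_1 = scale_matrix_2d(1.0, 4.0, math.pi / 6)
sigma_2 = scale_matrix_2d(2.0, 4.0, math.pi / 6)

det_2 = sigma_2[0][0] * sigma_2[1][1] - sigma_2[0][1] ** 2
print(math.sqrt(det_2))  # |Sigma|^(1/p) with p = 2 recovers the volume
```

Constraining the volume, shape, and orientation terms to be equal, variable, or identity across components in this way yields the members of the family in Table 7.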

Appendix B Some Useful Matrix Computations

We here present some useful matrix computation results that are employed in the derivation of the conditional pdf of a partitioned generalized hyperbolic and multivariate skew-t random vector $\mathbf{X}$ in Propositions 3 and 6.

Consider a partitioned random vector $\mathbf{X}$ of $p$ -dimension that follows the pdf as in (9) with

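$$\mathbf{X}=\begin{pmatrix}\mathbf{X}_{1}\\ \mathbf{X}_{2}\end{pmatrix},\quad\mbox{\boldmath$ \mu $}=\begin{pmatrix}\mbox{\boldmath$ \mu $}_{1}\\ \mbox{\boldmath$ \mu $}_{2}\end{pmatrix},\quad\mbox{\boldmath$ \beta $}=\begin{pmatrix}\mbox{\boldmath$ \beta $}_{1}\\ \mbox{\boldmath$ \beta $}_{2}\end{pmatrix},\quad\mbox{\boldmath$ \Sigma $}=\begin{pmatrix}\mbox{\boldmath$ \Sigma $}_{11}&\mbox{\boldmath$ \Sigma $}_{12}\\ \mbox{\boldmath$ \Sigma $}_{12}^{\intercal}&\mbox{\boldmath$ \Sigma $}_{22}\end{pmatrix},\tag{23}$$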

where $\mathbf{X}_{1}$ and $\mathbf{X}_{2}$ have dimensions $d_{1}$ and $d_{2}=p-d_{1}$ , respectively. The mean, skewness, and dispersion matrix are composed of blocks whose dimensions conform to the partition of $\mathbf{X}$ . It is sometimes more convenient to work with the inverse of the dispersion matrix, $\mbox{\boldmath$ \Sigma $}^{\raisebox{0.54248pt}{$ \scriptscriptstyle-1 $}}$ :

[TABLE]

Furthermore, we have for the determinant of $\Sigma$ :

[TABLE]

Appendix C Outline of Proof of Proposition 3

Here, we derive the conditional density of $\mathbf{X}_{2}$ given that $\mathbf{X}_{1}=\mathbf{x}_{1}$ when $\mathbf{X}_{1}$ and $\mathbf{X}_{2}$ jointly follow a generalized hyperbolic distribution, i.e., $\mathbf{X}\sim\text{GHD}_{p}(\lambda,\omega,\mbox{\boldmath$ \mu $},\mbox{\boldmath$ \Sigma $},\mbox{\boldmath$ \beta $})$ with the partition in Appendix B. Although basic probability theory indicates that the conditional pdf is the ratio of the joint and marginal pdfs, the expression takes a very complicated form. The results from Appendix B are heavily used in the course of the derivations. The conditional density is given by

[TABLE]

where we combine (9) and Proposition 2. For the moment, we focus on the linear form and quadratic form in which $\mathbf{x}$ enters the pdf in (9). Inserting the partition of $\mathbf{X},\mbox{\boldmath$ \mu $},\mbox{\boldmath$ \beta $}$ , and $\Sigma$ in (23) and the inverse of dispersion matrix $\mbox{\boldmath$ \Sigma $}^{\raisebox{0.54248pt}{$ \scriptscriptstyle-1 $}}$ (24) into the quadratic form yields

[TABLE]

where $\mbox{\boldmath$ \mu $}_{2\mid 1}=\mbox{\boldmath$ \mu $}_{2}+\mbox{\boldmath$ \Sigma $}_{12}^{\intercal}\mbox{\boldmath$ \Sigma $}_{11}^{\raisebox{0.54248pt}{$ \scriptscriptstyle-1 $}}(\mathbf{x}_{1}-\mbox{\boldmath$ \mu $}_{1})$ and $\mbox{\boldmath$ \Sigma $}_{2\mid 1}=(\mbox{\boldmath$ \Sigma $}_{22}-\mbox{\boldmath$ \Sigma $}_{12}^{\intercal}\mbox{\boldmath$ \Sigma $}_{11}^{\raisebox{0.54248pt}{$ \scriptscriptstyle-1 $}}\mbox{\boldmath$ \Sigma $}_{12})^{\raisebox{0.54248pt}{$ \scriptscriptstyle-1 $}}$ .

Similarly, inserting into the linear form, following the same algebra as above, yields

[TABLE]

where $\mbox{\boldmath$ \mu $}_{2\mid 1}$ and $\mbox{\boldmath$ \Sigma $}_{2\mid 1}$ are as described above, and $\mbox{\boldmath$ \beta $}_{2\mid 1}=\mbox{\boldmath$ \beta $}_{2}-\mbox{\boldmath$ \Sigma $}_{12}^{\intercal}\mbox{\boldmath$ \Sigma $}_{11}^{\raisebox{0.54248pt}{$ \scriptscriptstyle-1 $}}\mbox{\boldmath$ \beta $}_{1}$ .

Furthermore, investigating the term $\mbox{\boldmath$ \beta $}^{\intercal}\mbox{\boldmath$ \Sigma $}^{\raisebox{0.54248pt}{$ \scriptscriptstyle-1 $}}\mbox{\boldmath$ \beta $}$ , we obtain

[TABLE]

Finally, we substitute (25), (26), (27), and (28), and $p=d_{1}+d_{2}$ into the conditional density, and after some simple linear algebra, we obtain

[TABLE]

Setting $\lambda_{2\mid 1}=\lambda-\frac{d_{1}}{2}$ , $\chi_{2\mid 1}=\omega+\delta(\mathbf{x}_{1},\mbox{\boldmath$ \mu $}_{1}\mid\mbox{\boldmath$ \Sigma $}_{11})$ , and $\psi_{2\mid 1}=\omega+\mbox{\boldmath$ \beta $}_{1}^{\intercal}\mbox{\boldmath$ \Sigma $}_{11}^{\raisebox{0.54248pt}{$ \scriptscriptstyle-1 $}}\mbox{\boldmath$ \beta $}_{1}$ , we obtain

[TABLE]

Comparison with (6) reveals that this is a generalized hyperbolic distribution in the parameterization of McNeil et al. (2005) with

[TABLE]
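To illustrate how the conditional parameters are evaluated, the following sketch treats the bivariate case with $d_{1}=d_{2}=1$, where all blocks are scalars. It uses the expressions for $\lambda_{2\mid 1}$, $\chi_{2\mid 1}$, $\psi_{2\mid 1}$, $\boldsymbol{\mu}_{2\mid 1}$, and $\boldsymbol{\beta}_{2\mid 1}$ given in this appendix and assumes that $\delta(\mathbf{x}_{1},\boldsymbol{\mu}_{1}\mid\boldsymbol{\Sigma}_{11})$ denotes the squared Mahalanobis distance; the function name and inputs are hypothetical.

```python
def conditional_ghd_params_2d(x1, lam, omega, mu, beta, sigma):
    """Conditional GHD parameters of X2 given X1 = x1 for p = 2, d1 = 1."""
    mu1, mu2 = mu
    b1, b2 = beta
    s11, s12 = sigma[0][0], sigma[0][1]
    d1 = 1
    # Assumption: delta(x1, mu1 | Sigma11) is the squared Mahalanobis distance.
    delta = (x1 - mu1) ** 2 / s11
    return {
        "lambda": lam - d1 / 2.0,
        "chi": omega + delta,
        "psi": omega + b1 ** 2 / s11,
        "mu": mu2 + s12 / s11 * (x1 - mu1),
        "beta": b2 - s12 / s11 * b1,
    }

params = conditional_ghd_params_2d(
    x1=1.5, lam=-0.5, omega=6.0,
    mu=(0.0, 0.0), beta=(1.0, 1.0),
    sigma=[[1.0, 0.5], [0.5, 2.0]],
)
print(params)
```

In the EM algorithm for incomplete data, the observed coordinates play the role of $\mathbf{x}_{1}$ and the missing coordinates play the role of $\mathbf{X}_{2}$, so these quantities are recomputed for each observation's missingness pattern.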

Appendix D MST with Incomplete Data

Analogous to the MGHD model (14), the MST model takes the density

[TABLE]

where $\mbox{\boldmath$ \Theta $}=(\mathbf{\pi},\textbf{v}_{g},\mbox{\boldmath$ \mu $}_{g},\mbox{\boldmath$ \Sigma $}_{g},\mbox{\boldmath$ \beta $}_{g})$ with $\textbf{v}_{g}=(v_{1},\ldots,v_{g})$ and $\pi_{g},\mbox{\boldmath$ \mu $}_{g},\mbox{\boldmath$ \Sigma $}_{g}$ , and $\mbox{\boldmath$ \beta $}_{g}$ are as defined above. By introducing the group membership variables $\mathbf{Z}_{i}\sim\mathcal{M}(1;\pi_{1},\ldots,\pi_{G})$ , convenient three-layer hierarchical representations are given by

[TABLE]

Assume that the matrix $\mathbf{X}=(\mathbf{X}^{\text{o}\intercal},\mathbf{X}^{\text{m}\intercal})^{\intercal}$ contains missing data. For each $\mathbf{x}_{i}=(\mathbf{x}_{i}^{\text{o}\intercal},\mathbf{x}_{i}^{\text{m}\intercal})^{\intercal}$ , we write $\mbox{\boldmath$ \mu $}_{g}=(\mbox{\boldmath$ \mu $}_{g,i}^{\text{o}\intercal},\mbox{\boldmath$ \mu $}_{g,i}^{\text{m}\intercal})^{\intercal}$ , $\mbox{\boldmath$ \beta $}_{g}=(\mbox{\boldmath$ \beta $}_{g,i}^{\text{o}\intercal},\mbox{\boldmath$ \beta $}_{g,i}^{\text{m}\intercal})^{\intercal}$ , and finally the $g$ th dispersion matrix $\mbox{\boldmath$ \Sigma $}_{g}$ is partitioned as in (17). Hence, based on (30), we have the following conditional distributions:

• The marginal distribution of $\mathbf{X}_{i}^{\text{o}}$ is

[TABLE]

where $p_{i}^{\text{o}}$ is the dimension of the observed component $\mathbf{x}_{i}^{\text{o}}$ ; strictly, it should be written as $p_{i}^{\text{o}_{i}}$ , but the notation is simplified here.

• The conditional distribution of $\mathbf{X}_{i}^{\text{m}}$ given $\mathbf{x}_{i}^{\text{o}}$ and $z_{ig}=1$ , according to Proposition 6, is

[TABLE]

where

[TABLE]

• The conditional distribution of $\mathbf{X}_{i}^{\text{m}}$ given $\mathbf{x}_{i}^{\text{o}},w_{ig}$ , and $z_{ig}=1$ is

[TABLE]

• The conditional distribution of $W_{i}$ given $\mathbf{x}_{i}^{\text{o}}$ and $z_{ig}=1$ is

[TABLE]

As in the case of the MGHD model with incomplete data, the complete data consist of the observed $\mathbf{x}_{i}^{\text{o}}$ , the missing group membership $z_{ig}$ , the latent $w_{ig}$ , as well as the actual missing data $\mathbf{x}_{i}^{\text{m}}$ , for $i=1,\ldots,n$ and $g=1,\ldots,G$ . Again, the complete data log-likelihood function is given by

[TABLE]

Furthermore, one can simplify (34) to

[TABLE]

On the $k$ th iteration of the E-step, the expected value of the complete-data log-likelihood is computed given the observed data $\mathbf{X}^{\text{o}}$ and the current parameter updates $\mbox{\boldmath$ \Theta $}^{(k)}$ . Denote by $\tau_{ig}^{(k)}$ the a posteriori probability that the $i$ th observation belongs to the $g$ th component of the mixture. Specifically, it can be calculated as

[TABLE]

Given the observed data $\mathbf{x}^{\text{o}}$ , the current parameter updates $\mbox{\boldmath$ \Theta $}^{(k)}$ , and conditional distributions (31) and (33), taking expectations for (35) leads to the following expectation updates in the E-step:

[TABLE]

For convenience, let $n_{g}^{(k)}=\sum_{i=1}^{n}\tau_{ig}^{(k)}$ , $\bar{A}_{g}^{(k)}=1/n_{g}^{(k)}\sum_{i=1}^{n}\tau_{ig}^{(k)}A_{ig}^{(k)}$ , $\bar{B}_{g}^{(k)}=1/n_{g}^{(k)}\sum_{i=1}^{n}\tau_{ig}^{(k)}B_{ig}^{(k)}$ , and $\bar{C}_{g}^{(k)}=1/n_{g}^{(k)}\sum_{i=1}^{n}\tau_{ig}^{(k)}C_{ig}^{(k)}$ . On the $k$ th iteration of the M-step, we get updates for the parameter estimates of the mixture as follows:

[TABLE]

where

[TABLE]

where

[TABLE]

Finally, the update for the degrees of freedom parameter $v_{g}$ does not exist in closed form. The update $v_{g}^{(k+1)}$ is the solution of

[TABLE]

where $\varphi(\cdot)$ is the digamma function.

Appendix E Results from Simulation Studies

The results from the simulation studies are summarized in Tables 8, 9 and 10.


Bibliography

Aitken, A. C. (1926). On Bernoulli’s numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh 46, 289–305.
Andrews, J. L. and P. D. McNicholas (2011). Extending mixtures of multivariate t-factor analyzers. Statistics and Computing 21 (3), 361–373.
Andrews, J. L. and P. D. McNicholas (2012). Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Statistics and Computing 22 (5), 1021–1029.
Arellano-Valle, R. and M. G. Genton (2010). Multivariate extended skew-t distributions and related families. Metron 68 (3), 201–234.
Banfield, J. D. and A. E. Raftery (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49 (3), 803–821.
Barndorff-Nielsen, O. (1977). Exponentially decreasing distributions for the logarithm of particle size. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 353 (1674), 401–419.
Barndorff-Nielsen, O. (1978). Hyperbolic distributions and distributions on hyperbolae. Scandinavian Journal of Statistics 5 (3), 151–157.
Barndorff-Nielsen, O. and P. Blæsild (1981). Hyperbolic distributions and ramifications: Contributions to theory and application. In C. Taillie, G. Patil, and B. Baldessari (Eds.), Statistical Distributions in Scientific Work, Volume 79 of NATO Advanced Study Institutes Series, pp. 19–44. Springer Netherlands.
