Robust mixture modelling using sub-Gaussian stable distribution

Mahdi Teimouri; Saeid Rezakhah; Adel Mohammdpour

arXiv:1701.06749·stat.ML·January 25, 2017

TL;DR

This paper introduces an EM algorithm for mixture models based on sub-Gaussian stable distributions, demonstrating robustness and effectiveness in modeling heavy-tailed data across various datasets.

Contribution

It presents a novel EM algorithm for parameter estimation in mixtures of sub-Gaussian stable distributions, a computationally tractable subclass of stable distributions.

Findings

01. The proposed model shows robustness in heavy-tailed data scenarios.

02. It outperforms some existing mixture models in simulations and real data.

03. The approach is effective for synthetic, simulated, and real datasets.

Abstract

Heavy-tailed distributions are widely used in robust mixture modelling because of their thick tails. As a computationally tractable subclass of the stable distributions, the sub-Gaussian $\alpha$-stable distribution has received much interest in the literature. Here, we introduce a type of expectation-maximization algorithm that estimates the parameters of a mixture of sub-Gaussian stable distributions. A comparative study, in the presence of some well-known mixture models, is performed to show the robustness and performance of the mixture of sub-Gaussian $\alpha$-stable distributions for modelling simulated, synthetic, and real data.

Equations

$$\boldsymbol{Y}_{i}\overset{L}{=}\boldsymbol{\mu}+\sqrt{P_{i}}\,\boldsymbol{G}_{i},\quad i=1,\dots,n,$$

$$g(\boldsymbol{y}_{i}|\theta)=\sum_{j=1}^{K}w_{j}f(\boldsymbol{y}_{i}|\alpha_{j},\Sigma_{j},\boldsymbol{\mu}_{j}),\quad i=1,\dots,n,$$

$$\hat{\theta}=\frac{1}{N-N_{0}}\sum_{t=N_{0}+1}^{N}\theta^{(t)},$$

$$\boldsymbol{Y}_{i}|P_{i}=p_{i},Z_{ij}=1\sim N(\boldsymbol{\mu}_{j},p_{i}\Sigma_{j}),$$

$$P_{i}|Z_{ij}=1\sim S\Bigl(\alpha_{j}/2,1,\bigl(\cos(\pi\alpha_{j}/4)\bigr)^{2/\alpha_{j}},0\Bigr),$$

$$f_{c}(\boldsymbol{y}_{1},\dots,\boldsymbol{y}_{n},\boldsymbol{z}_{1},\dots,\boldsymbol{z}_{n},p_{1},\dots,p_{n}|\theta)=\prod_{i=1}^{n}f(\boldsymbol{y}_{i},p_{i},\boldsymbol{z}_{i}|\theta),$$

$$f(\boldsymbol{y}_{i},p_{i},\boldsymbol{z}_{i}|\theta)=\cdots\times\bigl\{f_{\boldsymbol{Y}_{i}|P_{i},Z_{i1}}(\boldsymbol{y}_{i}|p_{i},z_{i1},\boldsymbol{\mu}_{1},\Sigma_{1})\bigr\}^{z_{i1}}\times\dots\times\bigl\{f_{\boldsymbol{Y}_{i}|P_{i},Z_{iK}}(\boldsymbol{y}_{i}|p_{i},z_{iK},\boldsymbol{\mu}_{K},\Sigma_{K})\bigr\}^{z_{iK}},$$

$$l_{c}(\theta)=C-\frac{n}{2}\sum_{j=1}^{K}\sum_{i=1}^{n}z_{ij}\log|\Sigma_{j}|-\frac{1}{2}\sum_{j=1}^{K}\sum_{i=1}^{n}z_{ij}\frac{(\boldsymbol{y}_{i}-\boldsymbol{\mu}_{j})^{T}\Sigma_{j}^{-1}(\boldsymbol{y}_{i}-\boldsymbol{\mu}_{j})}{p_{i}},$$

$$Q\bigl(\theta|\theta^{(t)}\bigr)=C-\frac{n}{2}\sum_{j=1}^{K}\log|\Sigma_{j}|-\frac{1}{2}\sum_{j=1}^{K}\sum_{i=1}^{n}e_{1ij}^{(t)}e_{2ij}^{(t)}(\boldsymbol{y}_{i}-\boldsymbol{\mu}_{j})^{T}\Sigma_{j}^{-1}(\boldsymbol{y}_{i}-\boldsymbol{\mu}_{j}),$$

$$e_{1ij}^{(t)}=E\bigl(Z_{ij}|\boldsymbol{y}_{i},\boldsymbol{\mu}_{j}^{(t)},\Sigma_{j}^{(t)},\alpha_{j}^{(t)}\bigr)=\frac{w_{j}^{(t)}f(\boldsymbol{y}_{i};\alpha_{j}^{(t)},\Sigma_{j}^{(t)},\boldsymbol{\mu}_{j}^{(t)})}{\sum_{k=1}^{K}w_{k}^{(t)}f(\boldsymbol{y}_{i};\alpha_{k}^{(t)},\Sigma_{k}^{(t)},\boldsymbol{\mu}_{k}^{(t)})},$$

$$e_{2ij}^{(t)}=E\bigl(P_{i}^{-1}|\boldsymbol{y}_{i},\boldsymbol{\mu}_{j}^{(t)},\Sigma_{j}^{(t)},\alpha_{j}^{(t)}\bigr).$$

$$w_{j}^{(t+1)}=\frac{1}{n}\sum_{i=1}^{n}e_{1ij}^{(t)},$$

$$\boldsymbol{\mu}_{j}^{(t+1)}=\frac{\sum_{i=1}^{n}e_{1ij}^{(t)}e_{2ij}^{(t)}\boldsymbol{y}_{i}}{\sum_{i=1}^{n}e_{1ij}^{(t)}e_{2ij}^{(t)}},$$

$$\boldsymbol{\mathcal{Y}}_{i}^{j}=\frac{\boldsymbol{Y}_{i}^{j}-\boldsymbol{\mu}_{j}^{(t+1)}}{\sqrt{E_{i}}}\overset{L}{=}\frac{\sqrt{P_{i}}\,\boldsymbol{N}_{i}\bigl(0,\Sigma_{j}^{(t+1)}\bigr)}{\sqrt{E_{i}}},\quad i=1,\dots,n_{j},$$

$$\sqrt{\frac{P_{i}}{E_{i}}}\overset{L}{=}\frac{1}{V_{i}^{j}},$$

$$\boldsymbol{\mathcal{Y}}_{i}^{j}|V_{i}^{j}=v_{i}^{j}\sim N\Bigl(0,\frac{\Sigma_{j}}{(v_{i}^{j})^{2}}\Bigr),$$

$$l_{c}(\alpha_{j},\Sigma_{j})=C+n_{j}\log\alpha_{j}+(d+\alpha_{j}-1)\sum_{i=1}^{n_{j}}\log v_{i}^{j}-\frac{n_{j}}{2}\log|\Sigma_{j}|-\sum_{i=1}^{n_{j}}(v_{i}^{j})^{\alpha_{j}}-\frac{1}{2}\sum_{i=1}^{n_{j}}(v_{i}^{j})^{2}(\boldsymbol{\mathcal{Y}}_{i}^{j})^{T}\Sigma_{j}^{-1}\boldsymbol{\mathcal{Y}}_{i}^{j},$$

$$\Sigma_{j}^{(t+1)}=\frac{1}{n_{j}}\sum_{i=1}^{n_{j}}(v_{i}^{j})^{2}\boldsymbol{\mathcal{Y}}_{i}^{j}(\boldsymbol{\mathcal{Y}}_{i}^{j})^{T},$$

$$h(\alpha_{j})=\frac{n_{j}}{\alpha_{j}}+\sum_{i=1}^{n_{j}}\log v_{i}^{j}-\sum_{i=1}^{n_{j}}(v_{i}^{j})^{\alpha_{j}}\log v_{i}^{j}=0,$$

$$\hat{\lambda}_{j}=\frac{1}{N-N_{0}}\sum_{t=N_{0}+1}^{N}\lambda_{j}^{(t)},$$

$$e^{(t)}_{2ij}=\frac{\int_{0}^{\infty}u^{-d/2-1}f_{P}\bigl(u|\alpha^{(t)}_{j}\bigr)\exp\Bigl(-\frac{(\boldsymbol{y}_{i}-\boldsymbol{\mu}^{(t)}_{j})^{T}(\Sigma^{(t)}_{j})^{-1}(\boldsymbol{y}_{i}-\boldsymbol{\mu}^{(t)}_{j})}{2u}\Bigr)du}{\int_{0}^{\infty}u^{-d/2}f_{P}\bigl(u|\alpha^{(t)}_{j}\bigr)\exp\Bigl(-\frac{(\boldsymbol{y}_{i}-\boldsymbol{\mu}^{(t)}_{j})^{T}(\Sigma^{(t)}_{j})^{-1}(\boldsymbol{y}_{i}-\boldsymbol{\mu}^{(t)}_{j})}{2u}\Bigr)du}.$$

$$f_{\boldsymbol{\mathcal{Y}}^{j}_{i}|V^{j}_{i},\alpha_{j},\Sigma_{j}}(\boldsymbol{\mathcal{Y}}^{j}_{i}|v^{j}_{i},\alpha_{j},\Sigma_{j})=\frac{(v^{j}_{i})^{d}}{(2\pi)^{d/2}|\Sigma_{j}|^{1/2}}\exp\Bigl\{-\frac{\bigl((\boldsymbol{\mathcal{Y}}^{j}_{i})^{T}\Sigma^{-1}_{j}\boldsymbol{\mathcal{Y}}^{j}_{i}\bigr)(v^{j}_{i})^{2}}{2}\Bigr\},$$

$$f_{V_{i}^{j}|\boldsymbol{\mathcal{Y}}_{i}^{j},\alpha_{j},\Sigma_{j}}(v_{i}^{j}|\boldsymbol{\mathcal{Y}}_{i}^{j},\alpha_{j},\Sigma_{j})\propto\frac{\alpha_{j}(v^{j}_{i})^{d+\alpha_{j}-1}}{(2\pi)^{d/2}|\Sigma_{j}|^{1/2}}\exp\Bigl\{-\frac{\bigl((\boldsymbol{\mathcal{Y}}^{j}_{i})^{T}\Sigma^{-1}_{j}\boldsymbol{\mathcal{Y}}^{j}_{i}\bigr)(v^{j}_{i})^{2}}{2}-(v^{j}_{i})^{\alpha_{j}}\Bigr\},$$

$$\frac{\exp\{-d/2\}\Bigl(\frac{d}{(\boldsymbol{\mathcal{Y}}^{j}_{i})^{T}\Sigma^{-1}_{j}\boldsymbol{\mathcal{Y}}^{j}_{i}}\Bigr)^{d/2}}{(2\pi)^{d/2}|\Sigma_{j}|^{1/2}},$$


Taxonomy

Topics: Bayesian Methods and Mixture Models · Statistical Distribution Estimation and Applications · Statistical Methods and Bayesian Inference

Full text

Robust mixture modelling using sub-Gaussian $\alpha$-stable distribution

Mahdi Teimouri

Saeid Rezakhah

Adel Mohammdpour

Abstract: Heavy-tailed distributions are widely used in robust mixture modelling because of their thick tails. As a computationally tractable subclass of the stable distributions, the sub-Gaussian $\alpha$-stable distribution has received much interest in the literature. Here, we introduce a type of expectation-maximization algorithm that estimates the parameters of a mixture of sub-Gaussian $\alpha$-stable distributions. A comparative study, in the presence of some well-known mixture models, is performed to show the robustness and performance of the mixture of sub-Gaussian $\alpha$-stable distributions for modelling simulated, synthetic, and real data.

Keywords: Clustering; EM algorithm; Monte Carlo; Mixture models; Robustness; Stable distribution; sub-Gaussian $\alpha$-stable distribution

1 Introduction

Finite mixture models are convex combinations of two or more probability density functions. Their most important application is model-based clustering, which has focused mainly on mixtures of Gaussian distributions. Despite its popularity, Gaussian-based clustering performs poorly in the presence of outliers, so robust mixture models are becoming increasingly popular. Some of these models aim to tackle tail weight, [1], [4], [23], and [34]; some deal with skewness, [2], [5], [13], and [33]. Stable distributions are used extensively in fields such as finance and telecommunications, [18], [19], [20], [24], and [27]. Modelling datasets gathered from these fields with the normal distribution is inappropriate because of their heavy tails. Except for three cases, the probability density function (pdf) of a stable distribution has no closed form. As a computationally tractable subclass of the multivariate stable distributions, the sub-Gaussian $\alpha$-stable distribution can model processes with outliers. Sub-Gaussian $\alpha$-stable distributions have received much interest in finance and portfolio optimization, [18], [20], and [25], and several attempts have been made to estimate their parameters; among them we cite [3], [21], [26], and [28]. Because the sub-Gaussian $\alpha$-stable distribution allows for heavier tails than Student's $t$ distribution, it can serve as a more flexible tool for robust model-based clustering. On the other hand, the existing approaches for estimating the parameters of the sub-Gaussian $\alpha$-stable distribution cannot be extended to mixtures of sub-Gaussian $\alpha$-stable distributions.
This motivated us to develop a method for estimating the parameters of a mixture of sub-Gaussian $\alpha$-stable distributions. It should be noted that a Bayesian approach for estimating the parameters of such a mixture was suggested in [30], but the authors did not pursue it. Our investigations reveal that the proposed EM algorithm performs better than the Bayesian paradigm in terms of execution time. The structure of this note is as follows. Section 2 gives some preliminaries. The proposed EM algorithm is described in Section 3. Section 4 is devoted to the performance analysis of the proposed EM algorithm through simulated, synthetic, and real data.

2 Preliminaries

Let $\boldsymbol{Y}_{i}=(Y_{i1},\dots,Y_{id})^{T}$ be a sub-Gaussian $\alpha$-stable random vector. Then, for a random sample of size $n$,

$$\boldsymbol{Y}_{i}\overset{L}{=}\boldsymbol{\mu}+\sqrt{P_{i}}\,\boldsymbol{G}_{i},\quad i=1,\dots,n,\tag{2.1}$$

where $\boldsymbol{G}_{i}=(G_{i1},\dots,G_{id})^{T}$ is a zero-mean Gaussian random vector with a positive definite symmetric $d\times d$ shape matrix $\Sigma$, $P_{i}$ follows $S\bigl(\alpha/2,1,\left(\cos(\pi\alpha/4)\right)^{2/\alpha},0\bigr)$, $\boldsymbol{\mu}\in\mathbb{R}^{d}$ is a location parameter, and $0<\alpha<2$. Here, $\overset{L}{=}$ denotes equality in distribution, and $P_{i}$ and $\boldsymbol{G}_{i}$ are statistically independent. The corresponding observed values of $\boldsymbol{Y}_{i}$, $\boldsymbol{G}_{i}$, and $P_{i}$ are $\boldsymbol{y}_{i}=(y_{i1},\dots,y_{id})^{T}$, $\boldsymbol{g}_{i}=(g_{i1},\dots,g_{id})^{T}$, and $p_{i}$, respectively, for $i=1,\dots,n$, [31]. If $f(\boldsymbol{y}_{i}|\alpha_{j},\boldsymbol{\Sigma}_{j},\boldsymbol{\mu}_{j})$ denotes the pdf of $\boldsymbol{Y}_{i}$ with parameters $\alpha_{j}$, $\boldsymbol{\Sigma}_{j}$, and $\boldsymbol{\mu}_{j}$ at the point $\boldsymbol{y}_{i}$, then the pdf of a $K$-component sub-Gaussian $\alpha$-stable mixture model, $g(\boldsymbol{y}_{i}|\theta)$, has the form

$$g(\boldsymbol{y}_{i}|\theta)=\sum_{j=1}^{K}w_{j}f(\boldsymbol{y}_{i}|\alpha_{j},\Sigma_{j},\boldsymbol{\mu}_{j}),\quad i=1,\dots,n,\tag{2.2}$$

where $\theta=(\alpha_{j},\Sigma_{j},\mu_{j};j=1,\dots,K)$, $n$ is the sample size, and the $w_{j}$ are non-negative mixing parameters that sum to one, i.e. $\sum_{j=1}^{K}w_{j}=1$. Hereafter, we use the notation SG $\boldsymbol{\alpha}$ SM for the $K$-component sub-Gaussian $\alpha$-stable mixture distribution, in which $\boldsymbol{\alpha}=(\alpha_{1},\dots,\alpha_{K})^{T}$. Identifiability of the SG $\boldsymbol{\alpha}$ SM distribution with pdf in (2.2) follows from [9].
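The stochastic representation (2.1) suggests a direct way to simulate from the mixture (2.2). The sketch below is illustrative only: it draws the positive stable mixing variable with Kanter's representation, a standard sampler that is swapped in here and is not taken from the paper, relying on the fact that the scale $(\cos(\pi\alpha/4))^{2/\alpha}$ gives $P$ the Laplace transform $\exp(-s^{\alpha/2})$. The function names `positive_stable` and `sample_sgasm` are hypothetical.

```python
import numpy as np

def positive_stable(a, size, rng):
    """Kanter's representation of a positive stable variable with
    Laplace transform exp(-s**a), for 0 < a < 1 (assumed sampler)."""
    u = rng.uniform(0.0, np.pi, size)
    w = rng.exponential(1.0, size)
    return (np.sin(a * u) / np.sin(u) ** (1.0 / a)
            * (np.sin((1.0 - a) * u) / w) ** ((1.0 - a) / a))

def sample_sgasm(n, weights, alphas, mus, sigmas, rng):
    """Draw n points and their labels from a K-component mixture,
    using Y = mu + sqrt(P) * G within each component."""
    weights = np.asarray(weights, dtype=float)
    d = len(mus[0])
    labels = rng.choice(len(weights), size=n, p=weights)
    y = np.empty((n, d))
    for j in range(len(weights)):
        idx = np.flatnonzero(labels == j)
        p = positive_stable(alphas[j] / 2.0, idx.size, rng)
        g = rng.multivariate_normal(np.zeros(d), sigmas[j], size=idx.size)
        y[idx] = np.asarray(mus[j], dtype=float) + np.sqrt(p)[:, None] * g
    return y, labels
```

Sampling the scale once per observation and multiplying a Gaussian draw by its square root keeps the elliptical shape of each component while letting the tail index control how often extreme points appear.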

Missing or incomplete observations frequently occur in statistical studies. The EM algorithm, introduced in [7], is a popular inferential tool for such situations. The EM technique also applies to cases involving latent variables or models with random parameters, provided that they are formulated as a missing value problem. Let $L_{c}(\theta)=f(\boldsymbol{y},\boldsymbol{z}|\theta)$ be the complete data likelihood function, in which $\boldsymbol{y}$ and $\boldsymbol{z}$ denote the vectors of observed and latent observations, respectively. The EM algorithm works iteratively by maximizing the conditional expectation, $Q\left(\theta|\theta^{(t)}\right)$, of the complete log-likelihood function given the available data and a current estimate $\theta^{(t)}$ of the parameter. Each iteration of the EM algorithm involves two steps: the E-step (computing $Q\left(\theta|\theta^{(t)}\right)$ at the $t$-th iteration) and the M-step (maximizing $Q\left(\theta|\theta^{(t)}\right)$ with respect to $\theta$ to get $\theta^{(t+1)}$). The E- and M-steps are repeated until convergence occurs.

As the M-step of the EM algorithm is analytically intractable, we implement this step with a sequence of conditional maximizations, known as CM-steps. This procedure is known as the ECM algorithm, [17]. A faster extension of the EM algorithm, the ECME algorithm, was introduced in [15]. It should be noted that the EM, ECM, and ECME algorithms all have the same E-step. The ECME algorithm works by maximizing the constrained $Q\left(\theta|\theta^{(t)}\right)$ via some CM-steps and maximizing the constrained marginal likelihood function under some constraints on the parameters, [2]. In cases where implementation of the EM algorithm is difficult, another extension of this algorithm, called stochastic EM (SEM), is useful, [6]. We implement SEM by simulating missing data from the conditional density of $P^{(t)}_{i}$ given $y_{i}$ and $\theta^{(t)}$ with pdf $f(p^{(t)}_{i}|y_{i},\theta^{(t)})$; for $i=1,\dots,n$, and substituting the sample $\boldsymbol{P}=(P^{(t)}_{1},\dots,P^{(t)}_{n})^{T}$ into the complete likelihood function. Then, we apply the EM algorithm to the pseudo-complete sample $P^{(t)}_{1},\dots,P^{(t)}_{n}$. This process is repeated until convergence occurs for the distribution of $\{\theta^{(t+1)}\}$. Under some mild regularity conditions, $\{\theta^{(t+1)}\}$ constitutes a Markov chain that converges to a stationary distribution, [11]. The SEM is generally very robust to the starting values, and the number of iterations is determined via an exploratory data analysis approach such as a graphical display, [11]. After a burn-in of $N_{0}$ iterations, the sequence $\{\theta^{(t)}\}$ is expected to be close to some stationary point. After a sufficiently large number of iterations, say $N$, the SEM estimate of $\theta$ is given by

$$\hat{\theta}=\frac{1}{N-N_{0}}\sum_{t=N_{0}+1}^{N}\theta^{(t)},\tag{2.3}$$

where $N_{0}$ is the burn-in size. Accordingly, each iteration of the SEM algorithm consists of the following two steps.

1. Stochastic imputation (S-) step: Substitute the simulated missing values in the pseudo-complete log-likelihood function at the $t$-th iteration.

2. Maximization (M-) step: Find a $\theta$, say $\theta^{(t+1)}$, which maximizes the pseudo-complete log-likelihood function at the $t$-th iteration.

The S- and M-steps above are repeated until convergence occurs.

3 EM algorithm for SG $\alpha$ SM

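We consider $\boldsymbol{y}_{1},\dots,\boldsymbol{y}_{n},\boldsymbol{z}_{1},\dots,\boldsymbol{z}_{n},{p}_{1},\dots,{p}_{n}$ as the complete data corresponding to (2.2), where $\boldsymbol{y}_{1},\dots,\boldsymbol{y}_{n}$ are observed data, $\boldsymbol{z}_{1},\dots,\boldsymbol{z}_{n}$ are component labels, and ${p}_{1},\dots,{p}_{n}$ are missing observations. That is, if the $j$-th component, for $j=1,\dots,K$, of $\boldsymbol{Z}_{i}=(Z_{i1},\dots,Z_{iK})^{T}$ is one, then the other components are zero and the $i$-th observation comes from the $j$-th component. This occurs with probability $w_{j}$. We have

$$\boldsymbol{Y}_{i}|P_{i}=p_{i},Z_{ij}=1\sim N(\boldsymbol{\mu}_{j},p_{i}\Sigma_{j}),\tag{3.1}$$

independently and

$$P_{i}|Z_{ij}=1\sim S\Bigl(\alpha_{j}/2,1,\bigl(\cos(\pi\alpha_{j}/4)\bigr)^{2/\alpha_{j}},0\Bigr),\tag{3.2}$$

for $j=1,\dots,K$ and $i=1,\dots,n$. It should be noted that, given ${Z}_{ij}=1$, the $P_{i}$ are independent. So, the complete data density function can be represented as

$$f_{c}(\boldsymbol{y}_{1},\dots,\boldsymbol{y}_{n},\boldsymbol{z}_{1},\dots,\boldsymbol{z}_{n},p_{1},\dots,p_{n}|\theta)=\prod_{i=1}^{n}f(\boldsymbol{y}_{i},p_{i},\boldsymbol{z}_{i}|\theta),\tag{3.3}$$

where

$$f(\boldsymbol{y}_{i},p_{i},\boldsymbol{z}_{i}|\theta)=\cdots\times\bigl\{f_{\boldsymbol{Y}_{i}|P_{i},Z_{i1}}(\boldsymbol{y}_{i}|p_{i},z_{i1},\boldsymbol{\mu}_{1},\Sigma_{1})\bigr\}^{z_{i1}}\times\dots\times\bigl\{f_{\boldsymbol{Y}_{i}|P_{i},Z_{iK}}(\boldsymbol{y}_{i}|p_{i},z_{iK},\boldsymbol{\mu}_{K},\Sigma_{K})\bigr\}^{z_{iK}},\tag{3.4}$$

where $z_{ij}\in\{0,1\}$ and $\sum_{j=1}^{K}z_{ij}=1$. It follows from relations (3.3) and (3.4) that the complete data log-likelihood $l_{c}(\theta)$ has the following representation

$$l_{c}(\theta)=C-\frac{n}{2}\sum_{j=1}^{K}\sum_{i=1}^{n}z_{ij}\log|\Sigma_{j}|-\frac{1}{2}\sum_{j=1}^{K}\sum_{i=1}^{n}z_{ij}\frac{(\boldsymbol{y}_{i}-\boldsymbol{\mu}_{j})^{T}\Sigma_{j}^{-1}(\boldsymbol{y}_{i}-\boldsymbol{\mu}_{j})}{p_{i}},\tag{3.5}$$

where ${C}$ is a constant independent of $\theta=(\alpha_{j},\Sigma_{j},\mu_{j};j=1,\dots,K)$. Considering $l_{c}(\theta)$ as a function of the component labels and the missing variables $P_{i}$, its conditional expectation $Q\left(\theta|\theta^{(t)}\right)=E_{P}\left(l_{c}(\theta)|\boldsymbol{y},\theta^{(t)}\right)$ becomes

$$Q\bigl(\theta|\theta^{(t)}\bigr)=C-\frac{n}{2}\sum_{j=1}^{K}\log|\Sigma_{j}|-\frac{1}{2}\sum_{j=1}^{K}\sum_{i=1}^{n}e_{1ij}^{(t)}e_{2ij}^{(t)}(\boldsymbol{y}_{i}-\boldsymbol{\mu}_{j})^{T}\Sigma_{j}^{-1}(\boldsymbol{y}_{i}-\boldsymbol{\mu}_{j}),\tag{3.6}$$

where

$$e_{1ij}^{(t)}=E\bigl(Z_{ij}|\boldsymbol{y}_{i},\boldsymbol{\mu}_{j}^{(t)},\Sigma_{j}^{(t)},\alpha_{j}^{(t)}\bigr)=\frac{w_{j}^{(t)}f(\boldsymbol{y}_{i};\alpha_{j}^{(t)},\Sigma_{j}^{(t)},\boldsymbol{\mu}_{j}^{(t)})}{\sum_{k=1}^{K}w_{k}^{(t)}f(\boldsymbol{y}_{i};\alpha_{k}^{(t)},\Sigma_{k}^{(t)},\boldsymbol{\mu}_{k}^{(t)})},\tag{3.7}$$

in which $f({\boldsymbol{y}}_{i};\alpha^{(t)}_{j},\Sigma^{(t)}_{j},\boldsymbol{\mu}^{(t)}_{j})$ is the pdf of a sub-Gaussian $\alpha$-stable random vector $\boldsymbol{Y}_{i}$ defined in (2.1) and

$$e_{2ij}^{(t)}=E\bigl(P_{i}^{-1}|\boldsymbol{y}_{i},\boldsymbol{\mu}_{j}^{(t)},\Sigma_{j}^{(t)},\alpha_{j}^{(t)}\bigr).\tag{3.8}$$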

E-step: The E-step is completed by computing $e^{(t)}_{1ij}$ and $e^{(t)}_{2ij}$. Details for computing these quantities are given in Appendix 1. For this, we use package $\mathsf{STABLE}$, [29].
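As a rough illustration of the membership part of the E-step, the sketch below turns component densities into posterior membership probabilities. It assumes the densities have already been evaluated by some other routine (the paper uses the STABLE package); the function name `responsibilities` and the array layout are hypothetical.

```python
import numpy as np

def responsibilities(weights, dens):
    """Posterior membership probabilities e1[i, j].

    weights: length-K mixing weights.
    dens:    (n, K) array, dens[i, j] = density of component j at y_i,
             assumed to be supplied by an external stable-density routine.
    """
    num = np.asarray(weights, dtype=float)[None, :] * np.asarray(dens, dtype=float)
    # Normalize each row so that the K memberships of y_i sum to one.
    return num / num.sum(axis=1, keepdims=True)
```

Each row of the result sums to one, so the largest entry in a row can later be used to assign the observation to a group.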

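M-step: The M-step of the EM algorithm updates the weight and location parameters of the $j$-th component in the $(t+1)$-th iteration as follows.

$$w_{j}^{(t+1)}=\frac{1}{n}\sum_{i=1}^{n}e_{1ij}^{(t)},\tag{3.9}$$

$$\boldsymbol{\mu}_{j}^{(t+1)}=\frac{\sum_{i=1}^{n}e_{1ij}^{(t)}e_{2ij}^{(t)}\boldsymbol{y}_{i}}{\sum_{i=1}^{n}e_{1ij}^{(t)}e_{2ij}^{(t)}}.\tag{3.10}$$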
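The weight and location updates are simple weighted averages, which the following sketch makes explicit. It assumes the E-step quantities are stored as $n\times K$ arrays; the function name `m_step` is hypothetical.

```python
import numpy as np

def m_step(y, e1, e2):
    """Update mixing weights and locations from E-step quantities.

    y:  (n, d) observations.
    e1: (n, K) membership probabilities.
    e2: (n, K) conditional expectations of 1 / P_i.
    """
    w_new = e1.mean(axis=0)                       # average membership per component
    r = e1 * e2                                   # combined weight of y_i in component j
    mu_new = (r.T @ y) / r.sum(axis=0)[:, None]   # weighted mean, one row per component
    return w_new, mu_new
```

Because $e_{2ij}$ is small for observations far from the component centre, such points contribute little to the location update, which is where the robustness of the update comes from.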

The shape matrix can be updated analytically in the M-step, but we prefer to update it along with the tail index in a CM-step (this reduces computational costs). At the $(t+1)$-th iteration, the updated quantities $e_{1ij}^{(t+1)}$, $e_{2ij}^{(t+1)}$, $w_{j}^{(t+1)}$, and $\mu_{j}^{(t+1)}$; for $j=1,\dots,K$ and $i=1,\dots,n$, are evaluated from (3.7)-(3.10), respectively. Using these updates, we go on to update $\alpha^{(t)}_{j}$ and $\Sigma_{j}^{(t)}$; for $j=1,\dots,K$, in the CM-step. It should be noted that the CM-step can be implemented using numerical optimization tools. However, Remark 3.1, which suggests using a stochastic EM (SEM) algorithm, leads to a mathematically and computationally tractable CM-step.

Remark 3.1

Suppose that $P$ is a positive stable random variable with tail index $\alpha/2$ and $E$ is an exponential random variable with mean one. Then, the quotient $E/P$ follows a Weibull distribution, independent of $E$ , with shape parameter $\alpha/2$ and scale parameter unity, [31].
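A short derivation may make the remark concrete. It is a sketch under two assumptions: $E$ is independent of $P$, and $P$ is scaled as in (2.1), so that its Laplace transform is $\mathbb{E}\,e^{-sP}=\exp(-s^{\alpha/2})$ for $s\geq 0$.

```latex
\begin{aligned}
\Pr\bigl(E/P > x\bigr)
  &= \mathbb{E}\bigl[\Pr(E > xP \mid P)\bigr]
   = \mathbb{E}\bigl[e^{-xP}\bigr]
   = \exp\bigl(-x^{\alpha/2}\bigr), \qquad x > 0, \\
\Pr\bigl(\sqrt{E/P} > x\bigr)
  &= \Pr\bigl(E/P > x^{2}\bigr)
   = \exp\bigl(-x^{\alpha}\bigr).
\end{aligned}
```

The first line is the survival function of a Weibull distribution with shape $\alpha/2$ and unit scale. The second line shows that the square root of the quotient is Weibull with shape $\alpha$ and unit scale, which matches the Weibull proposal with shape $\alpha_{j}$ used in Appendix 2.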

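• First step of CM: We consider $K$ groups $I_{1},\dots,I_{K}$. Let $e^{(t+1)}_{1i}=(e^{(t+1)}_{1i1},\dots,e^{(t+1)}_{1iK})$, where $e^{(t)}_{1ij}$ is defined by (3.7). If the $j$-th component of $e^{(t+1)}_{1i}$ is larger than the other components, then $\boldsymbol{Y}_{i}$ is assigned to the $j$-th group $I_{j}$; for $i=1,\dots,n$, $j=1,\dots,K$. Now, $I_{j}$, whose size is $n_{j}$, consists of $\boldsymbol{Y}^{j}=(\boldsymbol{Y}^{j}_{1},\dots,\boldsymbol{Y}^{j}_{n_{j}})$, and $\sum_{j=1}^{K}n_{j}=n$. Using (2.1) and Remark 3.1, it follows that

$$\boldsymbol{\mathcal{Y}}_{i}^{j}=\frac{\boldsymbol{Y}_{i}^{j}-\boldsymbol{\mu}_{j}^{(t+1)}}{\sqrt{E_{i}}}\overset{L}{=}\frac{\sqrt{P_{i}}\,\boldsymbol{N}_{i}\bigl(0,\Sigma_{j}^{(t+1)}\bigr)}{\sqrt{E_{i}}},\quad i=1,\dots,n_{j},$$

where $E_{i}$ is an exponential random variable with mean one independent of $P_{i}$, and $\boldsymbol{N}_{i}$ is a $d$-dimensional zero-mean normal random vector with shape matrix $\Sigma^{(t+1)}_{j}$. It is easy to check that

$$\sqrt{\frac{P_{i}}{E_{i}}}\overset{L}{=}\frac{1}{V_{i}^{j}},$$

where $V^{j}_{i}$ follows a Weibull distribution with shape parameter $\alpha_{j}$ and scale parameter one. Therefore,

$$\boldsymbol{\mathcal{Y}}_{i}^{j}|V_{i}^{j}=v_{i}^{j}\sim N\Bigl(0,\frac{\Sigma_{j}}{(v_{i}^{j})^{2}}\Bigr),$$

for $i=1,\dots,n_{j}$, $j=1,\dots,K$. By considering $v^{j}_{i}$ as the missing observation, the complete data log-likelihood of the $j$-th group is

$$l_{c}(\alpha_{j},\Sigma_{j})=C+n_{j}\log\alpha_{j}+(d+\alpha_{j}-1)\sum_{i=1}^{n_{j}}\log v_{i}^{j}-\frac{n_{j}}{2}\log|\Sigma_{j}|-\sum_{i=1}^{n_{j}}(v_{i}^{j})^{\alpha_{j}}-\frac{1}{2}\sum_{i=1}^{n_{j}}(v_{i}^{j})^{2}(\boldsymbol{\mathcal{Y}}_{i}^{j})^{T}\Sigma_{j}^{-1}\boldsymbol{\mathcal{Y}}_{i}^{j},$$

where ${C}$ is a constant independent of $\alpha_{j}$ and $\Sigma_{j}$; for $j=1,\dots,K$.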

• Second step of CM (first step of SEM): For group $I_{j}$, simulate the vector $\boldsymbol{V}^{j}=(V^{j}_{1},\dots,V^{j}_{n_{j}})^{T}$ from the conditional distribution of $V^{j}_{i}$ given $\boldsymbol{\mathcal{Y}}^{j}_{i}$, $\alpha_{j}$, and $\Sigma_{j}$; for $i=1,\dots,n_{j}$ and $j=1,\dots,K$, using the Monte Carlo method described in Appendix 2, at the $N$-th cycle of the stochastic step.

• Third step of CM (second step of SEM): Using the pseudo sample $\boldsymbol{v}^{j}=(v^{j}_{1},\dots,v^{j}_{n_{j}})^{T}$, maximize the complete data log-likelihood of the $j$-th group with respect to $\alpha_{j}$ and $\Sigma_{j}$ to find $\alpha_{j}^{(t+1)}$ and $\Sigma_{j}^{(t+1)}$ as

$$\Sigma_{j}^{(t+1)}=\frac{1}{n_{j}}\sum_{i=1}^{n_{j}}(v_{i}^{j})^{2}\boldsymbol{\mathcal{Y}}_{i}^{j}(\boldsymbol{\mathcal{Y}}_{i}^{j})^{T},$$

and ${\alpha}^{(t+1)}_{j}$ is a solution of

$$h(\alpha_{j})=\frac{n_{j}}{\alpha_{j}}+\sum_{i=1}^{n_{j}}\log v_{i}^{j}-\sum_{i=1}^{n_{j}}(v_{i}^{j})^{\alpha_{j}}\log v_{i}^{j}=0,$$

for $j=1,\dots,K$.
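The third CM-step can be sketched as a closed-form covariance-type update plus a one-dimensional root search. The code below is an illustration under stated assumptions, not the authors' implementation: the shape-matrix update is the Gaussian maximum likelihood estimate implied by the conditional density of Appendix 2, the root of the score equation is found with `scipy.optimize.brentq`, and the search interval `(lo, hi)` together with the clipping at `hi` are choices made here to respect $0<\alpha<2$. The names `weibull_shape_score` and `cm_update` are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq

def weibull_shape_score(alpha, v):
    """Score equation for the tail index: n/alpha + sum(log v) - sum(v**alpha * log v)."""
    logv = np.log(v)
    return v.size / alpha + logv.sum() - np.sum(v ** alpha * logv)

def cm_update(y_scaled, v, lo=1e-3, hi=2.0):
    """Update the tail index and shape matrix of one group.

    y_scaled: (n_j, d) centred and rescaled observations of the group.
    v:        (n_j,) simulated positive latent variables.
    """
    # Weighted scatter matrix: each observation is weighted by v_i**2.
    sigma_new = (y_scaled * (v ** 2)[:, None]).T @ y_scaled / v.size
    # The score is strictly decreasing in alpha, so a sign change brackets the root.
    if weibull_shape_score(hi, v) >= 0.0:
        alpha_new = hi  # assumed rule: clip when the root lies beyond the admissible range
    else:
        alpha_new = brentq(weibull_shape_score, lo, hi, args=(v,))
    return alpha_new, sigma_new
```

Solving a scalar equation for the tail index avoids evaluating the stable density inside the maximization, which is the computational gain that the Weibull representation provides.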

Now substitute $\alpha_{j}^{(t+1)}$ and $\Sigma_{j}^{(t+1)}$ into the right-hand sides of (3.7) and (3.8). This completes the E-step. Then, complete the M-step by updating the weight and location parameters in (3.9) and (3.10), and follow the three CM-steps. Repeating this loop $N$ times, the estimated parameters of the SG $\boldsymbol{\alpha}$ SM are obtained as follows.

$$\hat{\lambda}_{j}=\frac{1}{N-N_{0}}\sum_{t=N_{0}+1}^{N}\lambda_{j}^{(t)},\tag{3.12}$$

where $N_{0}$ is the burn-in size for the stochastic EM involved in the CM-step and $\lambda_{j}^{(t)}$ is either $\alpha^{(t)}_{j}$, $\Sigma_{j}^{(t)}$, $\boldsymbol{\mu}^{(t)}_{j}$, or ${w}^{(t)}_{j}$ in the $t$-th iteration of the EM algorithm for the $j$-th group. Our studies reveal that setting $N=70$ and $N_{0}=40$ in (3.12) provides satisfactory accuracy in the estimates.
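The averaging in (3.12) amounts to discarding the first $N_{0}$ iterates and taking the mean of the rest. A minimal sketch, assuming the iterates of one parameter are stored in order in an array-like `trace` (a hypothetical name, as is `sem_average`):

```python
import numpy as np

def sem_average(trace, n_burn):
    """Average the iterates t = n_burn + 1, ..., N of a parameter trace.

    trace[t - 1] is assumed to hold the value at iteration t; parameters may
    be scalars, vectors, or matrices, as long as the trace stacks cleanly.
    """
    trace = np.asarray(trace, dtype=float)
    return trace[n_burn:].mean(axis=0)
```

With the settings reported in the text, a call would look like `sem_average(alpha_trace, 40)` on a trace of length 70, where `alpha_trace` is again a hypothetical variable.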

In order to implement the proposed EM algorithm, one can use the Bayesian information criterion (BIC) to estimate the number of clusters, $K$. The BIC is defined as $BIC=m\log(n)-2\log(L)$, where $\log(L)$ is the log-likelihood of the observations under the SG $\boldsymbol{\alpha}$ SM distribution, $m$ is the number of free parameters, and $n$ is the sample size, [14]. Having determined $K$, to implement the proposed EM algorithm we first use package cluster, [16], for pre-clustering and then package $\mathsf{STABLE}$, [29], for initial estimates of the parameters within each cluster.
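The BIC formula is easy to evaluate once the log-likelihood is available. The sketch below also counts the free parameters; that count is an assumption made here (it is not stated in the text): $K-1$ mixing weights, and for each component one tail index, $d$ location coordinates, and $d(d+1)/2$ entries of the symmetric shape matrix. The names `n_free_params` and `bic` are hypothetical.

```python
import math

def n_free_params(K, d):
    """Assumed parameter count for a K-component, d-dimensional mixture."""
    per_component = 1 + d + d * (d + 1) // 2   # tail index, location, shape matrix
    return (K - 1) + K * per_component

def bic(loglik, n, K, d):
    """BIC = m * log(n) - 2 * log(L); smaller values are preferred."""
    return n_free_params(K, d) * math.log(n) - 2.0 * loglik
```

In practice one would fit the model for several candidate values of $K$ and keep the value with the smallest BIC.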

4 Simulation study and real data analysis

This section has three parts. Firstly, we evaluate the performance of the proposed EM algorithm in modelling data simulated from the SG $\boldsymbol{\alpha}$ SM distribution. We also check the robustness of the SG $\boldsymbol{\alpha}$ SM distribution when data are generated from a mixture of $t$ (MT) distributions, in comparison with a mixture of skew $t$ (MST), a mixture of normal (MN), a mixture of skew normal (MSN), and a mixture of generalized hyperbolic (MGH) distributions. Secondly, for synthetic data analysis, we choose four stocks among the 30 stocks of the Dow Jones data, [21]. Finally, for real data analysis, we focus on the $\mathsf{banknote}$ data set, which consists of six variables measured on 100 genuine and 100 counterfeit Swiss bank notes. This dataset is available by loading package $\mathsf{MixGHD}$, [32]. It should be noted that, during the analyses, we use package $\mathsf{mixsmsn}$ to model data via the MT, MST, MN, and MSN distributions, [22]. Also, package $\mathsf{MixGHD}$ is applied for modelling data by the MGH distribution and for computing the adjusted Rand index (ARI) as a measure of performance, [10].

Example 1: The performance of the SG $\boldsymbol{\alpha}$ SM distribution is investigated through a small simulation study. We simulate 200 realizations from a 3-component SG $\boldsymbol{\alpha}$ SM distribution under the settings: $n=600$, $\boldsymbol{w}=(1/3,1/3,1/3)^{T}$, $\boldsymbol{\mu}_{1}=(0,3)^{T}$, $\boldsymbol{\mu}_{2}=(3,0)^{T}$, $\boldsymbol{\mu}_{3}=(-3,0)^{T}$, $\boldsymbol{\Sigma}_{1}=\begin{pmatrix}2&0.5\\ 0.5&0.5\\ \end{pmatrix}$, $\boldsymbol{\Sigma}_{2}=\begin{pmatrix}1&0\\ 0&1\\ \end{pmatrix}$, and $\boldsymbol{\Sigma}_{3}=\begin{pmatrix}2&-0.5\\ -0.5&0.5\\ \end{pmatrix}$. We take these parameter settings from [23]. Figure 1 displays the ARI computed under each of the six mixture models. As expected, Figure 1(a) shows that the SG $\boldsymbol{\alpha}$ SM model gives the best performance in the sense of the median ARI. In order to investigate the robustness of the SG $\boldsymbol{\alpha}$ SM model, we simulate 200 realizations from a 3-component MT distribution under the above settings (the degrees of freedom for the first, second, and third components are $\nu_{1}=2$, $\nu_{2}=4$, and $\nu_{3}=8$, respectively). Figure 1(b) shows the computed ARIs when data are generated from the MT distribution. Surprisingly, the MGH and SG $\boldsymbol{\alpha}$ SM models show better performance than MT based on the median ARI.

Example 2: We choose the stocks AXP, JPM, MCD, and SBC from the Dow Jones data. It can be checked that a symmetric $\alpha$-stable pdf captures well the empirical distribution of these stocks based on 1247 observations with almost the same tail indices. Also, the joint scatterplots of these stocks are roughly elliptical. This means that a SG $\alpha$ S distribution is suitable for modelling these stocks, [21]. Define $\boldsymbol{X}_{1}=(\text{JPM},\text{MCD})^{T}$ and $\boldsymbol{X}_{2}=(\text{AXP}-\delta,\text{SBC})^{T}$. The scatterplots of $\boldsymbol{X}_{1}$ and $\boldsymbol{X}_{2}$ overlap completely when $\delta$ is zero and are well separated when $\delta$ is large (say $\delta>0.3$). Figure 2 displays the computed ARI for $\delta=0.12, 0.1, 0.07, 0.06, 0.045, 0.03, 0.025$. As can be seen, the SG $\boldsymbol{\alpha}$ SM distribution shows the best performance.

Example 3: Among the variables, we choose the width of the left edge (left) and the bottom margin width (bottom) from the $\mathsf{banknote}$ data, [8]. The computed ARIs for the SG $\boldsymbol{\alpha}$ SM, MT, MST, MN, MSN, and MGH models are 0.721102, 0.704122, 0, 0.704122, 0, and 0.7041185, respectively. These results indicate that SG $\boldsymbol{\alpha}$ SM gives the best performance.

5 Conclusion remarks

The E-step of the EM algorithm for calculating maximum likelihood estimates of the parameters of the sub-Gaussian $\alpha$-stable mixture (SG $\boldsymbol{\alpha}$ SM) distribution is not computationally tractable. We propose here a methodology that makes it possible to evaluate the intractable E-step for the SG $\boldsymbol{\alpha}$ SM distribution. We assume that the number of components is known and that starting values for the EM algorithm are obtained using statistical packages developed for clustering. A simulation study reveals that the proposed EM algorithm is robust to starting values, outliers, and deviations from model assumptions. This is demonstrated when data are generated from a mixture of $t$ distributions. Also, the performance of the proposed EM algorithm is demonstrated using synthetic and real data. We hope practitioners find this model useful for practical purposes. As possible future work, we would like to extend the methodology proposed here to the operator SG $\boldsymbol{\alpha}$ SM distribution, in which the components of each cluster have different tail weights.

Appendix 1

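At the $(t+1)$-th iteration of the E-step, to compute $e^{(t+1)}_{1ij}$, we need to compute the pdf of a SG $\alpha$ S distribution, i.e. $f\bigl({\boldsymbol{y}}_{i};\alpha^{(t)}_{j},\Sigma^{(t)}_{j},\boldsymbol{\mu}^{(t)}_{j}\bigr)$. Also, it can be checked that

$$e^{(t)}_{2ij}=\frac{\int_{0}^{\infty}u^{-d/2-1}f_{P}\bigl(u|\alpha^{(t)}_{j}\bigr)\exp\Bigl(-\frac{(\boldsymbol{y}_{i}-\boldsymbol{\mu}^{(t)}_{j})^{T}(\Sigma^{(t)}_{j})^{-1}(\boldsymbol{y}_{i}-\boldsymbol{\mu}^{(t)}_{j})}{2u}\Bigr)du}{\int_{0}^{\infty}u^{-d/2}f_{P}\bigl(u|\alpha^{(t)}_{j}\bigr)\exp\Bigl(-\frac{(\boldsymbol{y}_{i}-\boldsymbol{\mu}^{(t)}_{j})^{T}(\Sigma^{(t)}_{j})^{-1}(\boldsymbol{y}_{i}-\boldsymbol{\mu}^{(t)}_{j})}{2u}\Bigr)du}.$$

Both $e^{(t+1)}_{1ij}$ and $e^{(t+1)}_{2ij}$ are computed using package $\mathsf{STABLE}$.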
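For readers without access to the STABLE package, the ratio of integrals for $e_{2ij}$ can be approximated by plain Monte Carlo, since both integrals are expectations over the positive stable mixing variable. This is an alternative sketched here under assumptions, not the authors' procedure: the mixing variable is drawn with Kanter's representation (valid because its Laplace transform is $\exp(-s^{\alpha/2})$), and the names `positive_stable` and `e2_monte_carlo` as well as the default number of draws are hypothetical.

```python
import numpy as np

def positive_stable(a, size, rng):
    """Kanter's representation: positive stable draws with Laplace
    transform exp(-s**a), for 0 < a < 1 (assumed sampler)."""
    u = rng.uniform(0.0, np.pi, size)
    w = rng.exponential(1.0, size)
    return (np.sin(a * u) / np.sin(u) ** (1.0 / a)
            * (np.sin((1.0 - a) * u) / w) ** ((1.0 - a) / a))

def e2_monte_carlo(y, mu, sigma, alpha, rng, n_draws=100_000):
    """Monte Carlo estimate of E(1/P | y) for one observation and component."""
    r = np.asarray(y, dtype=float) - np.asarray(mu, dtype=float)
    d = r.size
    q = r @ np.linalg.solve(np.asarray(sigma, dtype=float), r)  # quadratic form
    u = positive_stable(alpha / 2.0, n_draws, rng)
    # log of u**(-d/2) * exp(-q / (2u)), shifted by its maximum for stability.
    log_k = -0.5 * d * np.log(u) - q / (2.0 * u)
    k = np.exp(log_k - log_k.max())
    # Numerator carries one extra factor 1/u compared with the denominator.
    return np.sum(k / u) / np.sum(k)
```

The estimate shrinks as the observation moves away from the component centre, so distant points receive small weight in the location update.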

Appendix 2

To simulate the pseudo-complete data from the conditional distribution of $V^{j}_{i}$ given $\boldsymbol{\mathcal{Y}}^{j}_{i}$, $\alpha_{j}$, and $\Sigma_{j}$; for $i=1,\dots,n_{j}$ and $j=1,\dots,K$, we use rejection sampling, following the idea in [30]. We note that the density function

$$f_{\boldsymbol{\mathcal{Y}}^{j}_{i}|V^{j}_{i},\alpha_{j},\Sigma_{j}}(\boldsymbol{\mathcal{Y}}^{j}_{i}|v^{j}_{i},\alpha_{j},\Sigma_{j})=\frac{(v^{j}_{i})^{d}}{(2\pi)^{d/2}|\Sigma_{j}|^{1/2}}\exp\Bigl\{-\frac{\bigl((\boldsymbol{\mathcal{Y}}^{j}_{i})^{T}\Sigma^{-1}_{j}\boldsymbol{\mathcal{Y}}^{j}_{i}\bigr)(v^{j}_{i})^{2}}{2}\Bigr\},$$

as a part of the conditional (posterior) density function

$$f_{V_{i}^{j}|\boldsymbol{\mathcal{Y}}_{i}^{j},\alpha_{j},\Sigma_{j}}(v_{i}^{j}|\boldsymbol{\mathcal{Y}}_{i}^{j},\alpha_{j},\Sigma_{j})\propto\frac{\alpha_{j}(v^{j}_{i})^{d+\alpha_{j}-1}}{(2\pi)^{d/2}|\Sigma_{j}|^{1/2}}\exp\Bigl\{-\frac{\bigl((\boldsymbol{\mathcal{Y}}^{j}_{i})^{T}\Sigma^{-1}_{j}\boldsymbol{\mathcal{Y}}^{j}_{i}\bigr)(v^{j}_{i})^{2}}{2}-(v^{j}_{i})^{\alpha_{j}}\Bigr\},$$

is bounded by some constant independent of $v^{j}_{i}$. More precisely, by differentiating the density $f_{\boldsymbol{\mathcal{Y}}^{j}_{i}|V^{j}_{i},\alpha_{j},\Sigma_{j}}(\boldsymbol{\mathcal{Y}}^{j}_{i}|v^{j}_{i},\alpha_{j},\Sigma_{j})$ with respect to $v^{j}_{i}$, it turns out that it attains its maximum

$$\frac{\exp\{-d/2\}\Bigl(\frac{d}{(\boldsymbol{\mathcal{Y}}^{j}_{i})^{T}\Sigma^{-1}_{j}\boldsymbol{\mathcal{Y}}^{j}_{i}}\Bigr)^{d/2}}{(2\pi)^{d/2}|\Sigma_{j}|^{1/2}},$$

at the point $v^{j}_{i}=\sqrt{\frac{d}{(\boldsymbol{\mathcal{Y}}^{j}_{i})^{T}\Sigma^{-1}_{j}\boldsymbol{\mathcal{Y}}^{j}_{i}}}$. Hence, rejection sampling is employed to generate from the posterior distribution by the following steps.

1. Simulate a sample, say $v^{j}_{i}$, from a Weibull distribution with shape parameter $\alpha_{j}$ and scale unity.

2. Define $b=\frac{d^{d/2}\left((\boldsymbol{\mathcal{Y}}^{j}_{i})^{T}\Sigma^{-1}_{j}\boldsymbol{\mathcal{Y}}^{j}_{i}\right)^{-d/2}\exp\{-d/2\}}{(2\pi)^{d/2}|\Sigma_{j}|^{1/2}}$ and generate a sample from a uniform distribution $U\left(0,b\right)$, say $u$.

3. If $u<\frac{(v^{j}_{i})^{d}\exp\{-\frac{1}{2}\left((\boldsymbol{\mathcal{Y}}^{j}_{i})^{T}\Sigma^{-1}_{j}\boldsymbol{\mathcal{Y}}^{j}_{i}\right)(v^{j}_{i})^{2}\}}{(2\pi)^{d/2}|\Sigma_{j}|^{1/2}}$, then accept $v^{j}_{i}$ as an observation from the pdf of $V^{j}_{i}$ given $\boldsymbol{\mathcal{Y}}^{j}_{i}$, $\alpha_{j}$, and $\Sigma_{j}$; for $i=1,\dots,n_{j}$ and $j=1,\dots,K$; otherwise, go to step 1.
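The three steps above can be sketched in a few lines. The code assumes the quadratic form $q=(\boldsymbol{\mathcal{Y}}^{j}_{i})^{T}\Sigma^{-1}_{j}\boldsymbol{\mathcal{Y}}^{j}_{i}$ has already been computed and is strictly positive; it works with the ratio of the acceptance threshold to the bound $b$, in which the common factor $(2\pi)^{d/2}|\Sigma_{j}|^{1/2}$ cancels. The function name `sample_v` and the `max_tries` safeguard are hypothetical additions.

```python
import numpy as np

def sample_v(q, alpha, d, rng, max_tries=100_000):
    """Rejection sampler for the latent Weibull-type variable of one observation.

    q:     quadratic form of the rescaled observation (must be > 0).
    alpha: tail index of the component.
    d:     dimension of the observations.
    """
    # Log of the maximum of v**d * exp(-q * v**2 / 2), attained at v = sqrt(d / q).
    log_max = 0.5 * d * np.log(d / q) - 0.5 * d
    for _ in range(max_tries):
        v = rng.weibull(alpha)  # step 1: proposal, Weibull(shape alpha, scale 1)
        log_ratio = d * np.log(v) - 0.5 * q * v * v - log_max
        # Steps 2 and 3: u ~ U(0, b) with u < f(v) is equivalent to U(0, 1) < f(v) / b.
        if np.log(rng.uniform()) < log_ratio:
            return v
    raise RuntimeError("no proposal accepted within max_tries")
```

Working on the log scale avoids underflow when $q$ is large, which is the typical situation for outlying observations.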

Bibliography

[1] Andrews, J. L. and McNicholas, P. D. (2012). Model-based clustering, classification, and discriminant analysis via mixtures of multivariate $t$-distributions, Statistics and Computing, 22, 1021-1029.
[2] Basso, R. M., Lachos, V. H., Cabral, C. R. B., and Ghosh, P. (2010). Robust mixture modeling based on scale mixtures of skew-normal distributions, Computational Statistics and Data Analysis, 54, 2926-2941.
[3] Bodnar, T. and Gupta, A. K. (2011). Estimation of the precision matrix of a multivariate elliptically contoured stable distribution, Statistics, 45(2), 131-142.
[4] Browne, R. P. and McNicholas, P. D. (2015). A mixture of generalized hyperbolic distributions, The Canadian Journal of Statistics, 43(2), 176-198.
[5] Cabral, C. R. B., Lachos, V. H., and Prates, M. O. (2012). Multivariate mixture modeling using skew-normal independent distributions, Computational Statistics and Data Analysis, 56, 126-142.
[6] Celeux, G. and Diebolt, J. (1985). The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem, Computational Statistics Quarterly, 2(1), 73-82.
[7] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B, 39, 1-38.
[8] Fraley, C., Raftery, A. E., and Scrucca, L. (2016). mclust: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation, R package version 5.2.
