Latent Mixture Modeling for Clustered Data

Shonosuke Sugasawa; Genya Kobayashi; Yuki Kawakubo

arXiv:1704.05993·stat.ME·September 10, 2019·Stat. Comput.

Latent Mixture Modeling for Clustered Data

Shonosuke Sugasawa, Genya Kobayashi, Yuki Kawakubo

PDF

Open Access

TL;DR

This paper introduces a novel mixture modeling approach for clustered data, utilizing latent experts and Dirichlet-distributed mixing proportions, estimated via a Monte Carlo EM algorithm, with extensions for covariate dependence.

Contribution

It develops a new mixture-of-experts model for clustered data with a Monte Carlo EM estimation method and covariate-dependent mixing proportions.

Findings

01

The proposed model outperforms existing mixture models in simulations.

02

It effectively captures cluster-specific distributions.

03

Application to Japanese land price data demonstrates practical utility.

Abstract

This article proposes a mixture modeling approach to estimating cluster-wise conditional distributions in clustered (grouped) data. We adapt the mixture-of-experts model to the latent distributions, and propose a model in which each cluster-wise density is represented as a mixture of latent experts with cluster-wise mixing proportions distributed as Dirichlet distribution. The model parameters are estimated by maximizing the marginal likelihood function using a newly developed Monte Carlo Expectation-Maximization algorithm. We also extend the model such that the distribution of cluster-wise mixing proportions depends on some cluster-level covariates. The finite sample performance of the proposed model is compared with some existing mixture modeling approaches as well as linear mixed model through the simulation studies. The proposed model is also illustrated with the posted land price…

Equations68

f_{i} (y ∣ \boldmath x) = k = 1 \sum K π_{ik} h_{ik} (y ∣ \boldmath x),

f_{i} (y ∣ \boldmath x) = k = 1 \sum K π_{ik} h_{ik} (y ∣ \boldmath x),

f_{i} (y ∣ \boldmath x) = k = 1 \sum K π_{ik} h_{k} (y ∣ \boldmath x),

f_{i} (y ∣ \boldmath x) = k = 1 \sum K π_{ik} h_{k} (y ∣ \boldmath x),

f_{i} (y ∣ \boldmath π_{i}, \boldmath x, \boldmath ϕ) = k = 1 \sum K π_{ik} h_{k} (y ∣ \boldmath x, \boldmath ϕ_{k}),

f_{i} (y ∣ \boldmath π_{i}, \boldmath x, \boldmath ϕ) = k = 1 \sum K π_{ik} h_{k} (y ∣ \boldmath x, \boldmath ϕ_{k}),

p({\text{\boldmath$\pi$}}_{i}|{\text{\boldmath$\alpha$}})=\frac{\Gamma\big{(}\sum_{k=1}^{K}{\alpha}_{k}\big{)}}{\prod_{k=1}^{K}\Gamma({\alpha}_{k})}\prod_{k=1}^{K}\pi_{ik}^{{\alpha}_{k}-1}

p({\text{\boldmath$\pi$}}_{i}|{\text{\boldmath$\alpha$}})=\frac{\Gamma\big{(}\sum_{k=1}^{K}{\alpha}_{k}\big{)}}{\prod_{k=1}^{K}\Gamma({\alpha}_{k})}\prod_{k=1}^{K}\pi_{ik}^{{\alpha}_{k}-1}

f_{i} (y ∣ \boldmath x, \boldmath α, \boldmath ϕ) = k = 1 \sum K p_{k} h_{k} (y ∣ \boldmath x, \boldmath ϕ_{k}), p_{k} = \frac{α _{k}}{\sum _{ℓ = 1}^{K} α _{ℓ}},

f_{i} (y ∣ \boldmath x, \boldmath α, \boldmath ϕ) = k = 1 \sum K p_{k} h_{k} (y ∣ \boldmath x, \boldmath ϕ_{k}), p_{k} = \frac{α _{k}}{\sum _{ℓ = 1}^{K} α _{ℓ}},

\tilde{f}_{i} (y ∣ \boldmath x, \boldmath α, \boldmath ϕ) = k = 1 \sum K E [π_{ik} ∣ Y_{i}] h_{k} (y ∣ \boldmath x, \boldmath ϕ_{k}) .

\tilde{f}_{i} (y ∣ \boldmath x, \boldmath α, \boldmath ϕ) = k = 1 \sum K E [π_{ik} ∣ Y_{i}] h_{k} (y ∣ \boldmath x, \boldmath ϕ_{k}) .

1st stage: 2nd stage: 3rd stage: y_{ij} ∣ \boldmath x_{ij}, (z_{ij} = k) \sim F_{k} (\boldmath x_{ij}, \boldmath ϕ_{k}), z_{ij} ∣ \boldmath π_{i} \sim Cat (K, \boldmath π_{i}), \boldmath π_{i} \sim Dir (\boldmath α),

1st stage: 2nd stage: 3rd stage: y_{ij} ∣ \boldmath x_{ij}, (z_{ij} = k) \sim F_{k} (\boldmath x_{ij}, \boldmath ϕ_{k}), z_{ij} ∣ \boldmath π_{i} \sim Cat (K, \boldmath π_{i}), \boldmath π_{i} \sim Dir (\boldmath α),

L (\boldmath θ) = (\frac{Γ ( \sum _{k = 1}^{K} α _{k} )}{\prod _{k = 1}^{K} Γ ( α _{k} )})^{m} i = 1 \prod m \boldmath z_{i} \sum \frac{\prod _{k = 1}^{K} Γ ( \sum _{j = 1}^{n_{i}} w _{ij k} + α _{k} )}{Γ ( n _{i} + \sum _{k = 1}^{K} α _{k} )} (j = 1 \prod n_{i} k = 1 \prod K h_{k} (y_{ij} ∣ \boldmath x_{ij}, \boldmath ϕ_{k})^{w_{ij k}}),

L (\boldmath θ) = (\frac{Γ ( \sum _{k = 1}^{K} α _{k} )}{\prod _{k = 1}^{K} Γ ( α _{k} )})^{m} i = 1 \prod m \boldmath z_{i} \sum \frac{\prod _{k = 1}^{K} Γ ( \sum _{j = 1}^{n_{i}} w _{ij k} + α _{k} )}{Γ ( n _{i} + \sum _{k = 1}^{K} α _{k} )} (j = 1 \prod n_{i} k = 1 \prod K h_{k} (y_{ij} ∣ \boldmath x_{ij}, \boldmath ϕ_{k})^{w_{ij k}}),

\ell^{c}({\text{\boldmath$\theta$}},{\text{\boldmath$z$}},{\text{\boldmath$\pi$}})=\sum_{i=1}^{m}\sum_{j=1}^{n_{i}}\sum_{k=1}^{K}I(z_{ij}=k)\log\big{\{}\pi_{ik}h_{k}(y_{ij}|{\text{\boldmath$x$}}_{ij},{\text{\boldmath$\phi$}}_{k})\big{\}}+\sum_{i=1}^{m}\log p({\text{\boldmath$\pi$}}_{i}|{\text{\boldmath$\alpha$}}),

\ell^{c}({\text{\boldmath$\theta$}},{\text{\boldmath$z$}},{\text{\boldmath$\pi$}})=\sum_{i=1}^{m}\sum_{j=1}^{n_{i}}\sum_{k=1}^{K}I(z_{ij}=k)\log\big{\{}\pi_{ik}h_{k}(y_{ij}|{\text{\boldmath$x$}}_{ij},{\text{\boldmath$\phi$}}_{k})\big{\}}+\sum_{i=1}^{m}\log p({\text{\boldmath$\pi$}}_{i}|{\text{\boldmath$\alpha$}}),

Q (\boldmath θ ∣ \boldmath θ^{(t)}) = E [ℓ^{c} (\boldmath θ, \boldmath z, \boldmath π) ∣ Y, \boldmath θ^{(t)}],

Q (\boldmath θ ∣ \boldmath θ^{(t)}) = E [ℓ^{c} (\boldmath θ, \boldmath z, \boldmath π) ∣ Y, \boldmath θ^{(t)}],

Q (\boldmath θ ∣ \boldmath θ^{(l)}) \approx \frac{1}{L} l = 1 \sum L ℓ^{c} (\boldmath θ, \boldmath z^{(l)}, \boldmath π^{(l)}),

Q (\boldmath θ ∣ \boldmath θ^{(l)}) \approx \frac{1}{L} l = 1 \sum L ℓ^{c} (\boldmath θ, \boldmath z^{(l)}, \boldmath π^{(l)}),

z_{ij} ∣ \boldmath π_{i}, Y \sim Cat (K, \boldmath p_{ij}), i = 1, \dots, m, j = 1, \dots, n_{i}, \boldmath π_{i} ∣ \boldmath z, Y \sim Dir (\boldmath a_{i}), i = 1, \dots, m,

z_{ij} ∣ \boldmath π_{i}, Y \sim Cat (K, \boldmath p_{ij}), i = 1, \dots, m, j = 1, \dots, n_{i}, \boldmath π_{i} ∣ \boldmath z, Y \sim Dir (\boldmath a_{i}), i = 1, \dots, m,

p_{ij k} = \frac{π _{ik} h _{k} ( y _{ij} ∣ \boldmath x _{ij} , \boldmath ϕ _{k}^{(t)} )}{\sum _{l = 1}^{K} π _{i l} h _{l} ( y _{ij} ∣ \boldmath x _{ij} , \boldmath ϕ _{l}^{(t)} )}, and a_{ik} = α_{k}^{(t)} + j = 1 \sum n_{i} I (z_{ij} = k) .

p_{ij k} = \frac{π _{ik} h _{k} ( y _{ij} ∣ \boldmath x _{ij} , \boldmath ϕ _{k}^{(t)} )}{\sum _{l = 1}^{K} π _{i l} h _{l} ( y _{ij} ∣ \boldmath x _{ij} , \boldmath ϕ _{l}^{(t)} )}, and a_{ik} = α_{k}^{(t)} + j = 1 \sum n_{i} I (z_{ij} = k) .

Q (\boldmath θ ∣ \boldmath θ^{(t)}) = C + i = 1 \sum m j = 1 \sum n_{i} k = 1 \sum K z_{ij k}^{*} lo g h_{k} (y_{ij} ∣ \boldmath x_{ij}, \boldmath ϕ_{k}) + i = 1 \sum m E [lo g p (\boldmath π_{i} ∣ \boldmath α) ∣ Y, \boldmath θ^{(t)}],

Q (\boldmath θ ∣ \boldmath θ^{(t)}) = C + i = 1 \sum m j = 1 \sum n_{i} k = 1 \sum K z_{ij k}^{*} lo g h_{k} (y_{ij} ∣ \boldmath x_{ij}, \boldmath ϕ_{k}) + i = 1 \sum m E [lo g p (\boldmath π_{i} ∣ \boldmath α) ∣ Y, \boldmath θ^{(t)}],

\begin{split}&{\widehat{\text{\boldmath$\phi$}}}_{k}=\underset{{\text{\boldmath$\phi$}}_{k}}{\rm argmax}\sum_{i=1}^{m}\sum_{j=1}^{n_{i}}z_{ijk}^{\ast}\log h_{k}(y_{ij}|{\text{\boldmath$x$}}_{ij},{\text{\boldmath$\phi$}}_{k}),\ \ \ \ k=1,\ldots,K,\\ &{\widehat{\text{\boldmath$\alpha$}}}=\underset{{\text{\boldmath$\alpha$}}}{\rm argmax}\left\{m\log\Gamma\Big{(}\sum_{k=1}^{K}{\alpha}_{k}\Big{)}-m\sum_{k=1}^{K}\log\Gamma({\alpha}_{k})+\sum_{k=1}^{K}{\alpha}_{k}\sum_{i=1}^{m}(\log\pi_{ik})^{\ast}\right\},\end{split}

\begin{split}&{\widehat{\text{\boldmath$\phi$}}}_{k}=\underset{{\text{\boldmath$\phi$}}_{k}}{\rm argmax}\sum_{i=1}^{m}\sum_{j=1}^{n_{i}}z_{ijk}^{\ast}\log h_{k}(y_{ij}|{\text{\boldmath$x$}}_{ij},{\text{\boldmath$\phi$}}_{k}),\ \ \ \ k=1,\ldots,K,\\ &{\widehat{\text{\boldmath$\alpha$}}}=\underset{{\text{\boldmath$\alpha$}}}{\rm argmax}\left\{m\log\Gamma\Big{(}\sum_{k=1}^{K}{\alpha}_{k}\Big{)}-m\sum_{k=1}^{K}\log\Gamma({\alpha}_{k})+\sum_{k=1}^{K}{\alpha}_{k}\sum_{i=1}^{m}(\log\pi_{ik})^{\ast}\right\},\end{split}

\displaystyle{\widehat{\text{\boldmath$\beta$}}}_{k}=\bigg{(}\sum_{i=1}^{m}\sum_{j=1}^{n_{i}}z_{ijk}^{\ast}{\text{\boldmath$x$}}_{ij}{\text{\boldmath$x$}}_{ij}^{t}\bigg{)}^{-1}\sum_{i=1}^{m}\sum_{j=1}^{n_{i}}z_{ijk}^{\ast}{\text{\boldmath$x$}}_{ij}y_{ij},

\displaystyle{\widehat{\text{\boldmath$\beta$}}}_{k}=\bigg{(}\sum_{i=1}^{m}\sum_{j=1}^{n_{i}}z_{ijk}^{\ast}{\text{\boldmath$x$}}_{ij}{\text{\boldmath$x$}}_{ij}^{t}\bigg{)}^{-1}\sum_{i=1}^{m}\sum_{j=1}^{n_{i}}z_{ijk}^{\ast}{\text{\boldmath$x$}}_{ij}y_{ij},

\displaystyle{\widehat{\sigma}}^{2}_{k}=\bigg{(}\sum_{i=1}^{m}\sum_{j=1}^{n_{i}}z_{ijk}^{\ast}\bigg{)}^{-1}\sum_{i=1}^{m}\sum_{j=1}^{n_{i}}z_{ijk}^{\ast}(y_{ij}-{\text{\boldmath$x$}}_{ij}^{t}{\widehat{\text{\boldmath$\beta$}}}_{k})^{2}.

AIC

AIC

BIC

f_{i}^{m} (\boldmath y_{i} ∣ \boldmath x_{i}, \boldmath θ) = \int {j = 1 \prod n_{i} k = 1 \sum K π_{ik} h_{k} (y_{ij} ∣ \boldmath x_{ij}, \boldmath ϕ_{k})} p (\boldmath π_{i} ∣ \boldmath α) d \boldmath π_{i}

f_{i}^{m} (\boldmath y_{i} ∣ \boldmath x_{i}, \boldmath θ) = \int {j = 1 \prod n_{i} k = 1 \sum K π_{ik} h_{k} (y_{ij} ∣ \boldmath x_{ij}, \boldmath ϕ_{k})} p (\boldmath π_{i} ∣ \boldmath α) d \boldmath π_{i}

f_{i}^{m} (y_{i} ∣ \boldmath x_{i}, \boldmath θ) \approx \frac{1}{B} b = 1 \sum B {j = 1 \prod n_{i} k = 1 \sum K π_{ik}^{* (b)} h_{k} (y_{ij} ∣ \boldmath x_{ij}, \boldmath ϕ_{k})},

f_{i}^{m} (y_{i} ∣ \boldmath x_{i}, \boldmath θ) \approx \frac{1}{B} b = 1 \sum B {j = 1 \prod n_{i} k = 1 \sum K π_{ik}^{* (b)} h_{k} (y_{ij} ∣ \boldmath x_{ij}, \boldmath ϕ_{k})},

f_{i} (y ∣ \boldmath x) = k = 1 \sum K^{*} π_{ik} h_{k} (y ∣ \boldmath x, \boldmath ϕ_{k}),

f_{i} (y ∣ \boldmath x) = k = 1 \sum K^{*} π_{ik} h_{k} (y ∣ \boldmath x, \boldmath ϕ_{k}),

\boldmath π_{i} \sim Dir (\boldmath α_{i}), \boldmath α_{i} = (α_{i 1}, \dots, α_{i K})^{t}, α_{ik} = exp (\boldmath w_{i}^{t} \boldmath γ_{k}),

\boldmath π_{i} \sim Dir (\boldmath α_{i}), \boldmath α_{i} = (α_{i 1}, \dots, α_{i K})^{t}, α_{ik} = exp (\boldmath w_{i}^{t} \boldmath γ_{k}),

E [π_{ik}] = \frac{exp ( \boldmath w _{i}^{t} \boldmath γ _{k} )}{\sum _{k = 1}^{K} exp ( \boldmath w _{i}^{t} \boldmath γ _{k} )} .

E [π_{ik}] = \frac{exp ( \boldmath w _{i}^{t} \boldmath γ _{k} )}{\sum _{k = 1}^{K} exp ( \boldmath w _{i}^{t} \boldmath γ _{k} )} .

a_{ik} = exp (\boldmath w_{i}^{t} \boldmath γ_{k}^{(t)}) + j = 1 \sum n_{i} I (z_{ij} = k),

a_{ik} = exp (\boldmath w_{i}^{t} \boldmath γ_{k}^{(t)}) + j = 1 \sum n_{i} I (z_{ij} = k),

Q (\boldmath γ)

Q (\boldmath γ)

+ i = 1 \sum m k = 1 \sum K exp (\boldmath w_{i}^{t} \boldmath γ_{k}) (lo g π_{ik})^{*},

f_{i} (y ∣ x) = π_{i} ϕ (y; - 1 + x, 1) + (1 - π_{i}) ϕ (y; 1 - x, 1), π_{i} \sim Beta (5, 3),

f_{i} (y ∣ x) = π_{i} ϕ (y; - 1 + x, 1) + (1 - π_{i}) ϕ (y; 1 - x, 1), π_{i} \sim Beta (5, 3),

f_{i} (y ∣ x) = I (1 \leq i \leq 15) ϕ (y; - 1 + 2 x, 0. 5^{2}) + I (16 \leq i \leq 30) ϕ (y; 1.5 + x, 1)

+ I (31 \leq i \leq 50) ϕ (y; - x, 1. 5^{2}),

MISE_{i} = \frac{1}{R} r = 1 \sum R \int {f_{i}^{(r)} (t ∣ x) - f_{i} (t ∣ x)}^{2} d t, i = 1, \dots, m,

MISE_{i} = \frac{1}{R} r = 1 \sum R \int {f_{i}^{(r)} (t ∣ x) - f_{i} (t ∣ x)}^{2} d t, i = 1, \dots, m,

f_{i} (y ∣ x) = π_{i} ϕ (y; - 1 + x, 1) + (1 - π_{i}) ϕ (y; 1 - x, 1), π_{i} \sim Beta (α_{i 1}, α_{i 1}),

f_{i} (y ∣ x) = π_{i} ϕ (y; - 1 + x, 1) + (1 - π_{i}) ϕ (y; 1 - x, 1), π_{i} \sim Beta (α_{i 1}, α_{i 1}),

α_{i 1} = exp (1 + 0.6 w_{i}), α_{i 2} = exp (1 - 0.5 w_{i}), w_{i} \sim Ber (0.4) .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsBayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference · Statistical Methods and Inference

Full text

Latent Mixture Modeling for Clustered Data111This version:

SHONOSUKE SUGASAWA

*Risk Analysis Research Center, The Institute of Statistical Mathematics

*GENYA KOBAYASHI

*Graduate School of Social Sciences, Chiba University

*YUKI KAWAKUBO

Graduate School of Social Sciences, Chiba University

Abstract. This article proposes a mixture modeling approach to estimating cluster-wise conditional distributions in clustered (grouped) data. We adapt the mixture-of-experts model to the latent distributions, and propose a model in which each cluster-wise density is represented as a mixture of latent experts with cluster-wise mixing proportions distributed as Dirichlet distribution. The model parameters are estimated by maximizing the marginal likelihood function using a newly developed Monte Carlo Expectation-Maximization algorithm. We also extend the model such that the distribution of cluster-wise mixing proportions depends on some cluster-level covariates. The finite sample performance of the proposed model is compared with some existing mixture modeling approaches as well as linear mixed model through the simulation studies. The proposed model is also illustrated with the posted land price data in Japan.

Key words: conditional distribution; Monte Carlo EM algorithm; hierarchical model; mixture modeling; random effect

Introduction

Grouped or clustered data often arise in many scientific fields such as econometrics, epidemiology, and genetics. Although the mixed-effects model (Demidenko, 2004) has been widely used for such data, it fundamentally aims at modeling conditional means in each cluster, which could be inappropriate if the data distribution is skewed or multimodal. As an alternative modeling strategy, the finite mixture model (McLachlan and Peel, 2000) has been extensively applied for its flexibility to capture the within-cluster heterogeneity in the data. For modeling independent data, the mixture model with covariates was originally proposed in Jacob et al. (1991), known as mixture-of-experts. To date, a large body of literature has been concerned with flexible modeling of the conditional density for independent data. For example, see Jordan and Jacobs (1994), Hurn et al. (2003), Geweke and Keane (2007), Villani et al. (2009), Villani et al. (2012) and Nguyen and McLachlan (2016).

However, the existing models for independent data are not suitable for estimating cluster-wise conditional distributions. If we globally apply the mixture models to a whole dataset ignoring the clustering labels (we call global mixture modeling), the estimated conditional distributions are the same over all clusters, which is clearly inappropriate in clustered data analysis. On the other hand, applying the mixture models independently to each cluster in order to capture the cluster heterogeneity (we call local mixture modeling) leads to unstable results since the within-cluster samples sizes are usually not large in practice. Hence, another flexible modeling strategy for clustered data is desired. Up to now, several methods have been proposed for modeling cluster-wise distributions. Rubin and Wu (1997) proposed a mixture of linear mixed-effects models. Sun et al. (2007) developed a mixture of linear models with the random effects used in the generalized linear model for the mixing proportions. Rosen et al. (2000) and Tang and Qu (2016) used the generalized estimating equation approach to estimate the component distributions by incorporating the correlations within clusters.

In this article, we propose a compromised model between the global and local mixture modeling. Note that the local mixture model can be expressed as

[TABLE]

where $y$ is the response variable, $x$ is the vector of covariates, and $h_{ik}$ is the component distribution for the $k$ th component of the $i$ th cluster with the mixing proportion $\pi_{ik}$ satisfying $\sum_{k=1}^{K}\pi_{ik}=1$ . Since the within-cluster sample size is usually small in practice, $h_{ik}(y|{\text{\boldmath$ x $}})$ would not be stably estimated. Hence, we restrict $h_{ik}(y|{\text{\boldmath$ x $}})=h_{k}(y|{\text{\boldmath$ x $}})$ , that is, the component distributions are the same over all the clusters like global modeling. Then the model reduces to

[TABLE]

which can be interpreted as there exists $K$ latent distributions and each cluster-wise distribution $f_{i}(y|{\text{\boldmath$ x $}})$ is expressed by these distributions with cluster-wise mixing proportions $\pi_{ik}$ . Hence, as long as $K$ is a moderate number, one can estimate $K$ component distributions with reasonable accuracy. On the other hand, estimating unstructured $\pi_{ik}$ is not feasible since the number of $\pi_{ik}$ ’s grows as the number of clusters increases. To overcome this difficulty, we assume that the vector of proportions ${\text{\boldmath$ \pi $}}_{i}=(\pi_{i1},\ldots,\pi_{iK})^{t}$ that characterizes the conditional distribution of the $i$ th cluster, is a realization from a multivariate distribution. Therefore, ${\text{\boldmath$ \pi $}}_{i}$ plays a similar role to the random effect in the context of the mixed-effects model. As a distribution of ${\text{\boldmath$ \pi $}}_{i}$ , we use the Dirichlet distribution, which allows us to develop a tractable estimating method for model parameters.

In this article, the model parameters are estimated based on a likelihood-based approach. The model can be viewed as a three-stage hierarchical model, where the first stage consists of the model for the response variable, the second stage consists of the latent variables which assign the latent distribution, and the third stage consists of the model for the mixing proportions. We develop a Monte Carlo Expectation-Maximization (MCEM) algorithm (Dempster et al., 1977; Wei and Tanner, 1990) for parameter estimation of which the E-step is consist of a simple Gibbs sampling scheme for imputing the latent variables. Since the number of latent distributions $K$ is generally unknown, we consider selecting $K$ based on the Akaike information criteria (AIC) or Bayesian information criteria (BIC), where the maximum log-marginal likelihood can be easily computed from a simple Monte Carlo approximation.

The rest of the paper is organized as follows: Section 2 describes the proposed model in detail and develops the MCEM algorithm for maximizing the marginal likelihood. In Section 3, the performance of the proposed method is demonstrated along with some existing methods through simulation studies. An application to the real data set is also presented. In Section 4, some discussion is provided.

Latent Mixture Model

Model setup

Suppose that we have the clustered (grouped) observations $y_{ij}$ , $i=1,\dots,m$ , $j=1,\dots,n_{i}$ , with an associated $p$ -dimensional vector of covariates ${\text{\boldmath$ x $}}_{ij}$ . Let $f_{i}(y|{\text{\boldmath$ x $}})$ be a density or probability mass function of $y_{ij}$ given ${\text{\boldmath$ x $}}_{ij}$ , which are the same within clusters but different across clusters. Our aim is to estimate the cluster-wise conditional density $f_{i}(y|{\text{\boldmath$ x $}})$ from the data set $\{y_{ij},{\text{\boldmath$ x $}}_{ij}\}$ . To this end, we consider the following latent mixture model:

[TABLE]

where $\pi_{ik}$ is the weight for the $k$ th component in the $i$ th cluster, $h_{k}(\cdot|\cdot,{\text{\boldmath$ \phi $}}_{k}),\ k=1,\ldots,K$ are the latent conditional densities characterized by the parameter ${\text{\boldmath$ \phi $}}_{k}$ , and $K$ is the unknown number of latent densities. Moreover, we assume that the mixing proportions ${\text{\boldmath$ \pi $}}_{i}$ ’s are independent realizations from the Dirichlet distribution with the density

[TABLE]

for $i=1,\ldots,m$ , where $\Gamma(\cdot)$ denotes the gamma function and ${\text{\boldmath$ \alpha $}}=({\alpha}_{1},\ldots,{\alpha}_{K})^{t}$ is a vector of unknown parameters. In this article, we let (1) and (2) together denote the latent mixture model. The unknown model parameters to be estimated are ${\text{\boldmath$ \phi $}}_{1},\ldots,{\text{\boldmath$ \phi $}}_{K}$ in latent distributions and $\alpha$ in the Dirichlet distribution. Under the setting (1) and (2), taking expectation of $\pi_{ik}$ with respect to Dir $({\text{\boldmath$ \alpha $}})$ , we have

[TABLE]

which is referred to the marginal model, and is common over all the clusters. Hence, we can observe that ${\text{\boldmath$ \pi $}}_{i}$ characterizes the cluster-wise conditional density and plays a similar role to the random effects in the context of mixed-effects models. The mixing proportion ${\text{\boldmath$ \pi $}}_{i}$ can be estimated by the conditional expectation ${\rm E}[{\text{\boldmath$ \pi $}}_{i}|Y]$ , where $Y$ is a set of all the response variables. Under (1) and (2), response variables in different clusters are mutually independent, so that it holds ${\rm E}[{\text{\boldmath$ \pi $}}_{i}|Y]={\rm E}[{\text{\boldmath$ \pi $}}_{i}|Y_{i}]$ with $Y_{i}=\{y_{i1},\ldots,y_{in_{i}}\}$ . Then, if the model parameters are known, the estimator of the cluster-wise conditional density is given by

[TABLE]

Generally speaking, the conditional expectation ${\rm E}[\pi_{ik}|Y_{i}]$ tends close to the marginal mean $p_{k}$ if the cluster-specific sample size $n_{i}$ is small, so that the estimated conditional density would be close to the marginal model (3). On the other hand, in clusters with relatively large $n_{i}$ , the estimated conditional density might vary from the marginal model (3), depending on the information of $Y_{i}$ . Therefore, this model allows us to carry out a kind of shrinkage estimation of the cluster-wise conditional densities.

As often done in estimating mixture models, by introducing the latent component indicator $z_{ij}\in\{1,\dots,K\}$ , the proposed model (1) and (2) can be expressed in the three-stage hierarchical model:

[TABLE]

where $F_{k}$ is the distribution having density $h_{k}$ , and $\text{Cat}(K,{\text{\boldmath$ \pi $}}_{i})$ is the categorical distribution on $\{1,\ldots,K\}$ with the probability vector ${\text{\boldmath$ \pi $}}_{i}$ . In hierarchy (5), ${\text{\boldmath$ z $}}_{ij}$ and ${\text{\boldmath$ \pi $}}_{i}$ are the latent variables. The latent density $h_{k}$ is determined by the user and the generalized linear model is an attractive choice. For example, $F_{k}({\text{\boldmath$ x $}}_{ij},{\text{\boldmath$ \phi $}}_{k})=N({\text{\boldmath$ x $}}_{ij}^{t}{\text{\boldmath$ \beta $}}_{k},\sigma_{k}^{2})$ when $y_{ij}$ is a continuous variable, and $F_{k}({\text{\boldmath$ x $}}_{ij},{\text{\boldmath$ \phi $}}_{k})=\text{Po}(\exp({\text{\boldmath$ x $}}^{t}{\text{\boldmath$ \beta $}}_{k}))$ when $y_{ij}$ is a counting variable.

Monte Carlo EM algorithm for parameter estimation

For completion of the conditional density (4), we need to estimate the unknown model parameters ${\text{\boldmath$ \theta $}}=\{{\text{\boldmath$ \phi $}}_{1},\ldots,{\text{\boldmath$ \phi $}}_{K},{\text{\boldmath$ \alpha $}}\}$ based on the data. Under the hierarchical formulation (5), the marginal likelihood function $L({\text{\boldmath$ \theta $}})$ is expressed as

[TABLE]

where $w_{ijk}=I(z_{ij}=k)$ and $\sum_{{\text{\boldmath$ z $}}_{i}}$ denotes the summation over the all combination of ${\text{\boldmath$ z $}}_{i}\in\{1,\ldots,K\}^{n_{i}}$ . Hence, a direct maximization of the marginal likelihood is not feasible since evaluation of the likelihood function $L({\text{\boldmath$ \theta $}})$ requires the summation over $K^{n_{i}}$ elements for each $i$ , which is computationally prohibitive even for small $K$ . Moreover, since the functional form of $L({\text{\boldmath$ \theta $}})$ is complex and not familiar, the brute force maximization of $L({\text{\boldmath$ \theta $}})$ is not realistic.

Instead, we exploit the hierarchical representation (5) and develop the EM algorithm (Dempster et al., 1977) which indirectly and iteratively maximizes $L({\text{\boldmath$ \theta $}})$ . Let ${\text{\boldmath$ \pi $}}=\left\{{\text{\boldmath$ \pi $}}_{1},\ldots,{\text{\boldmath$ \pi $}}_{m}\right\}$ and ${\text{\boldmath$ z $}}=\left\{{\text{\boldmath$ z $}}_{1},\ldots,{\text{\boldmath$ z $}}_{m}\right\}$ . Then, the complete log-likelihood function $\ell^{c}$ of (5) is given by

[TABLE]

where $p({\text{\boldmath$ \pi $}}_{i}|{\text{\boldmath$ \alpha $}})$ denotes the density function of ${\rm Dir}({\text{\boldmath$ \alpha $}})$ . Then, given the value of $\theta$ in the $t$ th iteration denoted by ${\text{\boldmath$ \theta $}}^{(t)}$ , the E-step entails the imputation of the latent variables $z$ and $\pi$ by taking expectation

[TABLE]

where the expectation is taken with respect to the posterior distribution of $({\text{\boldmath$ w $}},{\text{\boldmath$ \pi $}})$ given all the response variables $Y$ . However, since an analytical form of $Q({\text{\boldmath$ \theta $}}|{\text{\boldmath$ \theta $}}^{(t)})$ is not available, we consider Monte Carlo approximation of $Q({\text{\boldmath$ \theta $}}|{\text{\boldmath$ \theta $}}^{(t)})$ as

[TABLE]

where $L$ is a sufficiently large number, and ${\text{\boldmath$ z $}}^{(l)}$ and ${\text{\boldmath$ \pi $}}^{(l)}$ are the $l$ th random sample generated from the posterior distribution of $({\text{\boldmath$ z $}},{\text{\boldmath$ \pi $}})$ given $Y$ with ${\text{\boldmath$ \theta $}}={\text{\boldmath$ \theta $}}^{(t)}$ . Under the hierarchy (5), the marginal posterior distributions of $z$ and $\pi$ are not simple forms, but the full conditional distributions of ${\text{\boldmath$ z $}}|{\text{\boldmath$ \pi $}},Y$ and ${\text{\boldmath$ \pi $}}|{\text{\boldmath$ z $}},Y$ are the following familiar distributions:

[TABLE]

where $\widetilde{{\text{\boldmath$ p $}}}_{ij}=(\widetilde{p}_{ij1},\ldots,\widetilde{p}_{ijK})^{t}$ and $\widetilde{{\text{\boldmath$ a $}}}_{i}=(\widetilde{a}_{i1},\ldots,\widetilde{a}_{iK})^{t}$ with

[TABLE]

Then we can use a Gibbs sampler for generating random samples of the posterior distribution of $({\text{\boldmath$ z $}},{\text{\boldmath$ \pi $}})$ .

The M-step maximizes $Q({\text{\boldmath$ \theta $}}|{\text{\boldmath$ \theta $}}^{(t)})$ obtained from the E-step, noting that

[TABLE]

where $C$ is a constant independent of $\theta$ and $z_{ijk}^{\ast}={\rm E}[I(z_{ij}=k)|Y,{\text{\boldmath$ \theta $}}^{(t)}]$ computed from the E-step. Therefore, the maximization problem of $Q({\text{\boldmath$ \theta $}}|{\text{\boldmath$ \theta $}}^{(t)})$ can be divided into the following:

[TABLE]

where $(\log\pi_{ik})^{\ast}={\rm E}[\log\pi_{ik}|Y,{\text{\boldmath$ \theta $}}^{(t)}]$ . It is noted that the maximization with respect to each ${\text{\boldmath$ \phi $}}_{k}$ is identical to maximizing the weighted log-likelihood function of the latent conditional distributions, which can be easily carried out by using, for example, the Newton-Raphson algorithm. Similarly, the maximization with respect to $\alpha$ is similar to performing the maximum likelihood method in the Dirichlet distribution and is not difficult.

The whole procedure of the proposed MCEM algorithm is summarized as follows.

Algorithm 1 (MCEM algorithm).

Iterative:

Set the initial values ${\text{\boldmath$ \theta $}}^{(0)}$ and $t=0$ .

2.

Draw a large number of samples $\pi$ and $z$ by Gibbs sampling with the full conditionals (6), and compute $z_{ijk}^{\ast}={\rm E}[I(z_{ij}=k)|Y,{\text{\boldmath$ \theta $}}^{(t)}]$ and $(\log\pi_{ik})^{\ast}={\rm E}[\log\pi_{ik}|Y,{\text{\boldmath$ \theta $}}^{(t)}]$ .

3.

Solve the maximization problem (7) and set ${\text{\boldmath$ \phi $}}_{k}^{(t+1)}={\widehat{\text{\boldmath$ \phi $}}}_{k}$ and ${\text{\boldmath$ \alpha $}}^{(t+1)}={\widehat{\text{\boldmath$ \alpha $}}}$ .

4.

If the algorithm has converged, the the algorithm is terminated. Otherwise, set $t=t+1$ and go back to Step 2.

In the case of the normal linear regression model as the latent model, namely $F_{k}({\text{\boldmath$ x $}}_{ij},{\text{\boldmath$ \phi $}}_{k})=N({\text{\boldmath$ x $}}_{ij}^{t}{\text{\boldmath$ \beta $}}_{k},\sigma^{2}_{k})$ in (5), the M-step for ${\text{\boldmath$ \phi $}}_{k}=({\text{\boldmath$ \beta $}}_{k}^{t},\sigma_{k}^{2})^{t}$ in (7) can be obtained analytically:

[TABLE]

for $k=1,\ldots,K$ .

Following Shi and Copas (2002), the convergence of the proposed MCEM algorithm is monitored by using the batch mean $\widetilde{{\text{\boldmath$ \theta $}}}^{(t)}=H^{-1}\sum_{h=0}^{H-1}{\text{\boldmath$ \theta $}}^{(t-h)}$ , after the $H$ th iteration. The algorithm is terminated when the relative difference $\|\widetilde{{\text{\boldmath$ \theta $}}}^{(t)}-\widetilde{{\text{\boldmath$ \theta $}}}^{(t-d)}\|/(\|\widetilde{{\text{\boldmath$ \theta $}}}^{(t-d)}\|+\delta)$ , is smaller than some predetermined (small) ${\varepsilon}$ . Here, $H$ , $d$ , ${\varepsilon}$ and $\delta$ are specified by the user, and we use $H=30$ , $d=5$ , ${\varepsilon}=\delta=0.001$ as default choices. For the E-step, $L=500$ is used as the default choice and this choice appears to work well in the numerical examples in Section 3.

For selecting the number of latent distributions, $K$ , we use the Akaike information criteria (AIC) or the Bayesian information criteria (BIC) based on the log-marginal likelihood, without any theoretical justifications. When ${\text{\boldmath$ \phi $}}_{k}$ is $p$ -dimensional, the number of parameters included in the model (5) is $pK+K$ . Then the formulations of AIC and BIC are given by

[TABLE]

where $N=\sum_{i=1}^{m}n_{i}$ is the total number of observations and

[TABLE]

is the maximum marginal likelihood. As noted in Section 2.2, since the direct evaluation of the marginal likelihood is computationally prohibitive, the maximum marginal likelihood is evaluated by the Monte Carlo integration. Let ${\text{\boldmath$ \pi $}}_{i}^{*}=(\pi_{i1}^{*},\dots,\pi_{iG}^{*})^{t}$ be the random vector generated from ${\rm Dir}({\widehat{\text{\boldmath$ \alpha $}}})$ . Then, the Monte Carlo approximation of (8) is

[TABLE]

for a large $B$ , where $(\pi_{i1}^{*(b)},\dots,\pi_{iK}^{*(b)})^{t}$ is the $b$ th draw from ${\rm Dir}({\widehat{\text{\boldmath$ \alpha $}}})$ .

Let $K^{\ast}$ be the selected number of latent distributions based on AIC or BIC. Then the feasible version of the cluster-wise estimated conditional density (4) is given by

[TABLE]

where $\widehat{\pi}_{ik}={\rm E}[\pi_{ik}|Y_{i}]$ evaluated at ${\text{\boldmath$ \theta $}}={\widehat{\text{\boldmath$ \theta $}}}$ , which can be computed via the Gibbs sampler (6) with ${\text{\boldmath$ \theta $}}={\widehat{\text{\boldmath$ \theta $}}}$ .

Flexible modeling of mixing proportions

One possible criticism for the formulation of the proposed latent mixture model (1) is its simplicity in the relationship between the response variable $y$ and covariate vector $x$ . In the context of mixture modeling for non-clustered (independent) data, Geweke and Keane (2007) proposed a flexible modeling of the mixing proportions by considering covariate dependent structures. Then, we here consider implementing the idea to the modeling cluster-wise conditional densities, that is, we consider the following structure in the distribution of the mixing proportions:

[TABLE]

where ${\text{\boldmath$ w $}}_{i}$ is the $q$ -dimensional vector of the cluster-specific covariates and ${\text{\boldmath$ \gamma $}}_{k}$ is the corresponding coefficient. One can take, for example, ${\text{\boldmath$ w $}}_{i}=\bar{{\text{\boldmath$ x $}}}_{i}^{(s)}$ where $\bar{{\text{\boldmath$ x $}}}_{i}^{(s)}=n_{i}^{-1}\sum_{j=1}^{n_{i}}{\text{\boldmath$ x $}}_{ij}^{(s)}$ and ${\text{\boldmath$ x $}}_{ij}^{(s)}$ is the subvector of ${\text{\boldmath$ x $}}_{ij}$ . Under this setting, it hods that

[TABLE]

the MCEM algorithm developed in Section 2.2 can be easily modified to estimate the model with (9). Specifically, in the E-step $\widetilde{a}_{ik}$ appeared in the full conditional distribution of ${\text{\boldmath$ \pi $}}_{i}|{\text{\boldmath$ w $}},Y$ in (6) is replaced with

[TABLE]

and the M-step for $\alpha$ in (7) is replaced with the maximizing

[TABLE]

where ${\text{\boldmath$ \gamma $}}=\{{\text{\boldmath$ \gamma $}}_{1},\ldots,{\text{\boldmath$ \gamma $}}_{K}\}$ . Finally, it is noted that the number of parameters under (9) is $K(p+q)$ , so that the penalty terms in AIC and BIC used for selecting $K$ should be changed accordingly.

Numerical Studies

Simulation studies

The finite sample performance of the proposed latent mixture model is investigated together with some existing methods. We consider two cases of within-cluster sample sizes $n_{i}=30$ and $n_{i}=50$ for $i=1,\dots,m$ and $m=50$ . For the true conditional density in the $i$ th cluster, the following two scenarios are considered:

[TABLE]

where $i=1,\ldots,m$ , and $\phi(\cdot;a,b)$ denotes the density function of the normal distribution $N(a,b)$ and $x_{ij}\sim N(0,1)$ in each scenario. The latent mixture regression (LMR) model with normal linear regression models used as latent models is considered, and the number of latent components are selected by using BIC. For comparison, we also consider the local mixture (LM) model where the mixture of normal linear regressions is fitted to each cluster separately and global mixture (GM) model where the single mixture of normal linear regressions is fitted to the whole data ignoring the cluster heterogeneity. For both models, the number of components was selected based on BIC. Moreover, as the competitor from random effect models, we also applied a random intercept (RI) model. Note that GM ignores the clustering structure and produces the same conditional densities over all the clusters. On the other hand, while LM may flexibly express the cluster-wise conditional density, the results are expected to be unstable due to the relatively small within-cluster sample sizes.

The performance of the models is measured based on the cluster-wise mean integrated squared error (MISE) defined as

[TABLE]

where $\widehat{f}^{(r)}_{i}(t|x)$ is the estimated conditional density obtained from the $r$ th replication. Since the above MISE depends on the covariate $x$ , we considered the three values, $x=-1.5,-0.75,0$ . We computed the cluster-wise MISE of four models based on $R=100$ replications.

Figure 1 and 2 present the cluster-wise MISE for Scenario (I) and (II), respectively. The figures show that the proposed LMR model outperforms in all cases. As expected, LM appears to have produced the unstable results due to the relatively small sample sizes in spite of its flexibility. On the other hand, GM seems to perform relatively well in this study as the number of parameters is small compared with LM. However, since GM produces the same conditional density estimators over the clusters, GM performs no better than LMR. Concerning RC, it may perform as well as GM for $x=0$ in Scenario (I) and some cases in Scenario (II), but the result is much inferior to that of LMR. Although not shown here, BIC selected the true number of components most of the time, while the selected number of components by AIC tended to be larger than the truth. Hence, BIC would be preferable to AIC and only the results based on BIC are considered in the rest of this article.

We next investigate the efficacy of the modeling the distribution of the mixing proportion in terms of some covariates as introduced in Section 2.3. To this end, we consider the following true conditional density:

[TABLE]

We set $n_{i}=i$ for $i=1,\dots,m$ such that the within-cluster sample size varies across clusters and consider two cases of $m$ , $m=50$ and $80$ . As in the previous studies, the covariates $x_{ij}$ ’s are generated from $N(0,1)$ . The latent mixture regression model with covariate-dependent structure of mixing proportions (LMR-CD) and the latent mixture regression model (LMR) are fitted to the simulated data. For both models, we use the normal linear regression models as the component models, and the number of components is selected based on BIC. For comparison, we again computed the MISE with $x=-1.5,-0.75,0$ , and the results are presented in Figure 3. In the figure, LMR-CD appears to perform better than LMR for the clusters with the small within-cluster sample sizes for both $m$ .

Real data example

To demonstrate the proposed method in a practical situation, we apply the latent mixture model to the posted land price (PLP) data in Tokyo and the surrounding four prefectures (Chiba, Saitama, Kanagawa and Ibaraki) in 2001. The data units (locations) are clustered with respect to the nearest station. The number of clusters is $m=295$ and the total number of units is $N=2363$ . The number of within-cluster samples $n_{i}$ are ranging from $1$ to $45$ , and the histogram of $n_{i}$ is provided in the left panel in Figure 4. We note that there are $221$ clusters with $n_{i}$ smaller than $10$ and $25$ clusters with $n_{i}=1$ . The response variable $y_{ij}$ is the PLP which is measured in 100,000 yen per squared meter. In each $j$ th unit (location) in $i$ th cluster (station), $y_{ij}$ is observed with the floor area ratio (%) $F_{ij}$ and amount of time $A_{ij}$ (second) to station $i$ on foot. Moreover, as cluster level information, the amount of time $T_{i}$ from Tokyo station by train and the prefecture to which the station belongs are available. We use four dummy variables $D_{i1},D_{i2},D_{i3}$ , and $D_{i4}$ for Chiba, Saitama, Kanagawa, and Ibaraki, respectively, which take value one if the station $i$ belongs to the corresponding prefecture and zero otherwise. The values of $y_{ij}$ range from $0.158$ to $20.3$ . The right panel of Figure 4 shows that the histogram of $y_{ij}$ for $y_{ij}<8$ . Note that the number of samples with $y_{ij}\geq 8$ is only $20$ which is less than $1\%$ of the total number of observations. Using this dataset, the conditional density of the PLP for each station is estimated.

Let ${\text{\boldmath$ x $}}_{ij}=(1,F_{ij},A_{ij},T_{i},D_{i1},\ldots,D_{i4})^{t}$ . We consider the following latent mixture regression (LMR) model:

[TABLE]

where $\phi(\cdot;a,b)$ denotes the density function of $N(a,b)$ , and $T_{i}^{\ast}$ is the standardized version of $T_{i}$ . It is noted that the marginal model (3) is given by

[TABLE]

and the cluster-wise estimated density (4) is

[TABLE]

where $Y_{i}=\{y_{i1},\ldots,y_{in_{i}}\}$ and ${\rm E}[\pi_{ik}|Y_{i}]$ can be computed from the Gibbs sampling (6). Moreover, based on BIC, the number of latent components was selected to be $K=6$ from $\{1,\dots,8\}$ . We also doubled the number of Gibbs draws in the E-step, but the same result was obtained.

For comparison with the proposed method, we also applied the global mixture (GM) model with $K_{\ast}$ components:

[TABLE]

where $\sum_{k=1}^{K_{\ast}}p_{k}=1$ . It is expected that the estimated GM is similar to the marginal model in LMR. Based on BIC $K_{\ast}=5$ was selected.

To visualize the estimated conditional density in each cluster, we fixed the covariate vector $x$ at $(1,100,600,T_{i},D_{i1},\ldots,D_{i4})^{t}$ , in which $f_{i}(y|x)$ corresponds to the density function of the PLP of each cluster when the floor area ratio is $100$ and the location is $10$ minutes’ walk from the nearest station. Figure 5 presents the estimated density under LMR, the marginal model of LMR (mLMR), and GM for the stations with small $n_{i}$ . The figure shows that the cluster-wise estimated densities under LMR are close to those under the marginal model (11) when $n_{i}$ is small. This is because the small $n_{i}$ values leads to a small difference between the prior mean $p_{ik}$ and posterior mean ${\rm E}[\pi_{ik}|Y_{i}]$ of $\pi_{ik}$ , so that the estimated densities in such clusters are automatically close to those under the marginal model which can be stably estimated from the data. Figure 6 presents the estimated densities for the stations with relatively large $n_{i}$ . Contrary to Figure 5, the estimated densities under LMR are apart from the marginal model in some clusters. The result implies that the marginal model is adjusted by the observed data in these clusters. We finally point out that the marginal model of LMR and GM are similar in most cases since their modeling strategies are similar in the sense that they aim at estimating the global density by ignoring the clustering structure.

Conclusion and Discussion

We have proposed the latent mixture model for estimating the cluster-wise conditional distributions. The model parameters are estimated by using the simple Monte Carlo EM algorithm instead of the brute force maximization of the marginal likelihood. Through the simulation and empirical studies, the proposed method is found to be useful for flexible modeling of clustered data.

In this article, we selected the number of components by using AIC and BIC. However, it is well-recognized that the mixture model is a singular model and the use of AIC or BIC is not justified. The detailed investigation of selecting the number of latent components with theoretical validity would extend the scope of this article, which will be left as a valuable future work.

Acknowledgments. This work was supported by JSPS KAKENHI Grant Numbers [16H07406, 15K17036, 16K17101]. The computational results were obtained using Ox version 6.21 (Doornik, 2007).

Bibliography37

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1]
2[2] Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B , 61 , 265–285.
3[3]
4[4] Demidenko, E. (2004). Mixed Models: Theory and Applications , New York: Wiley.
5[5]
6[6] Dempster, A., Laird, N. and Rubin, D. (1977). Maximum Likelihood From Incomplete Data via the EM Algorithm (with discussion). Journal of the Royal Statistical Society: Series B , 39 , 1–38.
7[7]
8[8] Doornik, J. (2007). Ox: Object Oriented Matrix Programming , Timberlake Consultants Press: London.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

**Latent Mixture Modeling for Clustered Data111This version: **

Introduction

Latent Mixture Model

Model setup

Monte Carlo EM algorithm for parameter estimation

Algorithm 1** (MCEM algorithm).**

Flexible modeling of mixing proportions

Numerical Studies

Simulation studies

Real data example

Conclusion and Discussion

Latent Mixture Modeling for Clustered Data111This version:

Algorithm 1 (MCEM algorithm).