Latent Mixture Modeling for Clustered Data
Shonosuke Sugasawa, Genya Kobayashi, Yuki Kawakubo

TL;DR
This paper introduces a novel mixture modeling approach for clustered data, utilizing latent experts and Dirichlet-distributed mixing proportions, estimated via a Monte Carlo EM algorithm, with extensions for covariate dependence.
Contribution
It develops a new mixture-of-experts model for clustered data with a Monte Carlo EM estimation method and covariate-dependent mixing proportions.
Findings
The proposed model outperforms existing mixture models in simulations.
It effectively captures cluster-specific distributions.
Application to Japanese land price data demonstrates practical utility.
Abstract
This article proposes a mixture modeling approach to estimating cluster-wise conditional distributions in clustered (grouped) data. We adapt the mixture-of-experts model to the latent distributions, and propose a model in which each cluster-wise density is represented as a mixture of latent experts with cluster-wise mixing proportions distributed as Dirichlet distribution. The model parameters are estimated by maximizing the marginal likelihood function using a newly developed Monte Carlo Expectation-Maximization algorithm. We also extend the model such that the distribution of cluster-wise mixing proportions depends on some cluster-level covariates. The finite sample performance of the proposed model is compared with some existing mixture modeling approaches as well as linear mixed model through the simulation studies. The proposed model is also illustrated with the posted land price…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsBayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference · Statistical Methods and Inference
**Latent Mixture Modeling for Clustered Data111This version: **
SHONOSUKE SUGASAWA
*Risk Analysis Research Center, The Institute of Statistical Mathematics
*GENYA KOBAYASHI
*Graduate School of Social Sciences, Chiba University
*YUKI KAWAKUBO
Graduate School of Social Sciences, Chiba University
Abstract. This article proposes a mixture modeling approach to estimating cluster-wise conditional distributions in clustered (grouped) data. We adapt the mixture-of-experts model to the latent distributions, and propose a model in which each cluster-wise density is represented as a mixture of latent experts with cluster-wise mixing proportions distributed as Dirichlet distribution. The model parameters are estimated by maximizing the marginal likelihood function using a newly developed Monte Carlo Expectation-Maximization algorithm. We also extend the model such that the distribution of cluster-wise mixing proportions depends on some cluster-level covariates. The finite sample performance of the proposed model is compared with some existing mixture modeling approaches as well as linear mixed model through the simulation studies. The proposed model is also illustrated with the posted land price data in Japan.
Key words: conditional distribution; Monte Carlo EM algorithm; hierarchical model; mixture modeling; random effect
Introduction
Grouped or clustered data often arise in many scientific fields such as econometrics, epidemiology, and genetics. Although the mixed-effects model (Demidenko, 2004) has been widely used for such data, it fundamentally aims at modeling conditional means in each cluster, which could be inappropriate if the data distribution is skewed or multimodal. As an alternative modeling strategy, the finite mixture model (McLachlan and Peel, 2000) has been extensively applied for its flexibility to capture the within-cluster heterogeneity in the data. For modeling independent data, the mixture model with covariates was originally proposed in Jacob et al. (1991), known as mixture-of-experts. To date, a large body of literature has been concerned with flexible modeling of the conditional density for independent data. For example, see Jordan and Jacobs (1994), Hurn et al. (2003), Geweke and Keane (2007), Villani et al. (2009), Villani et al. (2012) and Nguyen and McLachlan (2016).
However, the existing models for independent data are not suitable for estimating cluster-wise conditional distributions. If we globally apply the mixture models to a whole dataset ignoring the clustering labels (we call global mixture modeling), the estimated conditional distributions are the same over all clusters, which is clearly inappropriate in clustered data analysis. On the other hand, applying the mixture models independently to each cluster in order to capture the cluster heterogeneity (we call local mixture modeling) leads to unstable results since the within-cluster samples sizes are usually not large in practice. Hence, another flexible modeling strategy for clustered data is desired. Up to now, several methods have been proposed for modeling cluster-wise distributions. Rubin and Wu (1997) proposed a mixture of linear mixed-effects models. Sun et al. (2007) developed a mixture of linear models with the random effects used in the generalized linear model for the mixing proportions. Rosen et al. (2000) and Tang and Qu (2016) used the generalized estimating equation approach to estimate the component distributions by incorporating the correlations within clusters.
In this article, we propose a compromised model between the global and local mixture modeling. Note that the local mixture model can be expressed as
[TABLE]
where is the response variable, is the vector of covariates, and is the component distribution for the th component of the th cluster with the mixing proportion satisfying . Since the within-cluster sample size is usually small in practice, h_{ik}(y|{\text{\boldmathx}}) would not be stably estimated. Hence, we restrict h_{ik}(y|{\text{\boldmathx}})=h_{k}(y|{\text{\boldmathx}}), that is, the component distributions are the same over all the clusters like global modeling. Then the model reduces to
[TABLE]
which can be interpreted as there exists latent distributions and each cluster-wise distribution f_{i}(y|{\text{\boldmathx}}) is expressed by these distributions with cluster-wise mixing proportions . Hence, as long as is a moderate number, one can estimate component distributions with reasonable accuracy. On the other hand, estimating unstructured is not feasible since the number of ’s grows as the number of clusters increases. To overcome this difficulty, we assume that the vector of proportions {\text{\boldmath\pi}}_{i}=(\pi_{i1},\ldots,\pi_{iK})^{t} that characterizes the conditional distribution of the th cluster, is a realization from a multivariate distribution. Therefore, {\text{\boldmath\pi}}_{i} plays a similar role to the random effect in the context of the mixed-effects model. As a distribution of {\text{\boldmath\pi}}_{i}, we use the Dirichlet distribution, which allows us to develop a tractable estimating method for model parameters.
In this article, the model parameters are estimated based on a likelihood-based approach. The model can be viewed as a three-stage hierarchical model, where the first stage consists of the model for the response variable, the second stage consists of the latent variables which assign the latent distribution, and the third stage consists of the model for the mixing proportions. We develop a Monte Carlo Expectation-Maximization (MCEM) algorithm (Dempster et al., 1977; Wei and Tanner, 1990) for parameter estimation of which the E-step is consist of a simple Gibbs sampling scheme for imputing the latent variables. Since the number of latent distributions is generally unknown, we consider selecting based on the Akaike information criteria (AIC) or Bayesian information criteria (BIC), where the maximum log-marginal likelihood can be easily computed from a simple Monte Carlo approximation.
The rest of the paper is organized as follows: Section 2 describes the proposed model in detail and develops the MCEM algorithm for maximizing the marginal likelihood. In Section 3, the performance of the proposed method is demonstrated along with some existing methods through simulation studies. An application to the real data set is also presented. In Section 4, some discussion is provided.
Latent Mixture Model
Model setup
Suppose that we have the clustered (grouped) observations , , , with an associated -dimensional vector of covariates {\text{\boldmathx}}_{ij}. Let f_{i}(y|{\text{\boldmathx}}) be a density or probability mass function of given {\text{\boldmathx}}_{ij}, which are the same within clusters but different across clusters. Our aim is to estimate the cluster-wise conditional density f_{i}(y|{\text{\boldmathx}}) from the data set \{y_{ij},{\text{\boldmathx}}_{ij}\}. To this end, we consider the following latent mixture model:
[TABLE]
where is the weight for the th component in the th cluster, h_{k}(\cdot|\cdot,{\text{\boldmath\phi}}_{k}),\ k=1,\ldots,K are the latent conditional densities characterized by the parameter {\text{\boldmath\phi}}_{k}, and is the unknown number of latent densities. Moreover, we assume that the mixing proportions {\text{\boldmath\pi}}_{i}’s are independent realizations from the Dirichlet distribution with the density
[TABLE]
for , where denotes the gamma function and {\text{\boldmath\alpha}}=({\alpha}_{1},\ldots,{\alpha}_{K})^{t} is a vector of unknown parameters. In this article, we let (1) and (2) together denote the latent mixture model. The unknown model parameters to be estimated are {\text{\boldmath\phi}}_{1},\ldots,{\text{\boldmath\phi}}_{K} in latent distributions and in the Dirichlet distribution. Under the setting (1) and (2), taking expectation of with respect to Dir({\text{\boldmath\alpha}}), we have
[TABLE]
which is referred to the marginal model, and is common over all the clusters. Hence, we can observe that {\text{\boldmath\pi}}_{i} characterizes the cluster-wise conditional density and plays a similar role to the random effects in the context of mixed-effects models. The mixing proportion {\text{\boldmath\pi}}_{i} can be estimated by the conditional expectation {\rm E}[{\text{\boldmath\pi}}_{i}|Y], where is a set of all the response variables. Under (1) and (2), response variables in different clusters are mutually independent, so that it holds {\rm E}[{\text{\boldmath\pi}}_{i}|Y]={\rm E}[{\text{\boldmath\pi}}_{i}|Y_{i}] with . Then, if the model parameters are known, the estimator of the cluster-wise conditional density is given by
[TABLE]
Generally speaking, the conditional expectation tends close to the marginal mean if the cluster-specific sample size is small, so that the estimated conditional density would be close to the marginal model (3). On the other hand, in clusters with relatively large , the estimated conditional density might vary from the marginal model (3), depending on the information of . Therefore, this model allows us to carry out a kind of shrinkage estimation of the cluster-wise conditional densities.
As often done in estimating mixture models, by introducing the latent component indicator , the proposed model (1) and (2) can be expressed in the three-stage hierarchical model:
[TABLE]
where is the distribution having density , and \text{Cat}(K,{\text{\boldmath\pi}}_{i}) is the categorical distribution on with the probability vector {\text{\boldmath\pi}}_{i}. In hierarchy (5), {\text{\boldmathz}}_{ij} and {\text{\boldmath\pi}}_{i} are the latent variables. The latent density is determined by the user and the generalized linear model is an attractive choice. For example, F_{k}({\text{\boldmathx}}_{ij},{\text{\boldmath\phi}}_{k})=N({\text{\boldmathx}}_{ij}^{t}{\text{\boldmath\beta}}_{k},\sigma_{k}^{2}) when is a continuous variable, and F_{k}({\text{\boldmathx}}_{ij},{\text{\boldmath\phi}}_{k})=\text{Po}(\exp({\text{\boldmathx}}^{t}{\text{\boldmath\beta}}_{k})) when is a counting variable.
Monte Carlo EM algorithm for parameter estimation
For completion of the conditional density (4), we need to estimate the unknown model parameters {\text{\boldmath\theta}}=\{{\text{\boldmath\phi}}_{1},\ldots,{\text{\boldmath\phi}}_{K},{\text{\boldmath\alpha}}\} based on the data. Under the hierarchical formulation (5), the marginal likelihood function L({\text{\boldmath\theta}}) is expressed as
[TABLE]
where and \sum_{{\text{\boldmathz}}_{i}} denotes the summation over the all combination of {\text{\boldmathz}}_{i}\in\{1,\ldots,K\}^{n_{i}}. Hence, a direct maximization of the marginal likelihood is not feasible since evaluation of the likelihood function L({\text{\boldmath\theta}}) requires the summation over elements for each , which is computationally prohibitive even for small . Moreover, since the functional form of L({\text{\boldmath\theta}}) is complex and not familiar, the brute force maximization of L({\text{\boldmath\theta}}) is not realistic.
Instead, we exploit the hierarchical representation (5) and develop the EM algorithm (Dempster et al., 1977) which indirectly and iteratively maximizes L({\text{\boldmath\theta}}). Let {\text{\boldmath\pi}}=\left\{{\text{\boldmath\pi}}_{1},\ldots,{\text{\boldmath\pi}}_{m}\right\} and {\text{\boldmathz}}=\left\{{\text{\boldmathz}}_{1},\ldots,{\text{\boldmathz}}_{m}\right\}. Then, the complete log-likelihood function of (5) is given by
[TABLE]
where p({\text{\boldmath\pi}}_{i}|{\text{\boldmath\alpha}}) denotes the density function of {\rm Dir}({\text{\boldmath\alpha}}). Then, given the value of in the th iteration denoted by {\text{\boldmath\theta}}^{(t)}, the E-step entails the imputation of the latent variables and by taking expectation
[TABLE]
where the expectation is taken with respect to the posterior distribution of ({\text{\boldmathw}},{\text{\boldmath\pi}}) given all the response variables . However, since an analytical form of Q({\text{\boldmath\theta}}|{\text{\boldmath\theta}}^{(t)}) is not available, we consider Monte Carlo approximation of Q({\text{\boldmath\theta}}|{\text{\boldmath\theta}}^{(t)}) as
[TABLE]
where is a sufficiently large number, and {\text{\boldmathz}}^{(l)} and {\text{\boldmath\pi}}^{(l)} are the th random sample generated from the posterior distribution of ({\text{\boldmathz}},{\text{\boldmath\pi}}) given with {\text{\boldmath\theta}}={\text{\boldmath\theta}}^{(t)}. Under the hierarchy (5), the marginal posterior distributions of and are not simple forms, but the full conditional distributions of {\text{\boldmathz}}|{\text{\boldmath\pi}},Y and {\text{\boldmath\pi}}|{\text{\boldmathz}},Y are the following familiar distributions:
[TABLE]
where \widetilde{{\text{\boldmathp}}}_{ij}=(\widetilde{p}_{ij1},\ldots,\widetilde{p}_{ijK})^{t} and \widetilde{{\text{\boldmatha}}}_{i}=(\widetilde{a}_{i1},\ldots,\widetilde{a}_{iK})^{t} with
[TABLE]
Then we can use a Gibbs sampler for generating random samples of the posterior distribution of ({\text{\boldmathz}},{\text{\boldmath\pi}}).
The M-step maximizes Q({\text{\boldmath\theta}}|{\text{\boldmath\theta}}^{(t)}) obtained from the E-step, noting that
[TABLE]
where is a constant independent of and z_{ijk}^{\ast}={\rm E}[I(z_{ij}=k)|Y,{\text{\boldmath\theta}}^{(t)}] computed from the E-step. Therefore, the maximization problem of Q({\text{\boldmath\theta}}|{\text{\boldmath\theta}}^{(t)}) can be divided into the following:
[TABLE]
where (\log\pi_{ik})^{\ast}={\rm E}[\log\pi_{ik}|Y,{\text{\boldmath\theta}}^{(t)}]. It is noted that the maximization with respect to each {\text{\boldmath\phi}}_{k} is identical to maximizing the weighted log-likelihood function of the latent conditional distributions, which can be easily carried out by using, for example, the Newton-Raphson algorithm. Similarly, the maximization with respect to is similar to performing the maximum likelihood method in the Dirichlet distribution and is not difficult.
The whole procedure of the proposed MCEM algorithm is summarized as follows.
Algorithm 1** (MCEM algorithm).**
Iterative:
Set the initial values {\text{\boldmath\theta}}^{(0)} and .
- 2.
Draw a large number of samples and by Gibbs sampling with the full conditionals (6), and compute z_{ijk}^{\ast}={\rm E}[I(z_{ij}=k)|Y,{\text{\boldmath\theta}}^{(t)}] and (\log\pi_{ik})^{\ast}={\rm E}[\log\pi_{ik}|Y,{\text{\boldmath\theta}}^{(t)}].
- 3.
Solve the maximization problem (7) and set {\text{\boldmath\phi}}_{k}^{(t+1)}={\widehat{\text{\boldmath\phi}}}_{k} and {\text{\boldmath\alpha}}^{(t+1)}={\widehat{\text{\boldmath\alpha}}}.
- 4.
If the algorithm has converged, the the algorithm is terminated. Otherwise, set and go back to Step 2.
In the case of the normal linear regression model as the latent model, namely F_{k}({\text{\boldmathx}}_{ij},{\text{\boldmath\phi}}_{k})=N({\text{\boldmathx}}_{ij}^{t}{\text{\boldmath\beta}}_{k},\sigma^{2}_{k}) in (5), the M-step for {\text{\boldmath\phi}}_{k}=({\text{\boldmath\beta}}_{k}^{t},\sigma_{k}^{2})^{t} in (7) can be obtained analytically:
[TABLE]
for .
Following Shi and Copas (2002), the convergence of the proposed MCEM algorithm is monitored by using the batch mean \widetilde{{\text{\boldmath\theta}}}^{(t)}=H^{-1}\sum_{h=0}^{H-1}{\text{\boldmath\theta}}^{(t-h)}, after the th iteration. The algorithm is terminated when the relative difference \|\widetilde{{\text{\boldmath\theta}}}^{(t)}-\widetilde{{\text{\boldmath\theta}}}^{(t-d)}\|/(\|\widetilde{{\text{\boldmath\theta}}}^{(t-d)}\|+\delta), is smaller than some predetermined (small) . Here, , , and are specified by the user, and we use , , as default choices. For the E-step, is used as the default choice and this choice appears to work well in the numerical examples in Section 3.
For selecting the number of latent distributions, , we use the Akaike information criteria (AIC) or the Bayesian information criteria (BIC) based on the log-marginal likelihood, without any theoretical justifications. When {\text{\boldmath\phi}}_{k} is -dimensional, the number of parameters included in the model (5) is . Then the formulations of AIC and BIC are given by
[TABLE]
where is the total number of observations and
[TABLE]
is the maximum marginal likelihood. As noted in Section 2.2, since the direct evaluation of the marginal likelihood is computationally prohibitive, the maximum marginal likelihood is evaluated by the Monte Carlo integration. Let {\text{\boldmath\pi}}_{i}^{*}=(\pi_{i1}^{*},\dots,\pi_{iG}^{*})^{t} be the random vector generated from {\rm Dir}({\widehat{\text{\boldmath\alpha}}}). Then, the Monte Carlo approximation of (8) is
[TABLE]
for a large , where is the th draw from {\rm Dir}({\widehat{\text{\boldmath\alpha}}}).
Let be the selected number of latent distributions based on AIC or BIC. Then the feasible version of the cluster-wise estimated conditional density (4) is given by
[TABLE]
where evaluated at {\text{\boldmath\theta}}={\widehat{\text{\boldmath\theta}}}, which can be computed via the Gibbs sampler (6) with {\text{\boldmath\theta}}={\widehat{\text{\boldmath\theta}}}.
Flexible modeling of mixing proportions
One possible criticism for the formulation of the proposed latent mixture model (1) is its simplicity in the relationship between the response variable and covariate vector . In the context of mixture modeling for non-clustered (independent) data, Geweke and Keane (2007) proposed a flexible modeling of the mixing proportions by considering covariate dependent structures. Then, we here consider implementing the idea to the modeling cluster-wise conditional densities, that is, we consider the following structure in the distribution of the mixing proportions:
[TABLE]
where {\text{\boldmathw}}_{i} is the -dimensional vector of the cluster-specific covariates and {\text{\boldmath\gamma}}_{k} is the corresponding coefficient. One can take, for example, {\text{\boldmathw}}_{i}=\bar{{\text{\boldmathx}}}_{i}^{(s)} where \bar{{\text{\boldmathx}}}_{i}^{(s)}=n_{i}^{-1}\sum_{j=1}^{n_{i}}{\text{\boldmathx}}_{ij}^{(s)} and {\text{\boldmathx}}_{ij}^{(s)} is the subvector of {\text{\boldmathx}}_{ij}. Under this setting, it hods that
[TABLE]
the MCEM algorithm developed in Section 2.2 can be easily modified to estimate the model with (9). Specifically, in the E-step appeared in the full conditional distribution of {\text{\boldmath\pi}}_{i}|{\text{\boldmathw}},Y in (6) is replaced with
[TABLE]
and the M-step for in (7) is replaced with the maximizing
[TABLE]
where {\text{\boldmath\gamma}}=\{{\text{\boldmath\gamma}}_{1},\ldots,{\text{\boldmath\gamma}}_{K}\}. Finally, it is noted that the number of parameters under (9) is , so that the penalty terms in AIC and BIC used for selecting should be changed accordingly.
Numerical Studies
Simulation studies
The finite sample performance of the proposed latent mixture model is investigated together with some existing methods. We consider two cases of within-cluster sample sizes and for and . For the true conditional density in the th cluster, the following two scenarios are considered:
[TABLE]
where , and denotes the density function of the normal distribution and in each scenario. The latent mixture regression (LMR) model with normal linear regression models used as latent models is considered, and the number of latent components are selected by using BIC. For comparison, we also consider the local mixture (LM) model where the mixture of normal linear regressions is fitted to each cluster separately and global mixture (GM) model where the single mixture of normal linear regressions is fitted to the whole data ignoring the cluster heterogeneity. For both models, the number of components was selected based on BIC. Moreover, as the competitor from random effect models, we also applied a random intercept (RI) model. Note that GM ignores the clustering structure and produces the same conditional densities over all the clusters. On the other hand, while LM may flexibly express the cluster-wise conditional density, the results are expected to be unstable due to the relatively small within-cluster sample sizes.
The performance of the models is measured based on the cluster-wise mean integrated squared error (MISE) defined as
[TABLE]
where is the estimated conditional density obtained from the th replication. Since the above MISE depends on the covariate , we considered the three values, . We computed the cluster-wise MISE of four models based on replications.
Figure 1 and 2 present the cluster-wise MISE for Scenario (I) and (II), respectively. The figures show that the proposed LMR model outperforms in all cases. As expected, LM appears to have produced the unstable results due to the relatively small sample sizes in spite of its flexibility. On the other hand, GM seems to perform relatively well in this study as the number of parameters is small compared with LM. However, since GM produces the same conditional density estimators over the clusters, GM performs no better than LMR. Concerning RC, it may perform as well as GM for in Scenario (I) and some cases in Scenario (II), but the result is much inferior to that of LMR. Although not shown here, BIC selected the true number of components most of the time, while the selected number of components by AIC tended to be larger than the truth. Hence, BIC would be preferable to AIC and only the results based on BIC are considered in the rest of this article.
We next investigate the efficacy of the modeling the distribution of the mixing proportion in terms of some covariates as introduced in Section 2.3. To this end, we consider the following true conditional density:
[TABLE]
We set for such that the within-cluster sample size varies across clusters and consider two cases of , and . As in the previous studies, the covariates ’s are generated from . The latent mixture regression model with covariate-dependent structure of mixing proportions (LMR-CD) and the latent mixture regression model (LMR) are fitted to the simulated data. For both models, we use the normal linear regression models as the component models, and the number of components is selected based on BIC. For comparison, we again computed the MISE with , and the results are presented in Figure 3. In the figure, LMR-CD appears to perform better than LMR for the clusters with the small within-cluster sample sizes for both .
Real data example
To demonstrate the proposed method in a practical situation, we apply the latent mixture model to the posted land price (PLP) data in Tokyo and the surrounding four prefectures (Chiba, Saitama, Kanagawa and Ibaraki) in 2001. The data units (locations) are clustered with respect to the nearest station. The number of clusters is and the total number of units is . The number of within-cluster samples are ranging from to , and the histogram of is provided in the left panel in Figure 4. We note that there are clusters with smaller than and clusters with . The response variable is the PLP which is measured in 100,000 yen per squared meter. In each th unit (location) in th cluster (station), is observed with the floor area ratio (%) and amount of time (second) to station on foot. Moreover, as cluster level information, the amount of time from Tokyo station by train and the prefecture to which the station belongs are available. We use four dummy variables , and for Chiba, Saitama, Kanagawa, and Ibaraki, respectively, which take value one if the station belongs to the corresponding prefecture and zero otherwise. The values of range from to . The right panel of Figure 4 shows that the histogram of for . Note that the number of samples with is only which is less than of the total number of observations. Using this dataset, the conditional density of the PLP for each station is estimated.
Let {\text{\boldmathx}}_{ij}=(1,F_{ij},A_{ij},T_{i},D_{i1},\ldots,D_{i4})^{t}. We consider the following latent mixture regression (LMR) model:
[TABLE]
where denotes the density function of , and is the standardized version of . It is noted that the marginal model (3) is given by
[TABLE]
and the cluster-wise estimated density (4) is
[TABLE]
where and can be computed from the Gibbs sampling (6). Moreover, based on BIC, the number of latent components was selected to be from . We also doubled the number of Gibbs draws in the E-step, but the same result was obtained.
For comparison with the proposed method, we also applied the global mixture (GM) model with components:
[TABLE]
where . It is expected that the estimated GM is similar to the marginal model in LMR. Based on BIC was selected.
To visualize the estimated conditional density in each cluster, we fixed the covariate vector at , in which corresponds to the density function of the PLP of each cluster when the floor area ratio is and the location is minutes’ walk from the nearest station. Figure 5 presents the estimated density under LMR, the marginal model of LMR (mLMR), and GM for the stations with small . The figure shows that the cluster-wise estimated densities under LMR are close to those under the marginal model (11) when is small. This is because the small values leads to a small difference between the prior mean and posterior mean of , so that the estimated densities in such clusters are automatically close to those under the marginal model which can be stably estimated from the data. Figure 6 presents the estimated densities for the stations with relatively large . Contrary to Figure 5, the estimated densities under LMR are apart from the marginal model in some clusters. The result implies that the marginal model is adjusted by the observed data in these clusters. We finally point out that the marginal model of LMR and GM are similar in most cases since their modeling strategies are similar in the sense that they aim at estimating the global density by ignoring the clustering structure.
Conclusion and Discussion
We have proposed the latent mixture model for estimating the cluster-wise conditional distributions. The model parameters are estimated by using the simple Monte Carlo EM algorithm instead of the brute force maximization of the marginal likelihood. Through the simulation and empirical studies, the proposed method is found to be useful for flexible modeling of clustered data.
In this article, we selected the number of components by using AIC and BIC. However, it is well-recognized that the mixture model is a singular model and the use of AIC or BIC is not justified. The detailed investigation of selecting the number of latent components with theoretical validity would extend the scope of this article, which will be left as a valuable future work.
Acknowledgments. This work was supported by JSPS KAKENHI Grant Numbers [16H07406, 15K17036, 16K17101]. The computational results were obtained using Ox version 6.21 (Doornik, 2007).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1]
- 2[2] Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B , 61 , 265–285.
- 3[3]
- 4[4] Demidenko, E. (2004). Mixed Models: Theory and Applications , New York: Wiley.
- 5[5]
- 6[6] Dempster, A., Laird, N. and Rubin, D. (1977). Maximum Likelihood From Incomplete Data via the EM Algorithm (with discussion). Journal of the Royal Statistical Society: Series B , 39 , 1–38.
- 7[7]
- 8[8] Doornik, J. (2007). Ox: Object Oriented Matrix Programming , Timberlake Consultants Press: London.
