Latent Gaussian Mixture Models for Nationwide Kidney Transplant Center Evaluation

Lanfeng Pan; Yehua Li; Kevin He; Yanming Li; Yi Li

arXiv:1703.03753·stat.AP·March 13, 2017


TL;DR

This paper introduces a novel latent Gaussian mixture model to evaluate kidney transplant centers nationwide, improving the assessment of center effects by capturing heterogeneity and addressing distributional assumptions.

Contribution

It proposes a penalized likelihood estimation method for latent Gaussian mixture models and develops tests to determine the number of mixture components, enhancing center effect analysis.

Findings

1. Distributional assumptions affect center effect predictions.

2. The mixture model effectively captures heterogeneity among centers.

3. Simulation and real data validate the proposed approach.

Abstract

The five-year post-transplant survival rate is an important indicator of the quality of care delivered by kidney transplant centers in the United States. To provide a fair assessment of each transplant center, an effect that represents the center-specific care quality, along with patient level risk factors, is often included in the risk adjustment model. In the past, the center effects have been modeled as either fixed effects or Gaussian random effects, with various pros and cons. Our numerical analyses reveal that the distributional assumptions do impact the prediction of center effects, especially when the effect is extreme. To bridge the gap between these two approaches, we propose to model the transplant center effect as a latent random variable with a finite Gaussian mixture distribution. Such latent Gaussian mixture models provide a convenient framework to study the heterogeneity among…


Tables

Table 1: Summary for parameter estimation under Simulation Model 1 based on 200 replications.

Parameter	Truth	Mean	Bias	Std
$π_{1}$	0.5000	0.4971	-0.0029	0.0280
$π_{2}$	0.5000	0.5029	0.0029	0.0280
$μ_{1}$	-3.2598	-3.2586	0.0012	0.1262
$μ_{2}$	0.7402	0.7401	-0.0001	0.0752
$σ_{1}$	1.2000	1.1954	-0.0046	0.1340
$σ_{2}$	0.8000	0.7960	-0.0040	0.0630
$β_{1}$	1.0000	1.0017	0.0017	0.0213
$β_{2}$	1.0000	1.0006	0.0006	0.0225

Table 2: Summary for parameter estimation under Simulation Model 2 based on 200 replications.

Parameter	Truth	Mean	Bias	Std
$π_{1}$	0.3000	0.3016	0.0016	0.0244
$π_{2}$	0.4000	0.3904	-0.0096	0.0588
$π_{3}$	0.3000	0.3080	0.0080	0.0596
$μ_{1}$	-5.2598	-5.2800	-0.0202	0.2175
$μ_{2}$	-0.2598	-0.2652	-0.0054	0.3472
$μ_{3}$	2.7402	2.6894	-0.0508	0.3433
$σ_{1}$	1.2000	1.1821	-0.0179	0.2664
$σ_{2}$	0.8000	0.8036	0.0036	0.1948
$σ_{3}$	0.9000	0.9286	0.0286	0.2516
$β_{1}$	1.0000	1.0010	0.0010	0.0225
$β_{2}$	1.0000	1.0038	0.0038	0.0226

Table 3: Mean squared prediction error for the random effect under Simulation Models 1 and 2. GLMM Gaussian: generalized linear mixed model with Gaussian random effects; GLMM Mixture: the proposed model; Mean: mean squared prediction error averaged over 200 replicates; Std: standard deviation of the prediction error.

Simulation Model	Fitted Model	Mean	Std
Model 1	GLMM Gaussian	0.4167	0.0392
Model 1	GLMM Mixture	0.3589	0.0361
Model 2	GLMM Gaussian	0.6988	0.0697
Model 2	GLMM Mixture	0.5405	0.0581

Table 4: OPTN data analysis: estimated GLMM regression coefficients, standard errors, $z$-values and $p$-values. The covariates are $x_{1}$ = cold ischemic time, $x_{2}$ = age, $x_{3}$ = sex; $x_{4}$–$x_{6}$ are indicators for BMI in the intervals (22, 25], (25, 30] and 30+, respectively; $x_{7}$–$x_{10}$ are indicators for cases performed in 1990–1994, 1995–1999, 2000–2003 and 2004–2008, respectively.

Covariate	Estimate	Std. Error	$z$-value	$p$-value
$x_{1}$	0.019503	0.0003048	63.9869	$<$ 1e-99
$x_{2}$	0.007112	0.0002117	33.5890	$<$ 1e-99
$x_{3}$	0.030928	0.0094616	3.2688	0.0011
$x_{4}$	0.077860	0.0154998	5.0232	$<$ 1e-6
$x_{5}$	0.120536	0.0129628	9.2986	$<$ 1e-19
$x_{6}$	0.225015	0.0148196	15.1836	$<$ 1e-51
$x_{7}$	-0.270078	0.0146769	-18.4016	$<$ 1e-74
$x_{8}$	-0.526297	0.0127432	-41.3003	$<$ 1e-99
$x_{9}$	-0.632073	0.0138511	-45.6334	$<$ 1e-99
$x_{10}$	-0.800276	0.0130163	-61.4824	$<$ 1e-99

Table 5: The out-performing centers detected using local false discovery rate in the OPTN data.

Center id	lFDR	$\hat{γ}$	Sample Size	Survival Rate
#287	0.0013	-2.6784	114	0.973
#10	0.0061	-2.5753	125	0.944
#28	0.0736	-2.3364	120	0.841

Equations

$$f(Y_{ik}|{\boldsymbol{X}}_{ik},\gamma_{i};{\boldsymbol{\beta}},\varphi)=\exp\left\{\frac{Y_{ik}\xi_{ik}-b(\xi_{ik})}{a(\varphi)}+d(Y_{ik},\varphi)\right\},$$

$$l_{comp}({\boldsymbol{\theta}};{\boldsymbol{Y}},{\boldsymbol{X}},{\boldsymbol{\gamma}},{\boldsymbol{L}})=\sum_{i=1}^{n}\ell_{i,comp}({\boldsymbol{\theta}};{\boldsymbol{Y}}_{i},{\boldsymbol{X}}_{i},\gamma_{i},{\boldsymbol{L}}_{i}),$$

$$l_{comp,p}({\boldsymbol{\theta}};{\boldsymbol{Y}},{\boldsymbol{X}},{\boldsymbol{\gamma}},{\boldsymbol{L}})=l_{comp}({\boldsymbol{\theta}};{\boldsymbol{Y}},{\boldsymbol{X}},{\boldsymbol{\gamma}},{\boldsymbol{L}})+\sum_{c=1}^{C}p_{n}(\sigma_{c}^{2}),$$

$$p_{n}(\sigma^{2};\widehat{\sigma}^{2}_{pilot})=-a_{n}\left\{\widehat{\sigma}^{2}_{pilot}/\sigma^{2}+\log(\sigma^{2}/\widehat{\sigma}^{2}_{pilot})-1\right\},$$

$$Q({\boldsymbol{\theta}}|{\boldsymbol{\theta}}^{(t-1)})=\sum_{i=1}^{n}{\rm E}\left[\ell_{i,comp}({\boldsymbol{\theta}};{\boldsymbol{Y}}_{i},{\boldsymbol{X}}_{i},\gamma_{i},{\boldsymbol{L}}_{i})\mid{\boldsymbol{Y}}_{i},{\boldsymbol{X}}_{i},{\boldsymbol{\theta}}^{(t-1)}\right]+\sum_{c=1}^{C}p_{n}(\sigma_{c}^{2}),$$

$$\begin{aligned}{\rm E}\left[\ell_{i,comp}({\boldsymbol{\theta}};{\boldsymbol{Y}}_{i},{\boldsymbol{X}}_{i},\gamma_{i},{\boldsymbol{L}}_{i})\mid{\boldsymbol{Y}}_{i},{\boldsymbol{X}}_{i},{\boldsymbol{\theta}}^{(t-1)}\right]&=\sum_{c=1}^{C}\int\log f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma;{\boldsymbol{\theta}}_{y})\,f(\gamma,L_{ic}=1|{\boldsymbol{X}}_{i},{\boldsymbol{Y}}_{i};{\boldsymbol{\theta}}^{(t-1)})\,d\gamma\\&\quad+\sum_{c=1}^{C}\int\log f_{c}(\gamma|\mu_{c},\sigma_{c})\,f(\gamma,L_{ic}=1|{\boldsymbol{X}}_{i},{\boldsymbol{Y}}_{i};{\boldsymbol{\theta}}^{(t-1)})\,d\gamma\\&\quad+\sum_{c=1}^{C}\log\pi_{c}\int f(\gamma,L_{ic}=1|{\boldsymbol{X}}_{i},{\boldsymbol{Y}}_{i};{\boldsymbol{\theta}}^{(t-1)})\,d\gamma,\end{aligned}$$

$$f(\gamma,L_{ic}=1|{\boldsymbol{X}}_{i},{\boldsymbol{Y}}_{i};{\boldsymbol{\theta}}^{(t-1)})=\frac{\pi_{c}^{(t-1)}f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma;{\boldsymbol{\theta}}_{y}^{(t-1)})\frac{1}{\sigma_{c}^{(t-1)}}\phi\left(\frac{\gamma-\mu_{c}^{(t-1)}}{\sigma_{c}^{(t-1)}}\right)}{\sum_{c'=1}^{C}\pi_{c'}^{(t-1)}\int f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma;{\boldsymbol{\theta}}_{y}^{(t-1)})\frac{1}{\sigma_{c'}^{(t-1)}}\phi\left(\frac{\gamma-\mu_{c'}^{(t-1)}}{\sigma_{c'}^{(t-1)}}\right)d\gamma}.$$

$$\int h(\gamma)\frac{1}{\sigma}\phi\{(\gamma-\mu)/\sigma\}\,d\gamma\approx\frac{1}{\sqrt{\pi}}\sum_{m=1}^{M}w_{m}h(\gamma_{m})$$

$$\omega_{icm}=\frac{\tilde{\omega}_{icm}}{\sum_{c'=1}^{C}\sum_{m'=1}^{M}\tilde{\omega}_{ic'm'}},\quad\hbox{where }\tilde{\omega}_{icm}=w_{m}f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma^{(c,m)};{\boldsymbol{\theta}}_{y}^{(t-1)})\pi_{c}^{(t-1)}.$$

$$\pi_{c}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}\sum_{m=1}^{M}\omega_{icm},\qquad\mu_{c}^{(t)}=\frac{\sum_{i=1}^{n}\sum_{m=1}^{M}\gamma^{(c,m)}\omega_{icm}}{\sum_{i=1}^{n}\sum_{m=1}^{M}\omega_{icm}},$$

$$(\sigma_{c}^{2})^{(t)}=\frac{\sum_{i=1}^{n}\sum_{m=1}^{M}(\gamma^{(c,m)}-\mu_{c}^{(t)})^{2}\omega_{icm}+2a_{n}\widehat{\sigma}^{2}_{pilot}}{\sum_{i=1}^{n}\sum_{m=1}^{M}\omega_{icm}+2a_{n}},$$

$$\sum_{i=1}^{n}\sum_{c=1}^{C}\sum_{m=1}^{M}\omega_{icm}\log f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma^{(c,m)};{\boldsymbol{\theta}}_{y})$$

$$\max_{l}\frac{|\theta_{l}^{(t)}-\theta_{l}^{(t-1)}|}{|\theta_{l}^{(t-1)}|+0.001}<0.001,$$

$$\int\gamma f(\gamma|{\boldsymbol{Y}}_{i},{\boldsymbol{X}}_{i},{\boldsymbol{\theta}})\,d\gamma=\frac{\sum_{c=1}^{C}\pi_{c}\int\gamma f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma;{\boldsymbol{\theta}}_{y})\phi\{(\gamma-\mu_{c})/\sigma_{c}\}/\sigma_{c}\,d\gamma}{\sum_{c=1}^{C}\pi_{c}\int f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma;{\boldsymbol{\theta}}_{y})\phi\{(\gamma-\mu_{c})/\sigma_{c}\}/\sigma_{c}\,d\gamma}.$$

$$\widehat{\gamma}_{i}=\sum_{c=1}^{C}\sum_{m=1}^{M}\gamma^{(c,m)}\omega_{icm}$$

$$l_{pen}({\boldsymbol{\theta}};{\boldsymbol{Y}},{\boldsymbol{X}})=l_{n}({\boldsymbol{\theta}};{\boldsymbol{Y}},{\boldsymbol{X}})+\sum_{c=1}^{C}p_{n}(\sigma_{c}^{2}),$$

$$l_{n}({\boldsymbol{\theta}};{\boldsymbol{Y}},{\boldsymbol{X}})=\sum_{i=1}^{n}\log\int\left\{\prod_{k=1}^{N_{i}}f(Y_{ik}|{\boldsymbol{X}}_{ik},\gamma;{\boldsymbol{\theta}}_{y})\,g(\gamma|{\boldsymbol{\theta}}_{\gamma})\right\}d\gamma.$$

$\Theta_{C}=$

$${\cal F}=\left\{{\boldsymbol{\theta}}\in\bar{\Theta}_{C};\ \int_{-\infty}^{({\boldsymbol{x}}^{\prime},{\boldsymbol{y}}^{\prime})}f({\boldsymbol{x}},{\boldsymbol{y}}|{\boldsymbol{\theta}})\,d\mu({\boldsymbol{x}},{\boldsymbol{y}})=\int_{-\infty}^{({\boldsymbol{x}}^{\prime},{\boldsymbol{y}}^{\prime})}f({\boldsymbol{x}},{\boldsymbol{y}}|{\boldsymbol{\theta}}_{0})\,d\mu({\boldsymbol{x}},{\boldsymbol{y}})\hbox{ for any }({\boldsymbol{x}}^{\prime},{\boldsymbol{y}}^{\prime})\right\}.$$

$$\theta_{full}(\tau)=\arg\max_{\theta\in\bar{\Theta}_{2}(\tau)}l_{pen}(\theta).$$

$$T_{1}=\max_{\tau\in T}T_{1}(\tau),\quad\hbox{where }T_{1}(\tau)=2[l_{n}\{\theta_{full}(\tau)\}-l_{n}(\theta_{red})].$$

$$N_{C+1}(c,\tau)$$

$$\theta_{full}(c,\tau)=\arg\max_{\theta\in N_{C+1}(c,\tau)}l_{pen}(\theta).$$

$$\mu_{c,full}(c,\tau)-\mu_{c,0}=O_{p}(n^{-1/4}),\qquad\mu_{c+1,full}(c,\tau)-\mu_{c,0}=O_{p}(n^{-1/4}),$$

$$\sigma_{c,full}(c,\tau)-\sigma_{c,0}=O_{p}(n^{-1/4}),\qquad\sigma_{c+1,full}(c,\tau)-\sigma_{c,0}=O_{p}(n^{-1/4}),$$

$$T_{C}(\tau)=\max_{c\in\{1,2,\dots,C\}}T_{C}(c,\tau),\quad\hbox{where }T_{C}(c,\tau)=2[l_{n}\{\theta_{full}(c,\tau)\}-l_{n}(\theta_{red})].$$

$$T_{C}=\max_{\tau\in T}T_{C}(\tau).$$

$$\left(\begin{array}{c}\mu_{c}\\ \mu_{c+1}\\ \sigma_{c}^{2}\\ \sigma_{c+1}^{2}\end{array}\right)=\left(\begin{array}{c}\nu_{\mu}+(1-\tau)\lambda_{\mu}\\ \nu_{\mu}-\tau\lambda_{\mu}\\ \nu_{\sigma}+(1-\tau)(2\lambda_{\sigma}-\frac{1+\tau}{3}\lambda_{\mu}^{2})\\ \nu_{\sigma}-\tau(2\lambda_{\sigma}+\frac{2-\tau}{3}\lambda_{\mu}^{2})\end{array}\right),$$

$$\begin{array}{lllcl}{\boldsymbol{\delta}}(c)&=&(\pi_{1},\ldots,\pi_{c-1},&\pi_{c}+\pi_{c+1},&\pi_{c+2},\ldots,\pi_{C})^{\rm T},\\ {\boldsymbol{\mu}}(c)&=&(\mu_{1},\ldots,\mu_{c-1},&\nu_{\mu},&\mu_{c+2},\ldots,\mu_{C},\ \mu_{C+1})^{\rm T},\\ {\boldsymbol{\sigma}}^{2}(c)&=&(\sigma^{2}_{1},\ldots,\sigma^{2}_{c-1},&\nu_{\sigma},&\sigma^{2}_{c+2},\ldots,\sigma^{2}_{C},\ \sigma^{2}_{C+1})^{\rm T}.\end{array}$$

$\eta$


Taxonomy

Topics: Bayesian Methods and Mixture Models · Liver Disease Diagnosis and Treatment · Statistical Methods and Bayesian Inference

Full text

Latent Gaussian Mixture Models for Nationwide Kidney Transplant Center Evaluation

Lanfeng Pan, Yehua Li, Kevin He, Yanming Li and Yi Li

Department of Statistics & Statistical Laboratory, Iowa State University

School of Public Health & Kidney Epidemiology and Cost Center, University of Michigan, Ann Arbor

Abstract

The five-year post-transplant survival rate is an important indicator of the quality of care delivered by kidney transplant centers in the United States. To provide a fair assessment of each transplant center, an effect that represents the center-specific care quality, along with patient level risk factors, is often included in the risk adjustment model. In the past, the center effects have been modeled as either fixed effects or Gaussian random effects, with various pros and cons. Our numerical analyses reveal that the distributional assumptions do impact the prediction of center effects, especially when the effect is extreme. To bridge the gap between these two approaches, we propose to model the transplant center effect as a latent random variable with a finite Gaussian mixture distribution. Such latent Gaussian mixture models provide a convenient framework to study the heterogeneity among the transplant centers. To overcome the weak identifiability issues, we propose to estimate the latent Gaussian mixture model using a penalized likelihood approach, and develop sequential locally restricted likelihood ratio tests to determine the number of components in the Gaussian mixture distribution. The fitted mixture model provides a convenient means of controlling the false discovery rate when screening for underperforming or outperforming transplant centers. The performance of the methods is verified by simulations and by the analysis of the motivating data example.

Keywords: Clustering, False discovery rate, Health policy, Locally restricted likelihood ratio test, Penalized EM algorithm, Latent variables.

Correspondence should be addressed to Yehua Li.

1 Introduction

This paper is motivated by the analysis of the national kidney transplant data, supported in part by the Health Resources and Services Administration. Renal failure is one of the most common and severe diseases in the nation. In 2013, a total of 117,162 new cases were reported (www.USRDS.org). Kidney transplantation, as a primary therapy for end stage renal disease, typically involves transplant surgeons and physicians, coordinators, social workers, financial counselors, nutritionists, psychologists, referring physicians, and the patients. The quality of care delivered by the transplant centers is often assessed by patient survival, for example, the 5-year post-transplant survival rate.

To provide a fair assessment of each transplant center, patient level risk factors as well as an effect that represents the care quality of the transplant center are often included in the risk adjustment model. The Organ Procurement and Transplantation Network (OPTN), a critical system that helps organ transplant institutions match waiting candidates with donated organs, contains all national data on the candidate waiting list, organ donation and matching, and transplantation. The kidney transplant database is a large component of OPTN. It includes patient level risk factors such as demographic information, the quality of the donor kidney and the matching between the patient and the donor, as well as the transplant centers that performed the transplant surgeries. It is of substantial interest to estimate transplant center effects based on this national database, as they provide a data-driven basis for evaluating national transplant centers and identifying underperforming or outperforming centers. The results may have implications for health policy making and facilitate patients’ choice of transplant centers.

Many statisticians and health policy researchers (Krumholz et al., 2006a, b; Li et al., 2009) advocate modeling the center effects as random effects that follow a Gaussian distribution. This approach ignores the heterogeneity among the transplant centers: there is a shrinkage effect in the prediction of the center level random effects, and the assumption of a common Gaussian distribution makes the predicted random effects similar in value. He et al. (2013) argued that borrowing information from other transplant centers is not fair when the goal of the study is to evaluate and rank these centers. Instead, they suggested modeling the transplant center effects as fixed effects. However, in such a fixed effects model, the number of parameters is large, making statistical inference numerically unstable, especially when the center size varies substantially. Estimating the effects of small centers with fewer patients presents even greater challenges. Indeed, in our national transplant study the number of patients treated by individual centers varies from 3 to 5830, with a median center size of 603. A comprehensive critique of these two approaches can be found in a report prepared by the Committee of Presidents of Statistical Societies (COPSS) through a contract with Centers for Medicare and Medicaid Services (Ash et al., 2012).

To bridge the gap between these two approaches, we propose to model the transplant center effects using a finite Gaussian mixture model. Our model has two advantages compared to the existing models. First, the model allows the presence of heterogeneities (e.g. the existence of clusters or subpopulations) among the transplant centers, making it a natural framework to identify under- or out-performing centers. Second, the mixture model can be considered as a compromise between the random effects model and the fixed effects model: it reduces to the random effects model when there is only one component in the mixture distribution and it becomes the fixed effects model if each transplant center is a cluster. Within the framework of generalized linear mixed effects models (GLMM), we will develop data-driven methods to determine the number of components in the mixture distribution of the random effects.

Indeed, the vast majority of the GLMM literature assumes the distribution of the random effect is Gaussian, focuses on estimating the fixed effects and treats the random effects as nuisance (Breslow and Clayton, 1993; Lin and Breslow, 1996). Even though GLMM is in general robust against deviation from the Gaussian random effect assumption (McCulloch and Neuhaus, 2011), many authors have documented various drawbacks when the Gaussian assumption is violated, including loss of estimation efficiency (Chen, Zhang and Davidian, 2002) and reduced power for statistical tests (Litière, Alonso and Molenberghs, 2007). Even though the predicted random effects are relatively robust in terms of mean squared error, the shape of the distribution for the predicted random effect is highly sensitive and mostly reflects the shape of the assumed random effect distribution (McCulloch and Neuhaus, 2011). Many authors have tried to relax the Gaussian assumption and model the random effect in GLMM with more flexible distributions, such as the semi-nonparametric distribution (Chen, Zhang and Davidian, 2002) and the Gaussian mixture distribution (Caffo, An and Rohde, 2007). Caffo, An and Rohde (2007) proposed a model similar to ours. However, they limited their investigation to the binary probit GLMM and focused on numerical performance rather than theoretical justification. We propose a test to check whether it is necessary to model the random effects as a normal mixture and, if so, how many components are sufficient. Another major difference is that our primary interest is the random effect itself rather than merely relaxing the assumption on its distribution: by modeling the random effect as a normal mixture, we can evaluate the centers while controlling the false discovery rate (FDR).

Finite Gaussian mixture models (McLachlan G, 2004) are intuitively appealing for modeling non-homogeneity in a population and detecting subgroup structures. There has been a recent surge in applications of Gaussian mixture models, including clustering analysis (Huang, Li and Guan, 2014), false discovery rate control (Efron, 2004; Liang and Zhang, 2008) and genetic imprinting (Li et al., 2015). In contrast to their usefulness, however, estimation and statistical inference for Gaussian mixture models have been difficult, because many regularity conditions in parametric inference are violated in Gaussian mixture models (Hathaway, 1985; Chen, 1995; Chen and Li, 2009). There has been much recent work in hypothesis testing on the order of finite Gaussian mixture models (Chen, Li and Fu, 2012; Kasahara and Shimotsu, 2015). However, none has studied GLMM with the random effects modeled with Gaussian mixtures.

The rest of the paper is organized as follows. We introduce the model in Section 2 and propose an EM-based estimation procedure in Section 3, where the consistency of the procedure is also established. To decide the number of mixture components, we propose sequential locally restricted likelihood ratio tests in Section 4. In Section 5, we propose a false discovery rate control procedure to evaluate the care qualities of the transplant centers. We conduct simulations in Section 6 and report the analysis of the OPTN kidney transplant data in Section 7. Finally, we end the paper with concluding remarks in Section 8. A simulation procedure to evaluate the null distribution for the test statistic in Section 4.2 is provided in the appendix, and all technical proofs and additional regularity conditions are deferred to the supplementary material.

2 Model and Assumptions

Suppose that there are $n$ independent transplant centers, each treating $N_{i}$ patients, bringing the total sample size to $N=\sum_{i=1}^{n}N_{i}$. Let $Y_{ik}$ be the outcome variable of the $k$th patient treated at the $i$th transplant center and let ${\boldsymbol{X}}_{ik}\in\mathbb{R}^{p}$ be the patient level covariate, $k=1,2,\ldots,N_{i}$, $i=1,2,\ldots,n$. Denote ${\boldsymbol{Y}}_{i}=(Y_{i1},\ldots,Y_{iN_{i}})^{\rm T}$ and ${\boldsymbol{X}}_{i}=({\boldsymbol{X}}_{i1},\ldots,{\boldsymbol{X}}_{iN_{i}})^{\rm T}$, let $\gamma_{i}$ be the random effect that represents the care quality of the $i$th center, and denote $\bm{\gamma}=(\gamma_{1},\ldots,\gamma_{n})^{\rm T}$. Suppose that the conditional density of $Y_{ik}$, given ${\boldsymbol{X}}_{ik}$ and $\gamma_{i}$, belongs to the canonical exponential family:

$$f(Y_{ik}|{\boldsymbol{X}}_{ik},\gamma_{i};{\boldsymbol{\beta}},\varphi)=\exp\left\{\frac{Y_{ik}\xi_{ik}-b(\xi_{ik})}{a(\varphi)}+d(Y_{ik},\varphi)\right\},\tag{1}$$

where $a(\cdot)$, $b(\cdot)$ and $d(\cdot)$ are known functions, $\xi_{ik}={\boldsymbol{X}}_{ik}^{\rm T}\bm{\beta}+\gamma_{i}$ is the canonical parameter with ${\rm E}(Y_{ik}|{\boldsymbol{X}}_{ik},\gamma_{i})=b^{\prime}(\xi_{ik})$, and $\varphi$ is a nuisance parameter. We also assume that $Y_{ik}$ and $Y_{ik^{\prime}}$ are independent given $\gamma_{i}$, for any $k\neq k^{\prime}$. In our transplant center evaluation application, we consider a binary response variable: $Y_{ik}=1$ if the patient died within 5 years after transplant and $Y_{ik}=-1$ otherwise. In the dataset, there was essentially no censoring within the first 5 years, as the transplant patients’ survival information had been closely monitored and tracked. This justifies treating 5-year survival as a binary outcome. With that, model (1) becomes $f({Y}_{ik}|{\boldsymbol{X}}_{ik},\gamma_{i};\bm{\beta})=\{1+\exp(-\xi_{ik}Y_{ik})\}^{-1}$.
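
As a minimal sketch of the binary special case just described (not the authors' code), the center-level conditional log-likelihood $\log f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma_{i};\bm{\beta})=-\sum_{k}\log\{1+\exp(-\xi_{ik}Y_{ik})\}$ could be evaluated as below. The function name `center_log_lik` and the toy inputs are illustrative assumptions.

```python
import numpy as np

def center_log_lik(y, X, beta, gamma):
    """Log-likelihood of one center's outcomes given its effect gamma.

    y     : array of shape (N_i,), coded +1 (death within 5 years) or -1
    X     : array of shape (N_i, p) of patient-level covariates
    beta  : array of shape (p,) of regression coefficients
    gamma : scalar center effect
    """
    xi = X @ beta + gamma  # canonical parameter xi_ik = X_ik' beta + gamma_i
    # log[{1 + exp(-xi * y)}^{-1}] = -log(1 + exp(-xi * y)), computed stably
    return -np.logaddexp(0.0, -xi * y).sum()

# Toy example (hypothetical numbers): two patients, one covariate
y = np.array([1.0, -1.0])
X = np.array([[0.5], [-1.0]])
print(center_log_lik(y, X, beta=np.array([1.0]), gamma=0.0))
```

Using `np.logaddexp` avoids overflow when $|\xi_{ik}|$ is large, which matters once the likelihood is evaluated at extreme values of $\gamma$.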

Assume that the transplant centers belong to $C$ subpopulations and the $c$ th subpopulation can be described by a Gaussian distribution with mean $\mu_{c}$ and variance $\sigma_{c}^{2}$ , $c=1,\ldots,C$ . Marginally, the density of $\gamma_{i}$ is $g(\gamma|{\boldsymbol{\theta}}_{\gamma})=\sum_{c=1}^{C}\pi_{c}f_{c}(\gamma|\mu_{c},\sigma_{c})$ , where $f_{c}(\gamma|\mu_{c},\sigma_{c})=\sigma_{c}^{-1}\phi\{(\gamma-\mu_{c})/\sigma_{c}\}$ , $\phi(\cdot)$ is the standard Gaussian density, $\pi_{c}\in[0,1]$ is the weight for subpopulation $c$ , $\sum_{c=1}^{C}\pi_{c}=1$ , and ${\boldsymbol{\theta}}_{\gamma}=(\mu_{1},\ldots,\mu_{C},\sigma_{1},\ldots,\sigma_{C},\pi_{1},\ldots,\pi_{C})^{T}$ is the collection of parameters in $g(\gamma)$ .
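
The mixture density $g(\gamma|{\boldsymbol{\theta}}_{\gamma})$ and the two-stage way of drawing a center effect (first a subpopulation label, then a Gaussian value) can be sketched as follows. This is an illustration only: the function names and the number of simulated centers are assumptions, and the parameter values are the true values listed for Simulation Model 1 in Table 1.

```python
import numpy as np

def mixture_density(gamma, pi, mu, sigma):
    """Evaluate g(gamma) = sum_c pi_c * N(gamma; mu_c, sigma_c^2)."""
    gamma = np.atleast_1d(np.asarray(gamma, dtype=float))[:, None]  # (G, 1)
    pi, mu, sigma = (np.asarray(a, dtype=float) for a in (pi, mu, sigma))
    z = (gamma - mu) / sigma                                         # (G, C)
    component_pdf = np.exp(-0.5 * z**2) / (np.sqrt(2.0 * np.pi) * sigma)
    return component_pdf @ pi                                        # (G,)

def sample_center_effects(n, pi, mu, sigma, rng):
    """Draw n center effects: a component label first, then a Gaussian draw."""
    labels = rng.choice(len(pi), size=n, p=pi)
    gamma = rng.normal(np.asarray(mu)[labels], np.asarray(sigma)[labels])
    return gamma, labels

# Two-component setting; n = 200 centers is an arbitrary choice for illustration
pi, mu, sigma = [0.5, 0.5], [-3.2598, 0.7402], [1.2, 0.8]
gamma, labels = sample_center_effects(200, pi, mu, sigma, np.random.default_rng(0))
print(mixture_density([-3.0, 0.0, 1.0], pi, mu, sigma))
```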

Denote ${\boldsymbol{Y}}=({\boldsymbol{Y}}_{1}^{\rm T},\ldots,{\boldsymbol{Y}}_{n}^{\rm T})^{\rm T}$, ${\boldsymbol{X}}=({\boldsymbol{X}}_{1}^{\rm T},\ldots,{\boldsymbol{X}}_{n}^{\rm T})^{\rm T}$, and ${\boldsymbol{\theta}}=({\boldsymbol{\theta}}_{y}^{\rm T},{\boldsymbol{\theta}}_{\gamma}^{\rm T})^{\rm T}$ where ${\boldsymbol{\theta}}_{y}=(\bm{\beta}^{\rm T},\varphi)^{\rm T}$. To facilitate an EM algorithm, define $\bm{L}_{i}=({L}_{i1},\ldots,{L}_{iC})^{\rm T}\sim$ Multinomial$(\pi_{1},\ldots,\pi_{C})$ as a latent random vector of subpopulation memberships, where $L_{ic}=1$ if $\gamma_{i}$ belongs to component $c$ and $L_{ic}=0$ otherwise. Then the log-likelihood function for the complete data, comprising both observed and latent variables, is

$$l_{comp}({\boldsymbol{\theta}};{\boldsymbol{Y}},{\boldsymbol{X}},{\boldsymbol{\gamma}},{\boldsymbol{L}})=\sum_{i=1}^{n}\ell_{i,comp}({\boldsymbol{\theta}};{\boldsymbol{Y}}_{i},{\boldsymbol{X}}_{i},\gamma_{i},{\boldsymbol{L}}_{i}),$$

where $\ell_{i,comp}({\boldsymbol{\theta}};{\boldsymbol{Y}}_{i},{\boldsymbol{X}}_{i},\gamma_{i},{\boldsymbol{L}}_{i})=\log f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma_{i};{\boldsymbol{\theta}}_{y})+\sum_{c=1}^{C}L_{ic}[\log\pi_{c}-\frac{1}{2}\log(\sigma_{c}^{2})+\log\phi\{(\gamma_{i}-\mu_{c})/\sigma_{c}\}]$ and $f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma_{i};{\boldsymbol{\theta}}_{y})=\prod_{k=1}^{N_{i}}f(Y_{ik}|{\boldsymbol{X}}_{ik},\gamma_{i};{\boldsymbol{\theta}}_{y})$.

3 Estimation Procedure

Though conceptually appealing, Gaussian mixture models possess some undesirable properties: slower convergence rate if the number of components is unknown (Chen, 1995); unbounded likelihood if any of the component variance parameters $\sigma_{c}^{2}$ goes to 0 (Hathaway, 1985); and infinite Fisher information on some boundary points of the parameter space (Chen and Li, 2009). The solution to these problems in the literature is to either restrict the value of the parameters away from the boundaries (Hathaway, 1985) or include a penalty function to prevent any $\sigma_{c}$ from converging to 0 (Chen, Tan and Zhang, 2008; Chen and Li, 2009).

We propose to adopt the latter by maximizing a penalized complete data likelihood

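$$l_{comp,p}({\boldsymbol{\theta}};{\boldsymbol{Y}},{\boldsymbol{X}},{\boldsymbol{\gamma}},{\boldsymbol{L}})=l_{comp}({\boldsymbol{\theta}};{\boldsymbol{Y}},{\boldsymbol{X}},{\boldsymbol{\gamma}},{\boldsymbol{L}})+\sum_{c=1}^{C}p_{n}(\sigma_{c}^{2}),$$

while treating ${\boldsymbol{\gamma}}$ and ${\boldsymbol{L}}$ as missing data. Chen, Tan and Zhang (2008) provided asymptotic conditions on $p_{n}(\sigma^{2})$ that ensure the consistency of the estimator. In all of our numerical studies, we use the following penalty proposed by Chen and Li (2009):

$$p_{n}(\sigma^{2};\widehat{\sigma}^{2}_{pilot})=-a_{n}\left\{\widehat{\sigma}^{2}_{pilot}/\sigma^{2}+\log(\sigma^{2}/\widehat{\sigma}^{2}_{pilot})-1\right\},\tag{3}$$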

where $\widehat{\sigma}^{2}_{pilot}$ is a pilot estimate for the variance of $\gamma$ . One possible choice of $\widehat{\sigma}^{2}_{pilot}$ is the variance estimator assuming $\gamma_{i}$ are i.i.d. Gaussian variables. When $a_{n}=o_{p}(n^{1/4})$ , the penalty function in (3) satisfies the assumptions for our asymptotic theory. A similar requirement on $a_{n}$ is made by Chen and Li (2009).
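
To see how the penalty in (3) behaves, the short sketch below evaluates $p_{n}(\sigma^{2};\widehat{\sigma}^{2}_{pilot})=-a_{n}\{\widehat{\sigma}^{2}_{pilot}/\sigma^{2}+\log(\sigma^{2}/\widehat{\sigma}^{2}_{pilot})-1\}$ at a few variances. The function name and the values $a_{n}=1$ and $\widehat{\sigma}^{2}_{pilot}=1$ are assumptions chosen for illustration. The penalty is zero at $\sigma^{2}=\widehat{\sigma}^{2}_{pilot}$, negative elsewhere, and diverges to $-\infty$ as $\sigma^{2}\to 0$, which is what keeps the component variances away from the degenerate boundary.

```python
import numpy as np

def variance_penalty(sigma2, sigma2_pilot, a_n):
    """Penalty p_n(sigma^2; pilot) = -a_n * {pilot/sigma^2 + log(sigma^2/pilot) - 1}."""
    r = sigma2_pilot / np.asarray(sigma2, dtype=float)
    # With r = pilot / sigma^2 the bracket equals r - log(r) - 1 >= 0
    return -a_n * (r - np.log(r) - 1.0)

# Illustrative values only: a_n = 1 and pilot variance 1 are assumptions
for s2 in [1e-6, 0.5, 1.0, 2.0]:
    print(s2, variance_penalty(s2, sigma2_pilot=1.0, a_n=1.0))
```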

3.1 EM algorithm with Gauss-Hermite quadrature

We propose an EM algorithm to maximize the penalized likelihood. At the $t$ th iteration of the algorithm, given the parameter value ${\boldsymbol{\theta}}^{(t-1)}$ from the previous iteration, we first evaluate the following loss function at the E-step

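$$\begin{aligned}Q({\boldsymbol{\theta}}|{\boldsymbol{\theta}}^{(t-1)})&=\sum_{i=1}^{n}{\rm E}\left[\ell_{i,comp}({\boldsymbol{\theta}};{\boldsymbol{Y}}_{i},{\boldsymbol{X}}_{i},\gamma_{i},{\boldsymbol{L}}_{i})\mid{\boldsymbol{Y}}_{i},{\boldsymbol{X}}_{i},{\boldsymbol{\theta}}^{(t-1)}\right]+\sum_{c=1}^{C}p_{n}(\sigma_{c}^{2}),\\{\rm E}\left[\ell_{i,comp}({\boldsymbol{\theta}};{\boldsymbol{Y}}_{i},{\boldsymbol{X}}_{i},\gamma_{i},{\boldsymbol{L}}_{i})\mid{\boldsymbol{Y}}_{i},{\boldsymbol{X}}_{i},{\boldsymbol{\theta}}^{(t-1)}\right]&=\sum_{c=1}^{C}\int\log f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma;{\boldsymbol{\theta}}_{y})\,f(\gamma,L_{ic}=1|{\boldsymbol{X}}_{i},{\boldsymbol{Y}}_{i};{\boldsymbol{\theta}}^{(t-1)})\,d\gamma\\&\quad+\sum_{c=1}^{C}\int\log f_{c}(\gamma|\mu_{c},\sigma_{c})\,f(\gamma,L_{ic}=1|{\boldsymbol{X}}_{i},{\boldsymbol{Y}}_{i};{\boldsymbol{\theta}}^{(t-1)})\,d\gamma\\&\quad+\sum_{c=1}^{C}\log\pi_{c}\int f(\gamma,L_{ic}=1|{\boldsymbol{X}}_{i},{\boldsymbol{Y}}_{i};{\boldsymbol{\theta}}^{(t-1)})\,d\gamma,\end{aligned}$$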

where

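$$f(\gamma,L_{ic}=1|{\boldsymbol{X}}_{i},{\boldsymbol{Y}}_{i};{\boldsymbol{\theta}}^{(t-1)})=\frac{\pi_{c}^{(t-1)}f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma;{\boldsymbol{\theta}}_{y}^{(t-1)})\frac{1}{\sigma_{c}^{(t-1)}}\phi\left(\frac{\gamma-\mu_{c}^{(t-1)}}{\sigma_{c}^{(t-1)}}\right)}{\sum_{c'=1}^{C}\pi_{c'}^{(t-1)}\int f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma;{\boldsymbol{\theta}}_{y}^{(t-1)})\frac{1}{\sigma_{c'}^{(t-1)}}\phi\left(\frac{\gamma-\mu_{c'}^{(t-1)}}{\sigma_{c'}^{(t-1)}}\right)d\gamma}.$$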

Integrals with respect to a Gaussian density can be well approximated by Gauss-Hermite quadrature:

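$$\int h(\gamma)\frac{1}{\sigma}\phi\{(\gamma-\mu)/\sigma\}\,d\gamma\approx\frac{1}{\sqrt{\pi}}\sum_{m=1}^{M}w_{m}h(\gamma_{m}),$$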

where $h(\gamma)$ is an integrable real valued function, $\gamma_{m}=\mu+\sqrt{2}\sigma d_{m}$ , $d_{1},d_{2},\ldots,d_{M}$ are the Gauss-Hermite abscissas and $w_{1},w_{2},\ldots,w_{M}$ are the corresponding quadrature weights. We find in our numerical studies that using $M=100$ quadrature points usually provides a close enough approximation. More details on the Gauss-Hermite approximation of the loss function, $\widehat{Q}({\boldsymbol{\theta}}|{\boldsymbol{\theta}}^{(t-1)})$ , are provided in the supplementary material.
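
The quadrature rule can be checked numerically with NumPy's `numpy.polynomial.hermite.hermgauss`, which returns the abscissas $d_{m}$ and weights $w_{m}$ for the weight function $e^{-d^{2}}$. The sketch below is an illustration under that convention: the substitution $\gamma_{m}=\mu+\sqrt{2}\sigma d_{m}$ gives $\int h(\gamma)\sigma^{-1}\phi\{(\gamma-\mu)/\sigma\}d\gamma\approx\pi^{-1/2}\sum_{m}w_{m}h(\gamma_{m})$. The function name and the test integrand are assumptions.

```python
import numpy as np

def gauss_hermite_expectation(h, mu, sigma, M=100):
    """Approximate E[h(gamma)] for gamma ~ N(mu, sigma^2) with M quadrature points."""
    d, w = np.polynomial.hermite.hermgauss(M)   # abscissas d_m and weights w_m
    gamma = mu + np.sqrt(2.0) * sigma * d       # gamma_m = mu + sqrt(2) * sigma * d_m
    return (w * h(gamma)).sum() / np.sqrt(np.pi)

# Sanity check against a known moment: E[gamma^2] = mu^2 + sigma^2
mu, sigma = 0.7402, 0.8
approx = gauss_hermite_expectation(lambda g: g**2, mu, sigma)
print(approx, mu**2 + sigma**2)
```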

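In the M-step, we maximize $\widehat{Q}({\boldsymbol{\theta}}|{\boldsymbol{\theta}}^{(t-1)})$ with respect to ${\boldsymbol{\theta}}$. Define $\gamma^{(c,m)}=\mu_{c}^{(t-1)}+\sqrt{2}\sigma_{c}^{(t-1)}d_{m}$ and

$$\omega_{icm}=\frac{\tilde{\omega}_{icm}}{\sum_{c'=1}^{C}\sum_{m'=1}^{M}\tilde{\omega}_{ic'm'}},\quad\hbox{where }\tilde{\omega}_{icm}=w_{m}f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma^{(c,m)};{\boldsymbol{\theta}}_{y}^{(t-1)})\pi_{c}^{(t-1)}.\tag{5}$$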

We then update different components of ${\boldsymbol{\theta}}$

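$$\pi_{c}^{(t)}=\frac{1}{n}\sum_{i=1}^{n}\sum_{m=1}^{M}\omega_{icm},\qquad\mu_{c}^{(t)}=\frac{\sum_{i=1}^{n}\sum_{m=1}^{M}\gamma^{(c,m)}\omega_{icm}}{\sum_{i=1}^{n}\sum_{m=1}^{M}\omega_{icm}},\qquad(\sigma_{c}^{2})^{(t)}=\frac{\sum_{i=1}^{n}\sum_{m=1}^{M}(\gamma^{(c,m)}-\mu_{c}^{(t)})^{2}\omega_{icm}+2a_{n}\widehat{\sigma}^{2}_{pilot}}{\sum_{i=1}^{n}\sum_{m=1}^{M}\omega_{icm}+2a_{n}},$$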

and obtain ${\boldsymbol{\theta}}_{y}^{(t)}$ by maximizing

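$$\sum_{i=1}^{n}\sum_{c=1}^{C}\sum_{m=1}^{M}\omega_{icm}\log f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma^{(c,m)};{\boldsymbol{\theta}}_{y})$$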

using iteratively reweighted least squares. We adopt the rule of Booth and Hobert (1999) and declare that the algorithm has converged at iteration $t$ if

$$\max_{l}\frac{|\theta_{l}^{(t)}-\theta_{l}^{(t-1)}|}{|\theta_{l}^{(t-1)}|+0.001}<0.001,$$

where $\theta_{l}$ is the $l$ th entry in ${\boldsymbol{\theta}}$ .
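
One iteration of the procedure can be sketched as follows for the binary model with $Y_{ik}\in\{-1,1\}$. This is a simplified illustration rather than the authors' implementation: it computes the normalized quadrature weights $\omega_{icm}\propto w_{m}f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma^{(c,m)};{\boldsymbol{\theta}}_{y})\pi_{c}$ and the closed-form updates of $\pi_{c}$, $\mu_{c}$ and $\sigma_{c}^{2}$, but holds $\bm{\beta}$ fixed instead of updating it by iteratively reweighted least squares. The function names, the simulated data, $M=50$, $a_{n}=1$ and the pilot variance are all assumptions.

```python
import numpy as np

def em_step_mixture(y_list, X_list, beta, pi, mu, sigma, a_n, sigma2_pilot, M=50):
    """One E-step plus the closed-form M-step updates for (pi, mu, sigma)."""
    d, w = np.polynomial.hermite.hermgauss(M)
    # Quadrature points gamma^{(c,m)} built from the current mu_c and sigma_c
    nodes = mu[:, None] + np.sqrt(2.0) * sigma[:, None] * d[None, :]      # (C, M)
    n, C = len(y_list), len(pi)
    omega = np.empty((n, C, M))
    for i, (y, X) in enumerate(zip(y_list, X_list)):
        xi = (X @ beta)[:, None, None] + nodes[None, :, :]                # (N_i, C, M)
        loglik = -np.logaddexp(0.0, -xi * y[:, None, None]).sum(axis=0)   # (C, M)
        log_tilde = loglik + np.log(pi)[:, None]
        # Subtract the maximum before exponentiating to avoid underflow
        tilde = w[None, :] * np.exp(log_tilde - log_tilde.max())
        omega[i] = tilde / tilde.sum()          # normalized over c and m
    mass = omega.sum(axis=(0, 2))               # sum_i sum_m omega_icm, shape (C,)
    pi_new = mass / n
    mu_new = (omega * nodes[None]).sum(axis=(0, 2)) / mass
    ss = (omega * (nodes[None] - mu_new[None, :, None]) ** 2).sum(axis=(0, 2))
    sigma2_new = (ss + 2.0 * a_n * sigma2_pilot) / (mass + 2.0 * a_n)     # penalized update
    return pi_new, mu_new, np.sqrt(sigma2_new), omega

def has_converged(theta_new, theta_old, tol=1e-3, offset=1e-3):
    """Relative-change stopping rule: max_l |new - old| / (|old| + offset) < tol."""
    rel = np.abs(theta_new - theta_old) / (np.abs(theta_old) + offset)
    return bool(rel.max() < tol)

# Simulated toy data (all settings hypothetical): 40 centers with 30 patients each
rng = np.random.default_rng(1)
n, N_i, beta = 40, 30, np.array([1.0])
true_gamma = np.where(rng.random(n) < 0.5, rng.normal(-3.26, 1.2, n), rng.normal(0.74, 0.8, n))
X_list = [rng.normal(size=(N_i, 1)) for _ in range(n)]
y_list = [np.where(rng.random(N_i) < 1.0 / (1.0 + np.exp(-(X @ beta + g))), 1.0, -1.0)
          for X, g in zip(X_list, true_gamma)]
pi0, mu0, sigma0 = np.array([0.5, 0.5]), np.array([-2.0, 0.5]), np.array([1.0, 1.0])
pi1, mu1, sigma1, omega = em_step_mixture(y_list, X_list, beta, pi0, mu0, sigma0,
                                          a_n=1.0, sigma2_pilot=4.0)
print(pi1, mu1, sigma1)
```

In a full fit, this step would be repeated together with the update of ${\boldsymbol{\theta}}_{y}$ until `has_converged` returns `True` for the stacked parameter vector.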

At convergence, the weight $\omega_{icm}$ can be used to calculate some other quantities of interest, such as the marginal likelihood, the posterior probability of $\gamma_{i}$ belonging to the $c$ th component and posterior mean of $\gamma_{i}$ . For example, we predict $\gamma_{i}$ by its posterior mean

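$$\int\gamma f(\gamma|{\boldsymbol{Y}}_{i},{\boldsymbol{X}}_{i},{\boldsymbol{\theta}})\,d\gamma=\frac{\sum_{c=1}^{C}\pi_{c}\int\gamma f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma;{\boldsymbol{\theta}}_{y})\phi\{(\gamma-\mu_{c})/\sigma_{c}\}/\sigma_{c}\,d\gamma}{\sum_{c=1}^{C}\pi_{c}\int f({\boldsymbol{Y}}_{i}|{\boldsymbol{X}}_{i},\gamma;{\boldsymbol{\theta}}_{y})\phi\{(\gamma-\mu_{c})/\sigma_{c}\}/\sigma_{c}\,d\gamma}.$$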

Using the Gauss-Hermite approximation, the posterior mean is approximated as

$$\widehat{\gamma}_{i}=\sum_{c=1}^{C}\sum_{m=1}^{M}\gamma^{(c,m)}\omega_{icm},$$

where $\omega_{icm}$ is defined in (5) evaluated at $\widehat{\boldsymbol{\theta}}$ .
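
Given the converged weights, both summaries mentioned above reduce to weighted sums over the quadrature grid: the posterior mean is $\sum_{c}\sum_{m}\gamma^{(c,m)}\omega_{icm}$ and the posterior probability that center $i$ belongs to component $c$ is $\sum_{m}\omega_{icm}$. A minimal sketch, with a hypothetical function name and toy weights chosen only to make the arithmetic easy to follow:

```python
import numpy as np

def posterior_summaries(omega, nodes):
    """Posterior means and component probabilities from quadrature weights.

    omega : (n, C, M) weights, normalized so each center's weights sum to one
    nodes : (C, M) quadrature points gamma^{(c,m)}
    """
    gamma_hat = (omega * nodes[None, :, :]).sum(axis=(1, 2))  # posterior means, (n,)
    comp_prob = omega.sum(axis=2)                             # P(L_ic = 1 | data), (n, C)
    return gamma_hat, comp_prob

# Toy values: one center, C = 2 components, M = 2 quadrature points
omega = np.array([[[0.1, 0.2], [0.3, 0.4]]])
nodes = np.array([[-1.0, 0.0], [1.0, 2.0]])
gamma_hat, comp_prob = posterior_summaries(omega, nodes)
print(gamma_hat, comp_prob)
```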

To obtain some reasonable initial values for ${\boldsymbol{\theta}}_{y}$ and ${\boldsymbol{\theta}}_{\gamma}$ , we first run a generalized linear mixed model assuming $\gamma_{i}$ ’s are i.i.d. normal. We use the estimated fixed effects as initial values for $\bm{\theta}_{y}$ , fit a Gaussian mixture model on the predicted values $\widehat{\bm{\gamma}}$ and use the results as the initial values for ${\boldsymbol{\theta}}_{\gamma}$ .

3.2 Consistency of the estimator

The EM algorithm essentially maximizes the following penalized marginal likelihood

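$$l_{pen}({\boldsymbol{\theta}};{\boldsymbol{Y}},{\boldsymbol{X}})=l_{n}({\boldsymbol{\theta}};{\boldsymbol{Y}},{\boldsymbol{X}})+\sum_{c=1}^{C}p_{n}(\sigma_{c}^{2}),$$

where

$$l_{n}({\boldsymbol{\theta}};{\boldsymbol{Y}},{\boldsymbol{X}})=\sum_{i=1}^{n}\log\int\left\{\prod_{k=1}^{N_{i}}f(Y_{ik}|{\boldsymbol{X}}_{ik},\gamma;{\boldsymbol{\theta}}_{y})\,g(\gamma|{\boldsymbol{\theta}}_{\gamma})\right\}d\gamma.\tag{8}$$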

The parameter space for a model with exactly $C$ components is

[TABLE]

The closure of $\Theta_{C}$ is $\bar{\Theta}_{C}=\{{\boldsymbol{\theta}}\mid\bm{\beta}\in\mathbb{R}^{p},\ \sum_{c=1}^{C}\pi_{c}=1,\ 0\leq\pi_{c}\leq 1,\ \mu_{1}\leq\mu_{2}\leq\cdots\leq\mu_{C},\ \sigma_{c}\geq 0,\ c=1,2,\ldots,C\}$, which also includes the over-fitted models. In other words, $\bar{\Theta}_{C}$ admits models where the true number of components is strictly less than $C$. There are multiple ways to parameterize an extra component in $\bar{\Theta}_{C}$. For example, setting either $\pi_{c}=0$ or $(\mu_{c},\sigma_{c})=(\mu_{c^{\prime}},\sigma_{c^{\prime}})$ for some $c^{\prime}\neq c$ means component $c$ does not exist. Various parameter values under these circumstances are identified as a single value, because they lead to the same mixture model. Let ${\boldsymbol{\theta}}_{0}\in\bar{\Theta}_{C}$ be the true parameter, $f({\boldsymbol{x}},{\boldsymbol{y}}|{\boldsymbol{\theta}})$ be the joint distribution function of $({\boldsymbol{X}},{\boldsymbol{Y}})$ associated with the likelihood in (8) and

$${\cal F}=\left\{{\boldsymbol{\theta}}\in\bar{\Theta}_{C};\ \int_{-\infty}^{({\boldsymbol{x}}^{\prime},{\boldsymbol{y}}^{\prime})}f({\boldsymbol{x}},{\boldsymbol{y}}|{\boldsymbol{\theta}})\,d\mu({\boldsymbol{x}},{\boldsymbol{y}})=\int_{-\infty}^{({\boldsymbol{x}}^{\prime},{\boldsymbol{y}}^{\prime})}f({\boldsymbol{x}},{\boldsymbol{y}}|{\boldsymbol{\theta}}_{0})\,d\mu({\boldsymbol{x}},{\boldsymbol{y}})\hbox{ for any }({\boldsymbol{x}}^{\prime},{\boldsymbol{y}}^{\prime})\right\}.$$

Following Hathaway (1985), we identify ${\cal F}$ as a single point, stated as Assumption 4 in the supplementary material.

Denote the maximum penalized likelihood estimator under a $C$ -component mixture model by $\widehat{{\boldsymbol{\theta}}}_{C}=\arg\max_{{\boldsymbol{\theta}}\in\bar{\Theta}_{C}}l_{pen}({\boldsymbol{\theta}}).$ Because $\widehat{\boldsymbol{\theta}}_{C}$ can be considered as a modified maximum likelihood estimator (Kiefer and Wolfowitz, 1956), its consistency follows from similar arguments as in Kiefer and Wolfowitz (1956) and Hathaway (1985). The consistency for $\widehat{\boldsymbol{\theta}}_{C}$ is established in the following proposition, the proof of which is relegated to the supplementary material.

Proposition 1.

Under Assumptions 1-6 in the supplementary material, $\widehat{{\boldsymbol{\theta}}}_{C}$ is consistent in the sense $\inf_{{\boldsymbol{\theta}}^{\ast}\in{\cal F}}\|\widehat{\boldsymbol{\theta}}_{C}-{\boldsymbol{\theta}}^{\ast}\|\to 0$ in probability.

4 Deciding the Number of Mixture Components

Deciding the number of components is key in answering whether there are subgroups of transplant centers that are under-performing or out-performing the rest. There are two commonly used approaches, the model selection approach (Ishwaran, James and Sun, 2001; Woo and Sriram, 2006) and the hypothesis testing approach, with different focuses as argued in Chen, Li and Fu (2012). The model selection approach seeks a model to adequately describe the data, while the hypothesis testing approach is used to validate scientific claims. In this paper, we focus on the hypothesis testing approach because it quantifies the uncertainty of our decisions by providing $p$ -values. Among many hypotheses that we can test, the most important one is $H_{0}:\ C_{0}=1$ vs $H_{1}:\ C_{0}=2$ , where $C_{0}$ is the true number of components. This test is also referred to as the homogeneity test, since the null hypothesis means all transplant centers are from the same homogeneous population and none is under- or over-performing. If $H_{0}:\ C_{0}=1$ is rejected, we will also sequentially test other hypotheses of the form $H_{0}:\ C_{0}=C$ vs $H_{1}:\ C_{0}=C+1$ , $C=2,3,\ldots$ , in search of the true number of components.

Because of the loss of strong identifiability for finite Gaussian mixture models, the regular asymptotic theory for likelihood ratio tests (LRT) does not hold. Instead, Chen, Li and Fu (2012) and Kasahara and Shimotsu (2015) proposed a locally restricted likelihood ratio test that confines the parameter space in a local alternative model to ensure the existence of an asymptotic distribution for the test statistic. We extend such a test to the GLMM setting.

4.1 Homogeneity Test

We first consider $H_{0}:C_{0}=1$ vs $H_{1}:C_{0}=2$ . We refer to the model under the null hypothesis as the reduced model and that under the alternative as the full model. When the null hypothesis is true, $\gamma_{i}$ are i.i.d. random variables following $\hbox{Normal}(\mu_{\gamma},\sigma_{\gamma}^{2})$ . However, this model is not uniquely parameterized in the full model, unless we restrict the values of some parameters. Following Chen, Li and Fu (2012), we restrict the parameter space under the full model to $\bar{\Theta}_{2}(\tau)=\{{\boldsymbol{\theta}}=(\mu_{1},\mu_{2},\sigma_{1},\sigma_{2},\pi_{1},\pi_{2})^{\rm T};\ \mu_{1},\mu_{2}\in\mathbb{R},\sigma_{1},\sigma_{2}\geq 0,\pi_{1}=\tau,\pi_{2}=1-\tau\}$ , for a fixed $\tau\in(0,0.5]$ . By doing so, we do not impose any constraints on the order between $\mu_{1}$ and $\mu_{2}$ . In $\bar{\Theta}_{2}(\tau)$ , the null model is uniquely parameterized by ${\boldsymbol{\theta}}_{0}(\tau)=\{{\boldsymbol{\theta}}_{y,0}^{\rm T},{\boldsymbol{\theta}}_{\gamma,0}^{\rm T}(\tau)\}^{\rm T}$ , where ${\boldsymbol{\theta}}_{\gamma,0}(\tau)=(\mu_{\gamma},\mu_{\gamma},\sigma_{\gamma},\sigma_{\gamma},\tau,1-\tau)^{\rm T}$ .

4.1.1 Asymptotic Behavior of the Estimators

Let $\bar{\Theta}_{1}$ be the parameter space when $C_{0}=1$ and the reduced model estimator be $\widehat{{\boldsymbol{\theta}}}_{red}=\arg\max_{{\boldsymbol{\theta}}\in\bar{\Theta}_{1}}l_{pen}({\boldsymbol{\theta}})$ , which is the usual MLE for GLMM under the Gaussian random effect assumption. Under the full model, the estimator under a fixed $\tau$ is

[TABLE]

This estimator can be obtained using the EM algorithm described in Section 3 without the step for updating $\pi_{c}$ ’s. The following proposition provides the convergence rate of $\widehat{{\boldsymbol{\theta}}}_{full}(\tau)$ under the null hypothesis.

Proposition 2.

Under $H_{0}:C_{0}=1$ and Assumptions 1-7 in the supplementary material, for any fixed $\tau\in(0,0.5]$ , $\widehat{\bm{\beta}}_{full}(\tau)-\bm{\beta}_{0}=O_{p}(n^{-1/2})$ , and $\widehat{\boldsymbol{\theta}}_{\gamma,full}(\tau)-{\boldsymbol{\theta}}_{\gamma,0}(\tau)=O_{p}(n^{-1/4})$ .

Remark: We use a similar reparameterization as Kasahara and Shimotsu (2015) in the proof of Proposition 2. As shown in the proof, many derivatives of the log likelihood are either exactly zero or have mean zero, and it takes a ninth order Taylor expansion to get a local quadratic approximation to the penalized likelihood. The convergence rate in the proposition means that, for an over-fitted mixture model, the GLMM regression coefficient ${\boldsymbol{\beta}}$ still enjoys the root- $n$ convergence rate, while the parameters of the latent Gaussian mixture model converge much more slowly. This slow convergence rate also stresses a fundamental difference between our latent Gaussian mixture model and the common parametric models.

4.1.2 Test Procedure

Let $\bm{\mathcal{T}}$ be any subset of $(0,0.5]$ and define the test statistic

[TABLE]

Proposition 3.

Under $H_{0}:C_{0}=1$ and Assumptions 1-7, $\widetilde{T}_{1}\buildrel d\over{\longrightarrow}\chi^{2}(2)$ as $n\to\infty$ .

Remark: Our proof of Proposition 3 shows that, under $H_{0}:C_{0}=1$ , $T_{1}(\tau)\buildrel d\over{\longrightarrow}\chi^{2}(2)$ for any fixed $\tau$ . In fact, if there is only one true component, no matter how we choose to split that component, the leading term in the asymptotic expansion of $T_{1}(\tau)$ remains the same. We define $\widetilde{T}_{1}$ as the maximum of $T_{1}(\tau)$ over ${\cal T}$ to increase the power: if $H_{1}$ is true, the more values of $\tau$ we try, the better chance we have to detect an extra component. Proposition 3 holds if $\widetilde{T}_{1}$ is the maximum of $T_{1}(\tau)$ over the whole interval $(0,0.5]$ , but for practical consideration ${\cal T}$ is often taken as a finite subset.

The detailed test procedure is given as follows.

Step 0. Obtain $\widehat{\boldsymbol{\theta}}_{red}$ and $l_{n}(\widehat{\boldsymbol{\theta}}_{red})$ .

Step 1. For a fixed $\tau$ , obtain $\widehat{\boldsymbol{\theta}}_{full}(\tau)$ . To improve the chance that the global maximum of the penalized likelihood is reached, try 100 randomly selected initial values for ${\boldsymbol{\theta}}(\tau)$ and keep the solution with the largest penalized likelihood.

Step 2. (Optional) Using $\widehat{{\boldsymbol{\theta}}}_{full}(\tau)$ obtained in Step 1 as the starting value, perform two more EM iterations without fixing $\tau$ , and use the resulting estimator to evaluate $T_{1}(\tau)$ .

Step 3. Repeat Steps 1 and 2 for each $\tau\in\bm{\mathcal{T}}$ to obtain $\widetilde{T}_{1}$ , where $\bm{\mathcal{T}}$ is set to be $\{0.1,0.3,0.5\}$ following the recommendation of Chen, Li and Fu (2012).

Step 4. For a size $\alpha$ test, reject $H_{0}:C_{0}=1$ if $\widetilde{T}_{1}>\chi^{2}_{\alpha}(2)$ .

In Step 2, we perform two more EM iterations without fixing $\tau$ to increase the power of the test, which is the recommendation of Chen, Li and Fu (2012).
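Steps 3 and 4 of the procedure reduce to simple arithmetic once the log-likelihoods are available, because the $\chi^{2}(2)$ distribution has the closed-form survival function $\exp(-t/2)$ . The following Python sketch illustrates this under one stated assumption: that $T_{1}(\tau)$ takes the usual likelihood ratio form $2\{l_{n}(\widehat{\boldsymbol{\theta}}_{full}(\tau))-l_{n}(\widehat{\boldsymbol{\theta}}_{red})\}$ . The function name `homogeneity_test` and its arguments are hypothetical, and the log-likelihood values would come from the fitting steps described above.

```python
import math


def homogeneity_test(loglik_red, loglik_full_by_tau, alpha=0.05):
    """Combine fitted log-likelihoods into the maximum statistic and a decision.

    loglik_red         -- log-likelihood of the one-component (reduced) model
    loglik_full_by_tau -- dict mapping each tau to the log-likelihood of the
                          two-component fit obtained for that tau
    Assumes T_1(tau) = 2 * (full - reduced), the usual likelihood ratio form.
    """
    t_by_tau = {tau: 2.0 * (ll - loglik_red)
                for tau, ll in loglik_full_by_tau.items()}
    t_max = max(t_by_tau.values())
    # Survival function of chi-square with 2 degrees of freedom: exp(-t / 2).
    p_value = min(1.0, math.exp(-max(t_max, 0.0) / 2.0))
    # Upper-alpha critical value of chi-square(2): -2 * log(alpha).
    critical_value = -2.0 * math.log(alpha)
    return t_max, p_value, t_max > critical_value
```

For example, `homogeneity_test(-100.0, {0.1: -97.0, 0.3: -96.0, 0.5: -98.0})` uses the fit at $\tau=0.3$ , which gives the largest statistic, and compares it with the 5% critical value $-2\log(0.05)\approx 5.99$ .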

4.2 Testing for C greater than 2

Next, we consider a test $H_{0}:C_{0}=C$ vs $H_{1}:C_{0}=C+1$ for a $C\geq 2$ . We now refer to the model with $C$ components as the reduced model and that with $C+1$ components as the full model. We first estimate the reduced model and let the reduced model estimator be $\widehat{{\boldsymbol{\theta}}}_{red}=\arg\max_{{\boldsymbol{\theta}}\in\bar{\Theta}_{C}}l_{pen}({\boldsymbol{\theta}})$ . Assuming $H_{0}$ is true, denote the true value of the parameter by ${\boldsymbol{\theta}}_{0}$ and order the true mean parameters by $\mu_{1,0}<\mu_{2,0}<\cdots<\mu_{C,0}$ . This parameter is not uniquely identified in the full model: if any $\pi_{c}=0$ or $(\mu_{c},\sigma_{c})=(\mu_{c+1},\sigma_{c+1})$ for some $c\in\{1,2,\ldots,C\}$ , the full model degenerates to the reduced model. In order to make the reduced model identifiable in $\bar{\Theta}_{C+1}$ , we will impose constraints that $\pi_{c}>0$ for all $c=1,\ldots,C+1$ and $\pi_{c}/(\pi_{c}+\pi_{c+1})=\tau$ for some $c$ and a fixed $\tau\in(0,0.5]$ like we did in Section 4.1.

4.2.1 Locally Restricted Full Model Estimators

To test if a $(C+1)$ -component mixture model fits the data better, we will test to see if any one of the $C$ components in the reduced model can be further split into two. Define non-overlapping intervals $D_{1},\ldots,D_{C}$ such that $\mu_{c,0}\in D_{c}$ . For a fixed $\tau\in(0,0.5]$ and $c\in\{1,\ldots,C\}$ , define neighborhoods in the parameter space $\bar{\Theta}_{C+1}$

[TABLE]

The neighborhood ${\cal N}_{C+1}(c,\tau)$ collects the parameters that split the $c$ th component into two daughter components with a split proportion $\tau$ , while restricting the other mean parameters from changing too much. The definition of ${\cal N}_{C+1}(c,\tau)$ requires knowledge about intervals $\{D_{1},D_{2},\ldots,D_{C}\}$ that contain the true mean parameters. In practice, we already have a consistent estimator of $\mu_{c,0}$ from fitting the reduced model, and replacing $\{D_{c}\}_{c=1}^{C}$ with their consistent estimates does not affect the asymptotic behavior of the test we are about to propose. A practical choice for $\{D_{c}\}_{c=1}^{C}$ is provided below in the test procedure. As in Section 4.1, we do not restrict the order between $\mu_{c}$ and $\mu_{c+1}$ in ${\cal N}_{C+1}(c,\tau)$ because $\tau$ is restricted to $(0,0.5]$ .

Define the locally restricted full model estimator as

[TABLE]

To obtain this estimator, we need some minor adjustments to the EM algorithm in Section 3. First, we update $\pi_{c}+\pi_{c+1}$ as a single parameter and then assign values for $\pi_{c}$ and $\pi_{c+1}$ proportional to $\tau$ . Second, after each $M$ -step, we enforce the restrictions in ${\cal N}_{C+1}(c,\tau)$ by forcing any $\mu_{c^{\prime}}$ stepping out of boundary back to its predetermined range. A similar scheme is used in Chen, Li and Fu (2012).

The following convergence rate result echoes Proposition 2. It shows that the component that we are trying to split suffers a slower convergence rate, because it is overfitted in ${\cal N}_{C+1}(c,\tau)$ as a mixture of two daughter components, while the rest of the parameters converge at the root- $n$ rate.

Proposition 4.

Under $H_{0}:C_{0}=C$ and Assumptions 1-8 in the supplementary material, for any fixed $\tau\in(0,0.5]$ ,

[TABLE]

and $\widehat{\boldsymbol{\theta}}_{y,full}(c,\tau)-{\boldsymbol{\theta}}_{y0}=O_{p}(n^{-1/2})$ , $\widehat{\boldsymbol{\theta}}_{\gamma,c^{\prime},full}(c,\tau)-{\boldsymbol{\theta}}_{\gamma,c^{\prime},0}=O_{p}(n^{-1/2})$ for $c^{\prime}<c$ , $\widehat{\boldsymbol{\theta}}_{\gamma,c^{\prime},full}(c,\tau)-{\boldsymbol{\theta}}_{\gamma,c^{\prime}-1,0}=O_{p}(n^{-1/2})$ for $c^{\prime}>c+1$ , where ${\boldsymbol{\theta}}_{\gamma,c^{\prime}}=(\mu_{c^{\prime}},\sigma_{c^{\prime}},\pi_{c^{\prime}})^{\rm T}$ .

4.2.2 Local Reparameterization, Test Statistic and Asymptotics

To test if any component in the reduced model can be further divided into two, define the test statistic

[TABLE]

Let ${\cal T}$ be any finite subset of $(0,0.5]$ and define the test statistic

[TABLE]

In order to understand the asymptotic behavior of $T_{C}(c,\tau)$ , we adopt the reparameterization of Kasahara and Shimotsu (2015) in ${\cal N}_{C+1}(c,\tau)$ . Define the new parameter vector as ${\boldsymbol{\psi}}(c,\tau)=({\boldsymbol{\theta}}_{y}^{\rm T},\bm{\delta}(c)^{\rm T},{\boldsymbol{\mu}}(c)^{\rm T},{\boldsymbol{\sigma}}^{2}(c)^{\rm T},\lambda_{\mu},\lambda_{\sigma})^{\rm T}$ such that

[TABLE]

and

[TABLE]

Denote the new parameter space as $\bar{\Theta}_{\psi,C+1}$ and partition ${\boldsymbol{\psi}}$ into $({\boldsymbol{\eta}}^{\rm T},\bm{\lambda}^{\rm T})^{\rm T}$ where

[TABLE]

The reduced model is uniquely parameterized by ${\boldsymbol{\theta}}^{*}\in{\cal N}_{C+1}(c,\tau)$ , and it is reparameterized as ${\boldsymbol{\psi}}^{*}=\{({\boldsymbol{\eta}}^{*})^{\rm T},0,0\}^{\rm T}$ , or more specifically ${\boldsymbol{\theta}}_{y}={\boldsymbol{\theta}}_{y,0}$ , $\bm{\lambda}^{*}=\bm{0}$ and $\bm{\delta}^{*}(c)=(\pi_{1,0},\pi_{2,0},\ldots,\pi_{C-1,0})^{\rm T}$ , ${\boldsymbol{\mu}}^{*}(c)=(\mu_{1,0},\mu_{2,0},\ldots,\mu_{C,0})^{\rm T}$ , ${\boldsymbol{\sigma}}^{2*}(c)=(\sigma^{2}_{1,0},\sigma^{2}_{2,0},\ldots,\sigma^{2}_{C,0})^{\rm T}$ . The benefit of the reparameterization (22) is that, to test if the $c$ th component can be further split, we can equivalently test if ${\boldsymbol{\lambda}}={\boldsymbol{0}}$ .

Define the score function with respect to ${\boldsymbol{\psi}}(c,\tau)$ as

[TABLE]

where

[TABLE]

Here, we use the shorthand notation $\zeta_{i}=\prod_{k=1}^{N_{i}}f(y_{ik}|{\boldsymbol{x}}_{ik},\gamma_{i};{\boldsymbol{\theta}}_{y})$ , $f_{c,i}^{*}=f_{c}(\gamma_{i}|\mu_{c,0},\sigma_{c,0})$ , $g_{i}^{*}=g(\gamma_{i}|{\boldsymbol{\theta}}_{\gamma}^{*})$ and $H_{ci}^{k*}=H^{k}\left(\frac{\gamma_{i}-\mu_{c,0}}{\sigma_{c,0}}\right)/(k!\sigma_{c,0}^{k})$ , where $H^{k}(\cdot)$ is the $k$ th Hermite polynomial.

Proposition 5.

Under $H_{0}:C_{0}=C$ and Assumptions 1-8 in the supplementary material,

[TABLE]

where $\bm{S}_{\lambda|\eta,n}^{(c)}=\bm{S}_{\lambda,n}^{(c)}-\bm{\mathcal{I}}^{(c)}_{\lambda\eta}\bm{\mathcal{I}}_{\eta}^{-1}\bm{S}_{\eta,n}$ , $\bm{{\cal I}}^{(c)}_{\lambda|\eta}=\bm{{\cal I}}^{(c)}_{\lambda}-\bm{{\cal I}}^{(c)}_{\lambda\eta}\bm{{\cal I}}_{\eta}^{-1}(\bm{{\cal I}}^{(c)}_{\lambda\eta})^{T}$ , $\bm{S}_{\eta,n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}{\boldsymbol{s}}_{\eta,i}$ , $\bm{S}_{\lambda,n}^{(c)}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}{\boldsymbol{s}}_{\lambda,i}^{(c)}$ , $\bm{{\cal I}}^{(c)}_{\lambda\eta}=E\{{\boldsymbol{s}}_{\bm{\lambda},i}^{(c)}{\boldsymbol{s}}_{{\boldsymbol{\eta}},i}^{\rm T}\}$ , ${\boldsymbol{\cal I}}_{\eta}=E({\boldsymbol{s}}_{\eta,i}{\boldsymbol{s}}_{\eta,i}^{\rm T})$ , and $\bm{{\cal I}}^{(c)}_{\lambda}=E\{{\boldsymbol{s}}_{\bm{\lambda},i}^{(c)}({\boldsymbol{s}}_{\bm{\lambda},i}^{(c)})^{\rm T}\}$ .

One can show $(\bm{S}_{\lambda|\eta,n}^{(c)})^{T}(\bm{{\cal I}}_{\lambda|\eta}^{(c)})^{-1}\bm{S}_{\lambda|\eta,n}^{(c)}\buildrel d\over{\longrightarrow}\chi^{2}(2)$ for each $c$ , but the score vectors $\bm{S}_{\lambda|\eta,n}^{(c)}$ are dependent among different $c$ ’s and hence the distribution of $\widetilde{T}_{C}$ in Proposition 5 is that of the maximum of a few correlated $\chi^{2}(2)$ random variables. In Appendix A, we describe a simulation method to evaluate this asymptotic distribution. This procedure only requires estimation of the covariance matrix of $\{{\boldsymbol{S}}_{\lambda|\eta,n}^{(c)},c=1,\ldots,C\}$ and simulating Gaussian random variables. It is extremely fast and fundamentally different from bootstrap, which requires fitting the model a large number of times to the bootstrap samples.
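The idea of simulating the maximum of correlated $\chi^{2}(2)$ variables can be sketched as follows. Appendix A is not reproduced here, so this Python sketch is only in the spirit of the description above and is not the authors' code. It assumes that the joint covariance matrix of the stacked vector $(\bm{S}_{\lambda|\eta,n}^{(1)},\ldots,\bm{S}_{\lambda|\eta,n}^{(C)})$ has already been estimated, is positive definite, and is ordered so that entries $2c$ and $2c+1$ (zero-based) belong to component $c$ . The function names are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def simulate_max_chi2(cov, n_draws=10000, seed=0):
    """Draw from the max over c of Z_c' Sigma_cc^{-1} Z_c with Z ~ N(0, cov)."""
    rng = random.Random(seed)
    dim = len(cov)
    chol = cholesky(cov)
    draws = []
    for _ in range(n_draws):
        e = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        z = [sum(chol[i][k] * e[k] for k in range(i + 1)) for i in range(dim)]
        best = 0.0
        for c in range(dim // 2):
            i, j = 2 * c, 2 * c + 1
            a, b, d = cov[i][i], cov[i][j], cov[j][j]
            # Quadratic form with the closed-form inverse of the 2x2 block.
            q = (d * z[i] ** 2 - 2.0 * b * z[i] * z[j] + a * z[j] ** 2) / (a * d - b * b)
            best = max(best, q)
        draws.append(best)
    return draws


def simulated_p_value(t_obs, draws):
    """Proportion of simulated maxima at least as large as the observed one."""
    return sum(d >= t_obs for d in draws) / len(draws)
```

No model is refitted inside the loop, which is why such a scheme is much cheaper than a bootstrap. With a single component and identity covariance the simulated distribution reduces to $\chi^{2}(2)$ , which gives a simple check of the code.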

4.2.3 Test Procedure

For any $C\geq 2$ , our test procedure for $H_{0}:C_{0}=C$ is as follows.

Step 0. Obtain $\widehat{\boldsymbol{\theta}}_{red}$ using penalty function (3) and $a_{n}=\frac{1}{n}$ , and evaluate $l_{n}(\widehat{\boldsymbol{\theta}}_{red})$ . Define subintervals $D_{1}=[\widehat{\gamma}_{min},\frac{\widehat{\mu}_{1,red}+\widehat{\mu}_{2,red}}{2}]$ , $D_{2}=(\frac{\widehat{\mu}_{1,red}+\widehat{\mu}_{2,red}}{2},\frac{\widehat{\mu}_{3,red}+\widehat{\mu}_{2,red}}{2}]$ , $\ldots D_{C}=(\frac{\widehat{\mu}_{C-1,red}+\widehat{\mu}_{C,red}}{2},\widehat{\gamma}_{max}]$ , where $\widehat{\gamma}_{min}$ and $\widehat{\gamma}_{max}$ are the minimum and maximum of the predicted ${\gamma}$ ’s.

Step 1. Obtain $\widehat{{\boldsymbol{\theta}}}_{full}(c,\tau)$ by maximizing the penalized likelihood in the restricted parameter neighborhood ${\cal N}_{C+1}(c,\tau)$ using the subintervals $\{D_{k}\}_{k=1}^{C}$ defined in Step 0. The penalty on $\sigma_{k}^{2}$ is $p_{n}(\sigma_{k}^{2},\widehat{\sigma}^{2}_{c^{\prime},red})$ if $\mu_{k}$ is restricted in $D_{c^{\prime}}$ , $k=1,\ldots,C+1$ , and $a_{n}$ is chosen according to equation (23) in Kasahara and Shimotsu (2015). If a $\mu_{k}$ steps outside of its range $D_{c^{\prime}}$ specified by ${\cal N}_{C+1}(c,\tau)$ during the EM iterations, we simply set it back to the nearest boundary of $D_{c^{\prime}}$ . To improve the chance that the maximum of $l_{pen}$ is reached, we repeat the EM algorithm 100 times using randomly selected initial values within ${\cal N}_{C+1}(c,\tau)$ .

Step 2. Using $\widehat{{\boldsymbol{\theta}}}_{full}(c,\tau)$ as the starting value, do two more EM iterations without fixing $\tau$ . Use the resulting estimator to evaluate $T_{C}(c,\tau)$ in (12).

Step 3. Repeat Steps 1 and 2 for each $c=1,2,\ldots,C$ , and for each $\tau\in{\cal T}=\{0.1,0.3,0.5\}$ , and evaluate $\widetilde{T}_{C}$ in (13).

Step 4. Evaluate the asymptotic null distribution in Proposition 5 using the procedure described in Appendix A and compare $\widetilde{T}_{C}$ with the null distribution to get the $p$ value.
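The bookkeeping in Steps 0 and 1, namely building the subintervals from midpoints of the reduced-model means and pushing a mean that leaves its interval back to the nearest boundary, can be sketched in Python as follows. The function names are hypothetical, and the sketch treats all intervals as closed for simplicity, which differs from the half-open intervals above only at the endpoints.

```python
def midpoint_intervals(mu_hat_red, gamma_min, gamma_max):
    """Return [(lower, upper), ...] for D_1, ..., D_C.

    mu_hat_red -- estimated component means from the reduced model
    gamma_min, gamma_max -- minimum and maximum of the predicted gamma's
    """
    mu = sorted(mu_hat_red)
    midpoints = [(a + b) / 2.0 for a, b in zip(mu, mu[1:])]
    cuts = [gamma_min] + midpoints + [gamma_max]
    return list(zip(cuts[:-1], cuts[1:]))


def clamp_to_interval(mu, interval):
    """Set a mean that stepped outside its interval back to the nearest boundary."""
    lower, upper = interval
    return min(max(mu, lower), upper)
```

For instance, `midpoint_intervals([-1.0, 0.0, 2.0], -3.0, 4.0)` gives the three intervals $[-3,-0.5]$ , $[-0.5,1]$ and $[1,4]$ , and a mean of 1.7 that is supposed to stay in the second interval is clamped to 1.0 after the M-step.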

4.3 Sequential Test to Determine the Order of the Latent Gaussian Mixture Model

Hypothesis tests are not designed for model selection, but can nevertheless be used for such a purpose in an exploratory study. One can determine the order of the latent Gaussian mixture model by sequentially testing $H_{01}:C_{0}=1$ , $H_{02}:C_{0}=2$ , $H_{03}:C_{0}=3$ , $\ldots$ , and declare $C_{0}=C^{\ast}$ if $H_{0C^{\ast}}$ is the first null hypothesis in the sequence that is not rejected. Such a procedure is obviously not a consistent model selection procedure, as we have a fixed chance to fail to reject a hypothesis. On the other hand, one can also argue that many widely used model selection procedures are not consistent, such as the Akaike Information Criterion. To control the family-wise error rate at $\alpha$ , one can adopt a Bonferroni procedure and set the sizes of the tests to be $\alpha/2$ , $\alpha/4$ , $\alpha/8$ , $\ldots$ .
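A minimal Python sketch of this sequential rule with the Bonferroni sizes $\alpha/2,\alpha/4,\ldots$ is given below. The function name `select_order` is hypothetical, and the $p$ -values are assumed to have been computed beforehand by the tests of Sections 4.1 and 4.2.

```python
def select_order(p_values, alpha=0.05):
    """Return the selected number of components, or None if undecided.

    p_values[j] is the p-value for H0: C0 = j + 1 against H1: C0 = j + 2.
    The j-th test is run at size alpha / 2**(j + 1); the first null hypothesis
    that is not rejected determines the order.
    """
    for j, p in enumerate(p_values):
        size = alpha / 2 ** (j + 1)
        if p >= size:
            return j + 1
    # Every supplied hypothesis was rejected: more tests are needed.
    return None
```

As an illustration, `select_order([0.0016, 0.4076])` rejects $C_{0}=1$ at size 0.025, does not reject $C_{0}=2$ at size 0.0125, and therefore returns 2.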

5 Transplant Center Evaluation with False Discovery Rate Control

One important goal of our study is to provide a ranking for the transplant centers. The evaluation is based on the value of the latent variable $\gamma$ , which represents the care quality of a center. For methodology development, we first assume that the number of mixture components $C_{0}$ is correctly specified and all parameters in the latent Gaussian mixture model are known.

Following Efron (2004), we identify the “empirical null” distribution of $\gamma$ as a subset of components in the mixture density, $g_{0}(\gamma|{\boldsymbol{\theta}}_{\gamma})=\sum_{c\in{\cal C}_{0}}\pi_{c}f_{c}(\gamma|\mu_{c},\sigma_{c})/\sum_{c\in{\cal C}_{0}}\pi_{c}$ where ${\cal C}_{0}\subset\{1,2,\ldots,C\}$ . For each transplant center $i$ , we will test if this center belongs to one of the components in ${\cal C}_{0}$ , or $H_{i0}:\sum_{c\in{\cal C}_{0}}L_{ic}=1$ , $i=1,\ldots,n$ . If ${\cal C}_{0}$ consists of centers of average performance, then center $i$ is considered “interesting” (either outperforming or underperforming) if $H_{i0}$ is rejected.

Since $\gamma_{i}$ is not directly observed, our decision rule for $H_{i0}$ is based on the observed data ${\boldsymbol{X}}_{i}$ and ${\boldsymbol{Y}}_{i}$ , denoted as $\delta_{i}=\delta({\boldsymbol{X}}_{i},{\boldsymbol{Y}}_{i};{\boldsymbol{\theta}})$ , where $\delta_{i}=1$ means center $i$ is “interesting” and $\delta_{i}=0$ otherwise. The false discovery rate is defined as

[TABLE]

When $\gamma_{i}$ ’s are observed, Sun and Cai (2007) show that the oracle decision rule is based on the local FDR, $T_{\rm OR}(\gamma_{i})=P(\sum_{c\in{\cal C}_{0}}L_{ic}=1|\gamma_{i})=\sum_{c\in{\cal C}_{0}}\pi_{c}f_{c}(\gamma_{i})/g(\gamma_{i})$ . In our case, $\gamma_{i}$ is not observed, and the local FDR is defined as

[TABLE]

It is easy to show $lFDR_{i}=E\{T_{\rm OR}(\gamma_{i})|{\boldsymbol{X}}_{i},{\boldsymbol{Y}}_{i}\}$ . Following Sun et al. (2015), the multiple hypotheses testing problem is related to a classification problem with the loss function

[TABLE]

where $\lambda$ is a penalty for false positive. Let $\mathscr{R}=E\{\mathscr{L}({\boldsymbol{L}},{\boldsymbol{\delta}})\}$ be the risk of the classification problem, and by Theorem 1 of Sun et al. (2015), the optimal decision rule that minimizes this risk is $\delta_{i}=I(lFDR_{i}<t)$ for some threshold $t$ .

Let ${lFDR}_{(1)}\leq{lFDR}_{(2)}\leq\cdots\leq{lFDR}_{(n)}$ be the ranked lFDR values. For any $\alpha>0$ , let $k=\max_{i}\{{1\over i}\sum_{j=1}^{i}lFDR_{(j)}\leq\alpha\}$ . Our FDR control procedure is to reject all $H_{i0}$ for which the rank of $lFDR_{i}$ is less than or equal to $k$ .

Proposition 6.

Under the model in (1), the above procedure controls FDR at level $\alpha$ .

A sketch of the proof of Proposition 6 is provided in Section S.6 of the supplementary material. In practice, $lFDR$ is estimated by substituting ${\boldsymbol{\theta}}$ with its estimator and the integrals in (29) are evaluated using Gaussian quadrature as described above.
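The ranking-based rejection rule described before Proposition 6 can be sketched in Python as follows. The function name `lfdr_reject` is hypothetical, and the lFDR values are assumed to have been estimated already.

```python
def lfdr_reject(lfdr, alpha):
    """Return the sorted indices of the hypotheses rejected by the procedure.

    lfdr  -- estimated local FDR value for each center
    alpha -- target FDR level
    k is the largest rank such that the running mean of the k smallest
    lFDR values does not exceed alpha.
    """
    order = sorted(range(len(lfdr)), key=lambda i: lfdr[i])
    running_sum = 0.0
    k = 0
    for rank, i in enumerate(order, start=1):
        running_sum += lfdr[i]
        if running_sum / rank <= alpha:
            k = rank
    return sorted(order[:k])
```

For example, with lFDR values `[0.9, 0.01, 0.5, 0.03, 0.2]` and $\alpha=0.1$ , the running means of the sorted values are 0.01, 0.02, 0.08 and 0.185, so $k=3$ and the centers with indices 1, 3 and 4 are flagged.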

6 Simulation Studies

We conduct simulation studies to examine the numerical performance of the proposed estimation procedure and the validity and power of the proposed tests in choosing the order of the latent Gaussian mixture model.

6.1 Simulation 1: Estimation and Random Effect Prediction

We simulate data for $n=282$ transplant centers, which is the number of kidney transplant centers in OPTN in 2008. The number of patients per center has a highly skewed distribution in the real data. To mimic such a distribution, we generate $N_{i}$ as the integer part of the sum of $Poisson(5)$ and $Exponential(45)$ . The response $Y_{ik}$ is a binary variable generated using (1) with $P(Y_{ik}=1)=\{1+\exp(-\xi_{ik})\}^{-1}$ , where $\xi_{ik}={\boldsymbol{X}}_{ik}^{\rm T}{\boldsymbol{\beta}}+\gamma_{i}$ . ${\boldsymbol{X}}$ is generated from bivariate standard normal and $\bm{\beta}=(1,1)^{\rm T}$ . In the following subsections, we generate $\gamma_{i}$ ’s from Gaussian mixture models with different orders.
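The data-generating mechanism above can be sketched in Python using only the standard library. Two assumptions are made and should be read as such: the exponential distribution is taken to have mean 45, and the mixture parameters passed in the usage example are placeholders rather than the values of any model in the paper. The function names are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw from a Poisson distribution by Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_center(rng, beta, pis, mus, sigmas):
    """Simulate one center: its latent effect, covariates and binary outcomes."""
    # Integer part of Poisson(5) + Exponential with mean 45 (assumed scale).
    n_i = int(poisson(rng, 5.0) + rng.expovariate(1.0 / 45.0))
    # Latent center effect from the Gaussian mixture.
    c = rng.choices(range(len(pis)), weights=pis)[0]
    gamma = rng.gauss(mus[c], sigmas[c])
    xs, ys = [], []
    for _ in range(n_i):
        x = [rng.gauss(0.0, 1.0) for _ in beta]  # independent standard normals
        xi = sum(b * v for b, v in zip(beta, x)) + gamma
        prob = 1.0 / (1.0 + math.exp(-xi))
        xs.append(x)
        ys.append(1 if rng.random() < prob else 0)
    return gamma, xs, ys
```

A data set with 282 centers could then be generated as `[simulate_center(rng, [1.0, 1.0], [0.3, 0.7], [-1.0, 1.0], [0.3, 0.3]) for _ in range(282)]` with `rng = random.Random(0)`, where the mixture values are purely illustrative.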

6.1.1 Two-Component Model

We first generate $\gamma_{i}$ ’s from a two-component Gaussian mixture model

[TABLE]

The parameters in Model 1 are selected such that the marginal probability of $\{Y_{ik}=1\}$ is roughly the same as in the real data. We repeat the simulation 200 times and apply the estimation procedure in Section 3 to each simulated data set. The mixture components in the estimated model are ranked according to the value of $\widehat{\mu}_{c}$ to avoid the cluster label switching problem. The results for parameter estimation under the correctly specified number of components are summarized in Table 1. As we can see, all estimators perform well: the biases are much smaller than the standard deviations, which is consistent with our estimators being asymptotically unbiased.

To illustrate the drawback of mis-specifying the random effect distribution, we also fit a common GLMM to the simulated data under the assumption that $\gamma_{i}$ ’s are i.i.d. Gaussian. Figure 1 illustrates the results in a typical simulation run. The upper panel shows the results of a common GLMM, and the lower panel shows the results of the proposed model. In both panels, we compare the true density of $\gamma$ with the estimated density using the fitted model and the kernel density of the predicted $\gamma$ using the fitted model. As we can see from the upper panel, prediction under the mis-specified Gaussian random effect assumption suffers from a shrinkage effect: the values of $\widehat{\gamma}$ are pushed towards the center of the distribution, so that the posterior distribution resembles the shape of a Gaussian distribution. The lower panel shows that prediction under our proposed model does not suffer from such a shrinkage effect. Our model recovers the shape of the latent variable distribution and produces better predictions. In Table 3, we also report the mean square prediction error for the random effect averaged over the 200 simulation runs and the Monte Carlo standard deviation of the prediction error. As we can see, when the random effect distribution is mis-specified as Gaussian, the fitted model yields a much larger prediction error.

6.1.2 Three-Component Model

We repeat the simulation study while generating $\gamma_{i}$ ’s from the following three-component Gaussian mixture model

[TABLE]

We repeat the simulation 200 times and perform the proposed estimation procedure under the correctly specified order of the mixture; the estimation results are summarized in Table 2. We can see that the estimation results are quite reasonable: all biases are virtually zero; the standard errors for component means ( $\mu_{c}$ ) and component standard deviations ( $\sigma_{c}$ ) are slightly inflated compared with Table 1, which is understandable since we are fitting a more complicated mixture model; the standard errors for ${\boldsymbol{\beta}}$ are not affected by the increased complexity of the latent mixture model.

In Table 3, we also present the mean square prediction error of the proposed model averaged over 200 simulation runs, the Monte Carlo standard deviation of the prediction error, and the same quantities under GLMM with Gaussian random effects. As we can see, the common GLMM with the Gaussian assumption has a much larger prediction error than the proposed model. The gap between the prediction errors from the two models is even bigger than for Model 1, because Model 2 is even more heterogeneous.

6.2 Simulation 2: Hypothesis Tests

Next, we investigate the validity and power for the proposed tests in Section 4.

6.2.1 Asymptotic Null Distributions

We generate simulated data under similar settings as in Simulation 1, while $\gamma_{i}$ ’s are generated from three models: Model 1, Model 2 and

[TABLE]

The three models represent latent Gaussian mixture models with orders 1 to 3. We generate 200 simulated data sets under each of the three models, and compute $\widetilde{T}_{1}$ in data under Model 0, $\widetilde{T}_{2}$ under Model 1 and $\widetilde{T}_{3}$ under Model 2. The empirical distributions of the three quantities represent the null distributions of the test statistics under the null hypotheses $C_{0}=1,2$ and $3$ respectively. These empirical distributions are provided in Figure 2 and compared with the asymptotic distributions provided in Section 4. In each panel of Figure 2, the dashed curve is the kernel density based on 200 replicates of the test statistic and the solid curve is the asymptotic distribution. Note that the asymptotic distributions for $\widetilde{T}_{2}$ and $\widetilde{T}_{3}$ are based on 10,000 simulations using the procedure described in Appendix A. As we can see, the empirical distributions of the test statistics are remarkably close to the asymptotic distributions, which supports the validity of the proposed tests.

6.2.2 Power of the tests

Next, we illustrate the power of the tests. The response $Y$ is generated the same way as in Section 6.1, while $\gamma$ is generated from the following two models

[TABLE]

Compared with Models 1 and 2 considered in Section 6.1, the individual components in Models 3 and 4 are less separated, making it harder to detect the real order of these models especially when $\gamma$ is an unobserved latent variable.

To examine the power of the homogeneity test in Section 4.1, we compute $\widetilde{T}_{1}$ in 200 simulated data sets where $\gamma_{i}$ ’s are simulated from Model 3, and summarize the results in Figure 3. In the top panel of Figure 3, we illustrate the true density of $\gamma$ under Model 3; in the bottom panel, we compare the empirical distribution of $\widetilde{T}_{1}$ with its asymptotic distribution under $H_{0}:C_{0}=1$ . If we perform a 5% test based on the asymptotic $\chi^{2}(2)$ distribution, the power of the homogeneity test is 91% under this scenario.

To examine the power of the locally restricted likelihood ratio test proposed in Section 4.2, we test $H_{0}:C_{0}=2$ vs $H_{1}:C_{0}=3$ , while $\gamma_{i}$ ’s are simulated from Model 4. In Figure 4, we illustrate the true density of $\gamma$ under Model 4, and compare the empirical distribution of $\widetilde{T}_{2}$ over 200 simulation runs with its asymptotic null distribution. The empirical power of the proposed test is 95.5%.

We have also examined the power of the homogeneity test when $\gamma_{i}$ ’s are simulated from Model 1 and the power of the test on $H_{0}:C_{0}=2$ when $\gamma_{i}$ ’s are generated from Model 2. The power in both of these cases is virtually equal to 1.

Since a sequential test can be used for model selection, it is of interest to compare the test-based procedure with other model selection procedures such as the Bayesian information criterion (BIC), which is the negative log likelihood for the observed data plus a penalty of $\hbox{log}(n)$ times the number of free parameters in the model. We apply BIC to simulated data under both Models 3 and 4. For Model 3, BIC picks the correct model with 2 components in 39% of the 200 simulations and chooses a 1-component model for the remaining 61% of the repetitions. This means that if we use BIC as the decision rule to test $H_{0}:C_{0}=1$ under Model 3, it has only 39% power, which is much lower than that of the test we developed. For Model 4, BIC chooses the correct 3-component model in 50.5% of the 200 simulations and chooses 1 or 2 components in the other 49.5% of runs. On the other hand, the sequential test procedure with $\alpha=0.05$ chooses the correct number of components $88.5\%$ of the time for Model 3, and $86\%$ of the time for Model 4.
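For comparison, order selection by BIC can be sketched in Python as below. The sketch uses the common textbook scaling, $-2$ times the log likelihood plus $\log(n)$ times the number of free parameters, which may differ from the scaling used in the paper by a constant factor that does not change the selected model. The parameter count $p+3C-1$ ( $p$ regression coefficients, $C$ means, $C$ standard deviations and $C-1$ free proportions) is an assumption of this sketch, as are the function names.

```python
import math


def bic(loglik, n_params, n):
    """Textbook BIC: -2 * log-likelihood + log(n) * number of free parameters."""
    return -2.0 * loglik + math.log(n) * n_params


def select_by_bic(loglik_by_order, n, p):
    """Return the number of components with the smallest BIC.

    loglik_by_order -- dict mapping C to the maximized log-likelihood
    n -- sample size used in the penalty
    p -- number of regression coefficients
    """
    scores = {c: bic(ll, p + 3 * c - 1, n) for c, ll in loglik_by_order.items()}
    return min(scores, key=scores.get)
```

With hypothetical log-likelihoods `{1: -1000.0, 2: -990.0, 3: -989.0}`, `n=282` and `p=2`, the improvement from two to three components is too small to offset the extra penalty, so the two-component model is chosen.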

7 Data Analysis

Our motivating data are obtained from the Organ Procurement and Transplantation Network (OPTN), administered under a contract with the U.S. Department of Health and Human Services (HHS). The OPTN data system includes data on all donors, wait-listed candidates, and transplant recipients in the US. Included in the analysis are adult renal failure patients ( $\geq 18$ years of age) who underwent deceased donor kidney transplantation between January $1987$ and December $2008$ . This cohort includes $N=269,386$ patients receiving kidney transplants from a total of $n=296$ centers. The number of transplants performed by a center, $N_{i}$ , has a highly skewed distribution as illustrated in Figure 5. Most centers performed a few hundred kidney transplants, but some centers performed over 5000. The patient level response is the 5-year survival status (1=death and -1=survival) and there is no censoring due to routine and rigorous tracking of the patients. The overall failure rate within 5 years of transplantation is $27.59\%$ .

An important patient level covariate that is directly related to the success of kidney transplant is $x_{1}=$ cold ischemic time, which is the time that the donor kidney was kept in a refrigerator before being received by the patient. Other patient level covariates include $x_{2}=$ age at transplantation and $x_{3}=$ sex of the patient (1=male, 0=female); $x_{4}$ – $x_{6}$ are indicators for BMI in the intervals (22, 25], (25, 30] and 30+ respectively. Since the data were collected over a time span of two decades, it is possible that the technology used in transplant surgeries has been improving over time, which also affects the patient level outcome. Therefore, we also include time effects in the model in addition to the other covariates described above. Using cases before 1990 as baseline, covariates $x_{7}$ – $x_{10}$ are indicators for cases performed in 1990–1994, 1995–1999, 2000–2003 and 2004–2008 respectively.

7.1 Model Fitting

We fit the proposed GLMM model to the OPTN data, using a random effect following a Gaussian mixture distribution to represent the care quality of a center.

Using the proposed test procedure to decide the order of the latent Gaussian mixture model, we obtain a $p$ -value of 0.0016 for $H_{0}:C_{0}=1$ vs. $H_{1}:C_{0}=2$ , and 0.4076 for $H_{0}:C_{0}=2$ vs. $H_{1}:C_{0}=3$ . We conclude that the care quality among the kidney transplant centers is not homogeneous and that the distribution of the random effect is adequately described by a two-component Gaussian mixture. The estimated fixed effects under our final model are summarized in Table 4, where the standard errors are obtained using the asymptotic expansion (S.22). As we can see, all covariates considered are significant. Since we code $Y=1$ as death, the results in Table 4 imply that the patient death rate is higher if the donor kidney is not delivered to the patient fast enough, that older patients have a higher death rate, that men have a higher death rate than women, and that higher BMI also leads to higher risk. The coefficients for $x_{7}$ – $x_{10}$ are negative and decreasing in their order, confirming that the overall death rate is decreasing over time.
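The sequential decision above amounts to stopping at the first null hypothesis that is not rejected. A minimal sketch, assuming the level $\alpha=0.05$ used in the simulations and a hypothetical function name:

```python
def sequential_order(p_values, alpha=0.05):
    """Select the number of mixture components from sequential test p-values.

    p_values[k] is the p-value for H0: C0 = k + 1 versus H1: C0 = k + 2.
    Returns the smallest C whose null hypothesis is not rejected, or None
    if every hypothesis in the list is rejected.
    """
    for k, p in enumerate(p_values):
        if p >= alpha:
            return k + 1
    return None

# p-values reported for the OPTN data.
order = sequential_order([0.0016, 0.4076])
```

With the two reported $p$ -values, the first hypothesis is rejected and the second is not, which gives the two-component model.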

The estimated Gaussian mixture model for the random effect $\gamma$ is

[TABLE]

The mixture density $g(\gamma)$ as well as the individual components are illustrated in Figure 6. The majority of the centers have rather similar care quality, but there is also a small cluster of transplant centers that have a lower death rate after taking into account all the patient level covariates. These are the centers that are outperforming the others. In Figure 7, we also compare the predicted random effects under the GLMM with Gaussian random effects and those under our latent Gaussian mixture model. As we can see, for the majority of the centers, the predicted $\gamma$ is almost the same under both models, but for the few centers in the left tail, the care quality effects are seriously shrunk towards the mean if we assume the random effect follows a homogeneous Gaussian distribution.

Since the second component is small, we also run additional simulations to confirm that our methodology really works under such situations. To mimic the real data, we simulate binary $Y_{ik}$ from a logistic GLMM using the covariates in the real data, set ${\boldsymbol{\beta}}$ at the estimated values in Table 4 and generate ${\boldsymbol{\gamma}}$ from the following mixture model

[TABLE]

We set $\pi_{2}$ to be 0.005, 0.01, 0.02 or 0.05, and simulate 200 data sets under each setting. The empirical powers for testing $H_{0}:C_{0}=1$ are 47%, 78.5%, 97.5% and 100% respectively. These results show that our method can detect a small component under the sample size of the real data and our discovery is likely to be true.
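The data-generating mechanism of this simulation can be sketched as below. The mixture means and standard deviations are hypothetical placeholders, not the values used in the paper; survival is coded 0 here for simplicity; and the level $\alpha=0.05$ in `empirical_power` is an assumption, since the level used for the reported powers is not stated in this section.

```python
import math
import random

def draw_gamma(pi2, mus, sigmas, rng):
    """Draw one center effect from a two-component Gaussian mixture with weights (1 - pi2, pi2)."""
    c = 1 if rng.random() < pi2 else 0
    return rng.gauss(mus[c], sigmas[c]), c

def simulate_center(x_rows, beta, gamma, rng):
    """Binary outcomes for one center under logit P(Y = 1) = x'beta + gamma."""
    outcomes = []
    for x in x_rows:
        eta = sum(b * v for b, v in zip(beta, x)) + gamma
        p_death = 1.0 / (1.0 + math.exp(-eta))
        outcomes.append(1 if rng.random() < p_death else 0)
    return outcomes

def empirical_power(p_values, alpha=0.05):
    """Share of simulated data sets in which H0: C0 = 1 is rejected at level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

rng = random.Random(2017)
mus, sigmas = (0.0, -1.0), (0.2, 0.2)   # hypothetical, NOT the paper's values
labels = [draw_gamma(0.05, mus, sigmas, rng)[1] for _ in range(20000)]
share_small = sum(labels) / len(labels)  # close to pi2 = 0.05
```

Repeating the simulation and the homogeneity test over 200 data sets and passing the resulting $p$ -values to `empirical_power` would give an estimate of the kind reported above.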

7.2 Performance Evaluation

Based on the fitted model for $\gamma$ in Figure 6, the majority of the centers provide similar care for their patients and the smaller component consists of transplant centers with a lower mortality rate, which means these centers outperform the rest. We take the empirical null distribution to be the bigger component of the fitted mixture model. Using the evaluation procedure described in Section 5, we find three transplant centers that outperform the rest. In Table 5, we list the id of the three outperforming centers, as well as their $lFDR$ , $\widehat{\gamma}$ , number of cases treated, and their averaged 5-year survival rate.
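For intuition, a local false discovery rate can be computed as the posterior probability that a center effect comes from the null component. The sketch below is a simplification that evaluates this probability at a point value of $\gamma$ ; the procedure of Section 5 may instead condition on the observed data for each center, and the mixture parameters used here are hypothetical rather than the fitted values.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def local_fdr(gamma, weights, mus, sigmas, null_index=0):
    """Probability that gamma was generated by the null component of the mixture."""
    parts = [w * normal_pdf(gamma, m, s) for w, m, s in zip(weights, mus, sigmas)]
    return parts[null_index] / sum(parts)

# Hypothetical two-component mixture: a large null component and a small left-tail component.
weights, mus, sigmas = (0.95, 0.05), (0.0, -1.0), (0.2, 0.2)
typical = local_fdr(0.0, weights, mus, sigmas)    # near 1: consistent with the null
extreme = local_fdr(-1.0, weights, mus, sigmas)   # near 0: flagged as outperforming
```

Centers with a small local FDR are the ones declared to differ from the majority.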

8 Summary

We propose a GLMM with latent Gaussian mixture random effects that provides a natural framework to model the inhomogeneity among transplant centers and to rank their care quality. We demonstrate that the predicted random effects can be seriously shrunk toward the mean if the distribution of the random effect is misspecified as Gaussian. This shrinkage effect is quite prominent for the centers in the tails of the population. The latent Gaussian mixture model is not strongly identifiable and suffers from a slow convergence rate when the number of mixture components is larger than the truth. We develop test procedures to decide the number of mixture components. Even though the proposed tests are designed mainly for testing scientific claims and providing uncertainty assessments, they can also be used for model selection, and our simulation results in Section 6.2.2 suggest that the sequential test procedure outperforms a naive BIC. Developing a consistent model selection procedure for the latent Gaussian mixture model is left for future work.

The proposed test procedures are computationally intensive, especially when analyzing large medical data sets like the OPTN data, because we have to try hundreds of initial values to find the largest likelihood ratio. These computations are best handled using parallel computing. We have developed a software package, LatentGaussianMixtureModel, written in Julia (http://julialang.org/), a high-level, high-performance dynamic programming language. Our package is based on open source math libraries and supports parallel computing. We will make the package available on the corresponding author’s website.

Even though comparing transplant centers using five-year survival rates of the patients has been the standard in the health policy literature, we acknowledge that survival time is a more informative response variable. Extending the latent Gaussian mixture model to survival outcomes is also a topic for our future research.

Appendix A: Simulation Approach for the Asymptotic Distribution in Proposition 5

We use the following procedure to simulate the asymptotic distribution in Proposition 5 under the hypothesis $H_{0}:C_{0}=C$ .

Step 0. Fit a $C$ -component latent Gaussian mixture model and obtain the reduced model estimator $\widehat{{\boldsymbol{\theta}}}_{red}$ .

Step 1. Calculate $\tilde{{\boldsymbol{s}}}_{i}=({\boldsymbol{s}}_{{\boldsymbol{\eta}},i}^{\rm T},\tilde{{\boldsymbol{s}}}_{\bm{\lambda},i}^{\rm T})^{\rm T}$ with $\tilde{{\boldsymbol{s}}}_{\lambda,i}=\{({\boldsymbol{s}}_{\lambda,i}^{(1)})^{\rm T},({\boldsymbol{s}}_{\lambda,i}^{(2)})^{\rm T},\ldots({\boldsymbol{s}}_{\lambda,i}^{(C)})^{\rm T}\}^{\rm T}$ , where ${\boldsymbol{s}}_{\eta,i}$ and ${\boldsymbol{s}}_{\lambda,i}^{(c)}$ , $c=1,\ldots,C$ , are the score functions for the restricted full models defined in (27) evaluated at $\widehat{{\boldsymbol{\theta}}}_{red}$ . Let

[TABLE]

be the sample version of $\tilde{\bm{{\cal I}}}=E\tilde{{\boldsymbol{s}}}_{i}\tilde{{\boldsymbol{s}}}_{i}^{T}$ , and calculate $\tilde{\bm{I}}_{\lambda|\eta}=\tilde{\bm{I}}_{\lambda}-\tilde{\bm{I}}_{\lambda\eta}\bm{I}_{\eta}^{-1}(\tilde{\bm{I}}_{\lambda\eta})^{T}$ . To improve numerical stability, we check whether $\tilde{\bm{I}}$ is an ill-conditioned matrix. If so, we set the eigenvalues with small absolute values to a small positive number.

Step 2. Generate a random vector ${\boldsymbol{s}}=\left\{({\boldsymbol{s}}^{(1)})^{\rm T},({\boldsymbol{s}}^{(2)})^{\rm T},\ldots,({\boldsymbol{s}}^{(C)})^{\rm T}\right\}^{\rm T}\sim\hbox{Normal}(0,\tilde{\bm{I}}_{\lambda|\eta})$ . Let $\bm{I}_{\lambda|\eta}^{(c)}$ be the diagonal block of $\tilde{\bm{I}}_{\lambda|\eta}$ corresponding to ${\boldsymbol{s}}^{(c)}$ . Then

[TABLE]

has the same asymptotic distribution as $T_{C}(\tau)$ and $\widetilde{T}_{C}$ .

Step 3. Repeat Step 2 a large number of times and use the empirical distribution of $T_{C}^{\ast}$ to approximate the asymptotic distribution of $\widetilde{T}_{C}$ .
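Steps 1 to 3 can be sketched in a few lines of NumPy. The sketch rests on two assumptions that should be checked against the displayed formulas: that the stacked scores are ordered with the ${\boldsymbol{\eta}}$ block first followed by $C$ equally sized ${\boldsymbol{\lambda}}$ blocks, and that $T_{C}^{\ast}$ is the maximum over $c$ of the quadratic forms $({\boldsymbol{s}}^{(c)})^{\rm T}(\bm{I}_{\lambda|\eta}^{(c)})^{-1}{\boldsymbol{s}}^{(c)}$ , as suggested by the form of the statistic in Section S.5. Any additional constraint in the exact definition is not reproduced here, and the function names and the eigenvalue floor are hypothetical.

```python
import numpy as np

def stabilize(mat, floor=1e-8):
    """Symmetrize, then set eigenvalues with small absolute value to a small positive number."""
    sym = (mat + mat.T) / 2.0
    vals, vecs = np.linalg.eigh(sym)
    vals = np.where(np.abs(vals) < floor, floor, vals)
    return (vecs * vals) @ vecs.T

def simulate_null(scores, n_eta, n_components, n_draws=10000, seed=0):
    """Simulate draws of the assumed statistic T*_C.

    scores: (n, n_eta + d * C) array whose i-th row is the stacked score vector
    for center i, evaluated at the reduced model estimator.
    """
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    info = stabilize(scores.T @ scores / n)            # Step 1: sample information matrix
    I_eta = info[:n_eta, :n_eta]
    I_le = info[n_eta:, :n_eta]
    I_lam = info[n_eta:, n_eta:]
    cond = stabilize(I_lam - I_le @ np.linalg.solve(I_eta, I_le.T))
    draws = rng.multivariate_normal(np.zeros(cond.shape[0]), cond, size=n_draws)  # Step 2
    d = cond.shape[0] // n_components
    stats = np.full(n_draws, -np.inf)
    for c in range(n_components):
        block = slice(c * d, (c + 1) * d)
        s_c = draws[:, block]
        quad = np.einsum("ij,ij->i", s_c @ np.linalg.inv(cond[block, block]), s_c)
        stats = np.maximum(stats, quad)                # maximum over components c
    return stats                                       # Step 3: empirical distribution

# Toy input: random numbers standing in for the scores of 500 centers.
toy_scores = np.random.default_rng(3).normal(size=(500, 6))
null_draws = simulate_null(toy_scores, n_eta=2, n_components=2, n_draws=2000)
```

A $p$ -value for an observed statistic can then be approximated by the share of simulated draws that exceed it.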

Appendix S.1 Assumptions and Consistency of the Estimator

S.1.1 Assumptions

For simplicity, assume $N_{i}=n_{0}$ for $i=1,\ldots,n$ . Let $({\boldsymbol{X}},{\boldsymbol{Y}})$ be a generic copy of $({\boldsymbol{X}}_{i},{\boldsymbol{Y}}_{i})$ and have a density

[TABLE]

where ${\boldsymbol{y}}=(y_{1},\ldots,y_{n_{0}})^{\rm T}$ , ${\boldsymbol{x}}=({\boldsymbol{x}}_{1},\ldots,{\boldsymbol{x}}_{n_{0}})^{\rm T}$ and $f({\boldsymbol{x}})$ is the joint density of ${\boldsymbol{X}}$ . Define the metric

[TABLE]

where $\theta_{l}$ is the $l$ -th entry of ${\boldsymbol{\theta}}$ . All convergences in the parameter space are defined with respect to $\delta$ .

Assumptions 1–5 below are equivalent to those in Kiefer and Wolfowitz (1956) and Hathaway (1985) for the consistency result. Assumption 6 is a regularity assumption on the penalty function used in Chen, Tan and Zhang (2008) and Kasahara and Shimotsu (2015). Assumptions 7 and 8 are additional assumptions for Propositions 2 and 4 respectively.

Assumption 1.

$f({\boldsymbol{x}},{\boldsymbol{y}}|{\boldsymbol{\theta}})$ is a density (the Radon-Nikodym derivative of a probability measure) with respect to a $\sigma$ -finite measure $\mu$ on the space of $({\boldsymbol{x}},{\boldsymbol{y}})$ .

Assumption 2 (Continuity Assumption).

The definition of $f({\boldsymbol{x}},{\boldsymbol{y}}|{\boldsymbol{\theta}})$ can be extended to the closure of the parameter space $\bar{\Theta}_{C}$ such that, for any ${\boldsymbol{\theta}}^{*}$ in $\bar{\Theta}_{C}$ and any Cauchy sequence $\{{\boldsymbol{\theta}}_{1},{\boldsymbol{\theta}}_{2},\ldots\}\subset\bar{\Theta}_{C}$ , $f({\boldsymbol{x}},{\boldsymbol{y}}|{\boldsymbol{\theta}}_{i})\rightarrow f({\boldsymbol{x}},{\boldsymbol{y}}|{\boldsymbol{\theta}}^{*})$ if ${\boldsymbol{\theta}}_{i}\rightarrow{\boldsymbol{\theta}}^{*}$ .

Assumption 3.

For any ${\boldsymbol{\theta}}\in\bar{\Theta}_{C}$ and any $\rho>0$ , $\omega({\boldsymbol{x}},{\boldsymbol{y}}|{\boldsymbol{\theta}},\rho)$ is a measurable function of $({\boldsymbol{x}},{\boldsymbol{y}})$ , where

[TABLE]

the supremum being taken over all ${\boldsymbol{\theta}}^{\prime}$ in $\bar{\Theta}_{C}$ for which $\delta({\boldsymbol{\theta}}^{\prime},{\boldsymbol{\theta}})<\rho$ .

Assumption 4 (Identifiability Assumption).

Identify $\bar{\Theta}_{C}$ as the quotient topological space such that ${\cal F}$ defined in (10) is identified as a single point.

Assumption 5.

For any ${\boldsymbol{\theta}}^{\prime}$ in $\bar{\Theta}_{C}$ ,

[TABLE]

where $E_{\boldsymbol{\theta}}$ is the expectation under $f({\boldsymbol{x}},{\boldsymbol{y}}|{\boldsymbol{\theta}})$ .

Assumption 6.

The penalty function satisfies: (a) $\sup_{\sigma^{2}>0}\max\{0,p_{n}(\sigma^{2})\}=o(n)$ , $p_{n}(\sigma^{2})=o(n)$ for any fixed $\sigma^{2}$ ; (b) for any $\sigma\in(0,8/(nM)]$ , $p_{n}(\sigma^{2})\leq 5\{\ln(n)\}^{2}\ln(\sigma)$ for sufficiently large $n$ , where $M=\sup_{{\boldsymbol{x}},{\boldsymbol{y}}}f({\boldsymbol{y}}|{\boldsymbol{x}};{\boldsymbol{\theta}}_{0})$ ; (c) $p_{n}^{\prime}(\sigma^{2})=o_{p}(n^{1/4})$ for any fixed $\sigma^{2}$ .

Assumption 7.

When the true number of components is $C_{0}=1$ , assume that $\bm{{\cal I}}=E\bm{I}_{n}$ is a finite, positive definite matrix, where $\bm{I}_{n}$ is defined in (S.11).

Assumption 8.

When ${\boldsymbol{\theta}}\in\Theta_{C}$ , assume that $\bm{{\cal I}}^{(c)}$ defined in (S.21) is positive definite, for $c=1,2,\ldots,C$ .

Remarks:

1. The continuity assumption (Assumption 2) is not satisfied by the finite Gaussian mixture model on the boundary of the parameter space, since the likelihood diverges to $\infty$ if any $\sigma_{c}^{2}\to 0$ . That is the reason that Hathaway (1985) restricted the estimation to the interior of the parameter space. However, in our problem, the finite Gaussian mixture density $g(\gamma)$ is convolved with the proper density $f({\boldsymbol{y}}|{\boldsymbol{x}},\gamma)$ in (S.1). Since the integral is bounded, an unbounded likelihood is no longer a concern and the condition is satisfied even on boundary points of $\bar{\Theta}_{C}$ .

2. Assumption 4 is a modified version of the identifiability assumption in Kiefer and Wolfowitz (1956). The same assumption is used in Hathaway (1985). Under this assumption, the consistency result in Proposition 1 should be understood as consistent estimation of the mixture density.

S.1.2 Proof of Proposition 1

Using arguments similar to those in Chen, Tan and Zhang (2008), one can show that, as long as the penalty function satisfies Assumption 6, the maximizer of (7) is restricted to an interior region of the parameter space $\bar{\Theta}(\epsilon)=\{{\boldsymbol{\theta}}\in\bar{\Theta};\min_{c}\sigma_{c}^{2}\geq\epsilon\}$ for some positive constant $\epsilon$ . Since the penalty term is of order $o(n)$ , which is much smaller than the likelihood function, the maximum penalized likelihood estimator $\widehat{\boldsymbol{\theta}}$ in the restricted parameter space belongs to the class of modified maximum likelihood estimators in Kiefer and Wolfowitz (1956), and the strong consistency of $\widehat{\boldsymbol{\theta}}$ follows from their theory.

Appendix S.2 Proof of Proposition 2

Denote for convenience $\zeta_{i}=\prod_{k=1}^{n_{0}}f(y_{ik}|{\boldsymbol{x}}_{ik},\gamma_{i};{\boldsymbol{\theta}}_{y}).$ After fixing $\pi_{1}=\tau$ , the log likelihood is

[TABLE]

We adopt the re-parameterization of Kasahara and Shimotsu (2015),

[TABLE]

collect all parameters except $\tau$ into ${\boldsymbol{\psi}}(\tau)=({\boldsymbol{\eta}}^{\rm T},\bm{\lambda}^{\rm T})^{\rm T}$ , where ${\boldsymbol{\eta}}=({\boldsymbol{\theta}}_{y}^{\rm T},\nu_{\mu},\nu_{\sigma})^{\rm T}$ and $\bm{\lambda}=(\lambda_{\mu},\lambda_{\sigma})^{\rm T}$ . Denote $\bar{\Theta}_{\psi}(\tau)$ as the parameter space of ${\boldsymbol{\psi}}$ corresponding to $\bar{\Theta}_{2}(\tau)$ . Sometimes we suppress the dependence of ${\boldsymbol{\psi}}(\tau)$ on $\tau$ . Under the null hypothesis $C_{0}=1$ , $\lambda_{\mu}=\lambda_{\sigma}=0$ and the true parameter vector is ${\boldsymbol{\psi}}^{*}=(({\boldsymbol{\eta}}^{*})^{\rm T},0,0)^{\rm T}$ .

For any multivariate function $f({\boldsymbol{x}})$ , denote $\nabla_{{\boldsymbol{x}}^{k}}f$ as its $k$ -th derivative, which is a multidimensional array. By similar calculations as in Proposition C and equation (29) in the supplementary appendix of Kasahara and Shimotsu (2015), we can show

[TABLE]

Denote $g^{\ast}(\gamma)=g(\gamma;{\boldsymbol{\psi}}^{\ast})$ as the true density of $\gamma$ under the null hypothesis. Using a ninth order Taylor expansion of $l_{pen}$ around ${\boldsymbol{\psi}}^{\ast}$ as in Kasahara and Shimotsu (2015), we get the following local quadratic approximation to the penalized likelihood

[TABLE]

where ${\boldsymbol{t}}_{n}({\boldsymbol{\psi}},\tau)=({\boldsymbol{t}}_{{\boldsymbol{\eta}},n},{\boldsymbol{t}}_{\bm{\lambda},n})^{\rm T}$ , $\bm{S}_{n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bm{s_{i}}$ , $\bm{I}_{n}=\frac{1}{n}\sum_{i=1}^{n}\bm{s_{i}}\bm{s_{i}}^{\rm T}$ , ${\boldsymbol{s}}_{i}=({\boldsymbol{s}}_{{\boldsymbol{\eta}},i}^{\rm T},{\boldsymbol{s}}_{\bm{\lambda},i}^{\rm T})^{\rm T}$ , $\sigma_{c}^{2}({\boldsymbol{\psi}},\tau)$ is the variance as a function of ${\boldsymbol{\psi}}$ defined by the reparameterization in (S.10),

[TABLE]

Here,

[TABLE]

where $H^{k}(x)$ is the $k$ th order Hermite polynomial, e.g. $H^{0}(x)=1$ , $H^{1}(x)=x$ , $H^{2}(x)=x^{2}-1$ , $H^{3}(x)=x^{3}-3x$ and $H^{4}(x)=x^{4}-6x^{2}+3$ .
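The listed polynomials are the probabilists' Hermite polynomials, which satisfy the standard recurrence $H^{k+1}(x)=xH^{k}(x)-kH^{k-1}(x)$ . As a quick check, the following sketch (with a hypothetical function name) generates their coefficients from that recurrence.

```python
def hermite_coeffs(k):
    """Coefficients, lowest degree first, of the k-th probabilists' Hermite polynomial.

    Uses the recurrence H^{j+1}(x) = x * H^j(x) - j * H^{j-1}(x) with H^0 = 1, H^1 = x.
    """
    prev, curr = [1], [0, 1]
    if k == 0:
        return prev
    for j in range(1, k):
        shifted = [0] + curr                                 # coefficients of x * H^j
        padded = prev + [0] * (len(shifted) - len(prev))     # H^{j-1}, padded to equal length
        prev, curr = curr, [a - j * b for a, b in zip(shifted, padded)]
    return curr

h4 = hermite_coeffs(4)   # coefficients of x^4 - 6 x^2 + 3, lowest degree first
```

The output for $k=2,3,4$ reproduces the polynomials given above.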

By consistency of the estimator, we can focus on ${\boldsymbol{\psi}}$ such that $\|{\boldsymbol{\psi}}-{\boldsymbol{\psi}}^{\ast}\|=o_{p}(1)$ and hence $R_{n}({\boldsymbol{\psi}},\tau)=o_{p}(\|{\boldsymbol{t}}_{n}({\boldsymbol{\psi}},\tau)\|^{2})$ . By Assumption 6, $p_{n}^{\prime}(\sigma^{2})=o_{p}(n^{1/4})$ , and by (S.10)

[TABLE]

Therefore, $l_{pen}({\boldsymbol{\psi}},\tau)-l_{pen}({\boldsymbol{\psi}}^{*},\tau)$ is dominated by the quadratic function defined by the first two terms on the right hand side of (S.11). It is then easy to see that $\widehat{\boldsymbol{t}}_{n}={\boldsymbol{t}}_{n}\{\widehat{\boldsymbol{\psi}}(\tau),\tau\}$ , which maximizes $l_{pen}({\boldsymbol{\psi}},\tau)-l_{pen}({\boldsymbol{\psi}}^{*},\tau)$ , is given by

[TABLE]

Under Assumption 7, $\bm{{\cal I}}=E\bm{I}_{n}$ is a positive definite matrix. By the law of large numbers, ${\boldsymbol{I}}_{n}\to{\boldsymbol{\cal I}}$ in probability. On the other hand, by the central limit theorem, ${\boldsymbol{S}}_{n}\to\hbox{Normal}(\boldsymbol{0},{\boldsymbol{\cal I}})$ in distribution. Therefore, $\widehat{\boldsymbol{t}}_{n}\to\hbox{Normal}(\boldsymbol{0},{\boldsymbol{\cal I}}^{-1})$ in distribution, which also implies

[TABLE]

The convergence rate of $\widehat{\boldsymbol{\theta}}_{\gamma,full}(\tau)$ is determined by those of $\widehat{\lambda}_{\mu}$ and $\widehat{\lambda}_{\sigma}$ .

Appendix S.3 Proof of Proposition 3

Following arguments in Section S.2, we have

[TABLE]

in distribution, where $\bm{{\cal I}}=E\bm{I}_{n}$ . Under the full model, for any ${\boldsymbol{\psi}}$ such that ${\boldsymbol{t}}_{n}=O_{p}(1)$ , using the local quadratic approximation (S.11) we have

[TABLE]

Let $\widehat{\boldsymbol{\psi}}_{full}(\tau)$ be the maximizer of (S.11) under the full model with 2 components; it is the reparameterized version of $\widehat{\boldsymbol{\theta}}_{full}(\tau)$ . By (S.14), ${\boldsymbol{t}}_{n}\{\widehat{\boldsymbol{\psi}}_{full}(\tau)\}={\boldsymbol{\cal I}}^{-1}{\boldsymbol{S}}_{n}+o_{p}(1)$ and hence

[TABLE]

Partition ${\boldsymbol{S}}_{n}$ into $\left(\begin{array}[]{c}{\boldsymbol{S}}_{\eta,n}\\ {\boldsymbol{S}}_{\lambda,n}\end{array}\right)$ according to the partition of ${\boldsymbol{\psi}}$ . With a similar partition to ${\boldsymbol{\cal I}}$ , we have

[TABLE]

where $\bm{\mathcal{I}}_{\lambda|\eta}=\bm{\mathcal{I}}_{\lambda}-\bm{\mathcal{I}}_{\lambda\eta}\bm{\mathcal{I}}_{\eta}^{-1}\bm{\mathcal{I}}_{\eta\lambda}$ . Define

[TABLE]

and by simple algebra

[TABLE]
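The algebra behind this step is the standard partitioned-inverse identity ${\boldsymbol{S}}^{\rm T}{\boldsymbol{\cal I}}^{-1}{\boldsymbol{S}}-{\boldsymbol{S}}_{\eta}^{\rm T}{\boldsymbol{\cal I}}_{\eta}^{-1}{\boldsymbol{S}}_{\eta}={\boldsymbol{S}}_{\lambda|\eta}^{\rm T}{\boldsymbol{\cal I}}_{\lambda|\eta}^{-1}{\boldsymbol{S}}_{\lambda|\eta}$ , where we assume the usual definition ${\boldsymbol{S}}_{\lambda|\eta}={\boldsymbol{S}}_{\lambda}-{\boldsymbol{\cal I}}_{\lambda\eta}{\boldsymbol{\cal I}}_{\eta}^{-1}{\boldsymbol{S}}_{\eta}$ . The following numerical check uses a randomly generated positive definite matrix as a stand-in for the information matrix; the dimensions and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
p_eta, p_lam = 3, 2
A = rng.normal(size=(p_eta + p_lam, p_eta + p_lam))
info = A @ A.T + np.eye(p_eta + p_lam)    # positive definite stand-in for the information matrix
S = rng.normal(size=p_eta + p_lam)        # stand-in for the normalized score vector

# Partition according to (eta, lambda).
I_eta, I_el = info[:p_eta, :p_eta], info[:p_eta, p_eta:]
I_le, I_lam = info[p_eta:, :p_eta], info[p_eta:, p_eta:]
S_eta, S_lam = S[:p_eta], S[p_eta:]

# Schur complement and the projected score (assumed definition of S_{lambda|eta}).
I_cond = I_lam - I_le @ np.linalg.solve(I_eta, I_el)
S_cond = S_lam - I_le @ np.linalg.solve(I_eta, S_eta)

full = S @ np.linalg.solve(info, S)                # quadratic form under the full model
reduced = S_eta @ np.linalg.solve(I_eta, S_eta)    # quadratic form under the reduced model
difference = S_cond @ np.linalg.solve(I_cond, S_cond)
```

The difference between the full and reduced quadratic forms equals the quadratic form in the projected score, which is why only ${\boldsymbol{S}}_{\lambda|\eta,n}$ and ${\boldsymbol{\cal I}}_{\lambda|\eta}$ appear in the limit of the test statistic.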

Under the reduced model, ${\boldsymbol{\lambda}}={\boldsymbol{0}}$ , and hence ${\boldsymbol{t}}_{\lambda n}={\boldsymbol{S}}_{\lambda n}={\boldsymbol{0}}$ . Using the same local quadratic approximation, for a parameter vector ${\boldsymbol{\psi}}_{red}$ in the reduced model,

[TABLE]

Let $\widehat{\boldsymbol{\psi}}_{red}$ be the estimator that maximizes the reduced model penalized likelihood, then ${\boldsymbol{t}}_{\eta n}(\widehat{\boldsymbol{\psi}}_{red})={\boldsymbol{\cal I}}_{\eta}^{-1}{\boldsymbol{S}}_{\eta n}+o_{p}(1)$ , and

[TABLE]

Combining (S.15), (S.17) and (S.18),

[TABLE]

Because ${\boldsymbol{S}}_{\lambda|\eta,n}$ and ${\boldsymbol{\cal I}}_{\lambda|\eta}$ do not depend on $\tau$ ,

[TABLE]

Appendix S.4 Proof of Proposition 4

Denote $\zeta_{i}=\prod_{k=1}^{n_{0}}f(y_{ik}|{\boldsymbol{x}}_{ik},\gamma_{i};{\boldsymbol{\theta}}_{y})$ as in Section S.2. Under the local reparameterization in ${\cal N}_{C+1}(c,\tau)$ defined in (22) and (26) in Section 4.2, the log likelihood is

[TABLE]

where

[TABLE]

The score function with respect to ${\boldsymbol{\psi}}(c,\tau)$ is ${\boldsymbol{s}}^{(c)}_{i}=({\boldsymbol{s}}_{{\boldsymbol{\eta}},i}^{\rm T},({\boldsymbol{s}}_{\bm{\lambda},i}^{(c)})^{\rm T})^{\rm T}$ , which is defined in (27). Define $\bm{S}_{n}^{(c)}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}{\boldsymbol{s}}_{i}^{(c)}$ , $\bm{I}_{n}^{(c)}=\frac{1}{n}\sum_{i=1}^{n}{\boldsymbol{s}}_{i}^{(c)}({\boldsymbol{s}}_{i}^{(c)})^{\rm T}$ and ${\boldsymbol{t}}_{n}({\boldsymbol{\psi}}(c,\tau),\tau)=({\boldsymbol{t}}_{{\boldsymbol{\eta}},n},{\boldsymbol{t}}_{\bm{\lambda},n})^{\rm T}$ where

[TABLE]

Similar to (S.11), we can derive a local quadratic approximation to the likelihood

[TABLE]

where $R_{n}({\boldsymbol{\psi}},\tau)=[O(\|{\boldsymbol{\psi}}-{\boldsymbol{\psi}}^{*}\|)+o(1)]\times O_{p}[\{1+\|{\boldsymbol{t}}_{n}({\boldsymbol{\psi}},\tau)\|^{2}\}]$ .

Put $\widehat{{\boldsymbol{\psi}}}_{full}(c,\tau)=\arg\max_{{\boldsymbol{\psi}}(c,\tau)\in\Theta_{\psi}(c,\tau)}l_{pen}\left({\boldsymbol{\psi}}(c,\tau),\tau\right)$ and $\widehat{{\boldsymbol{t}}}_{n}={\boldsymbol{t}}_{n}\left(\widehat{{\boldsymbol{\psi}}}_{full}(c,\tau),\tau\right)$ . Using similar arguments as in Section S.2, we can show that the penalty function is asymptotically negligible when ${\boldsymbol{\psi}}(c,\tau)$ is in a consistent neighborhood of ${\boldsymbol{\psi}}^{\ast}$ . Define

[TABLE]

which is positive definite under Assumption 8. It is then easy to see that

[TABLE]

By the definition of ${\boldsymbol{t}}_{n}\{{\boldsymbol{\psi}}(c,\tau),\tau\}$ , we get $\widehat{\boldsymbol{\eta}}-{\boldsymbol{\eta}}^{\ast}=O_{p}(n^{-1/2})$ , $\widehat{\lambda}_{\mu}=O_{p}(n^{-1/4})$ and $\widehat{\lambda}_{\sigma}=O_{p}(n^{-1/4})$ . Since the convergence rates for $\widehat{\mu}_{c,full}(c,\tau)$ , $\widehat{\mu}_{c+1,full}(c,\tau)$ , $\widehat{\sigma}_{c,full}(c,\tau)$ and $\widehat{\sigma}_{c+1,full}(c,\tau)$ are determined by $\widehat{\lambda}_{\mu}$ and $\widehat{\lambda}_{\sigma}$ , they converge to the true parameters at the slower $O_{p}(n^{-1/4})$ rate, while the rest of the parameters in $\widehat{\boldsymbol{\theta}}_{full}(c,\tau)$ converge at the $O_{p}(n^{-1/2})$ rate.

Appendix S.5 Proof of Proposition 5

We first derive the asymptotic properties for $T_{C}(c,\tau)$ . By (S.20) and (S.22),

[TABLE]

where ${\boldsymbol{S}}_{n}^{(c)}\buildrel d\over{\longrightarrow}\hbox{Normal}({\boldsymbol{0}},{\boldsymbol{\cal I}}^{(c)})$ by the central limit theorem.

Note that the reduced model estimator $\widehat{\boldsymbol{\psi}}_{red}(c,\tau)$ is obtained by maximizing the penalized likelihood while restricting $\lambda_{\mu}=\lambda_{\sigma}=0$ . By derivations similar to those under the full model, we get

[TABLE]

where ${\boldsymbol{S}}_{\eta,n}$ and ${\boldsymbol{\cal I}}_{\eta}$ are the sub-vector of ${\boldsymbol{S}}_{n}^{(c)}$ and the sub-matrix of ${\boldsymbol{\cal I}}^{(c)}$ , respectively, as defined in Proposition 5.

Using algebra similar to that in Section S.3, we get

[TABLE]

Therefore,

[TABLE]

Since none of the quantities $(\bm{S}_{\lambda|\eta,n}^{(c)})^{\rm T}(\bm{{\cal I}}_{\lambda|\eta}^{(c)})^{-1}\bm{S}_{\lambda|\eta,n}^{(c)}$ depends on $\tau$ , $\widetilde{T}_{C}$ that maximizes $T_{C}(\tau)$ over any set ${\cal T}$ has the same limiting distribution.

Appendix S.6 Proof of Proposition 6

The FDR for the described procedure is

[TABLE]

Appendix S.7 Computation Details

We now provide more details on the Gauss-Hermite approximation used in Section 3. The EM loss function is

[TABLE]

where

[TABLE]

Let $\{d_{m}\}_{m=1}^{M}$ and $\{w_{m}\}_{m=1}^{M}$ be Gauss-Hermite abscissas and weights, and denote $\gamma^{(c,m)}=\mu_{c}^{(t-1)}+\sqrt{2}\sigma_{c}^{(t-1)}d_{m}$ . The Gauss-Hermite approximation for $Q({\boldsymbol{\theta}}|{\boldsymbol{\theta}}^{(t-1)})$ is

[TABLE]

where

[TABLE]

as defined in (5). Maximizing $\widehat{Q}({\boldsymbol{\theta}}|{\boldsymbol{\theta}}^{(t-1)})$ with respect to different components of ${\boldsymbol{\theta}}$ results in the updating scheme in Section 3.
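The change of variable $\gamma^{(c,m)}=\mu_{c}+\sqrt{2}\sigma_{c}d_{m}$ turns an expectation under $\hbox{Normal}(\mu_{c},\sigma_{c}^{2})$ into a weighted sum over the quadrature nodes, $E\{h(\gamma)\}\approx\pi^{-1/2}\sum_{m=1}^{M}w_{m}h(\gamma^{(c,m)})$ . The sketch below illustrates this generic building block with NumPy's `hermgauss`; it is not the full EM update, and the function name and the choice $M=20$ are assumptions.

```python
import math
import numpy as np

def gh_expectation(h, mu, sigma, M=20):
    """Approximate E[h(gamma)] for gamma ~ Normal(mu, sigma^2) by M-point Gauss-Hermite quadrature."""
    d, w = np.polynomial.hermite.hermgauss(M)    # abscissas d_m and weights w_m
    gamma = mu + math.sqrt(2.0) * sigma * d      # nodes gamma^{(c,m)}
    return float(np.sum(w * h(gamma)) / math.sqrt(math.pi))

# Check against a known moment: E[gamma^2] = mu^2 + sigma^2.
second_moment = gh_expectation(lambda g: g ** 2, mu=0.5, sigma=2.0)

# A logistic-type integrand, as arises when a death probability is averaged over gamma.
mean_prob = gh_expectation(lambda g: 1.0 / (1.0 + np.exp(-g)), mu=0.0, sigma=1.0)
```

The rule is exact for polynomial integrands of degree up to $2M-1$ , so the second moment is recovered up to rounding error, while smooth integrands such as the logistic function are approximated.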

Bibliography

1. Ash, A. S., Fienberg, S. E., Louis, T. A., Normand, S.-L. T., Stukel, T. A. and Utts, J. (2012). Statistical Issues In Assessing Hospital Performance. Quantitative Health Sciences Pu
2. Booth, J. G. and Hobert, J. P. (1999). Maximizing Generalized Linear Mixed Model Likelihoods with an Automated Monte Carlo EM Algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61, 265–285. doi:10.1111/1467-9868.00176
3. Breslow, N. E. and Clayton, D. G. (1993). Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association 88, 9–25.
4. Caffo, B., An, M.-W. and Rohde, C. (2007). Flexible Random Intercept Models for Binary Outcomes Using Mixtures of Normals. Computational Statistics & Data Analysis 51, 5220–5235. doi:10.1016/j.csda.2006.09.031
5. Chen, J. (1995). Optimal Rate of Convergence for Finite Mixture Models. The Annals of Statistics 23, 221–233.
6. Chen, J. and Li, P. (2009). Hypothesis Test for Normal Mixture Models: The EM Approach. The Annals of Statistics 37, 2523–2542. doi:10.1214/08-AOS651
7. Chen, J., Li, P. and Fu, Y. (2012). Inference on the Order of a Normal Mixture. Journal of the American Statistical Association 107, 1096–1105. doi:10.1080/01621459.2012.695668
8. Chen, J., Tan, X. and Zhang, R. (2008). Inference for Normal Mixtures in Mean and Variance. Statistica Sinica 18, 443–465.
