Beyond the EM Algorithm: Constrained Optimization Methods for Latent Class Model

Hao Chen; Lanshan Han; Alvin Lim

arXiv:1901.02928·stat.ML·March 23, 2021

TL;DR

This paper introduces constrained optimization methods, specifically quasi-Newton techniques, as efficient alternatives to the EM algorithm for latent class models, achieving faster convergence and more accurate estimators.

Contribution

It proposes and evaluates quasi-Newton constrained optimization methods for latent class models, improving convergence speed and estimator accuracy over traditional EM algorithms.

Findings

01

Faster convergence than EM algorithm.

02

More accurate model estimators.

03

Effective in simulation studies.

Tables

Table 1: Example Bundle 1, $N=500$; the best result based on the log-likelihood among the 10 runs for each method.

(A) $d = 1, K = 2$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 335.29$	$- 335.25$	$- 335.25$	$- 335.25$
Number of Iterations	N.A.	$8$	$4$	$4$
(B) $d = 1, K = 3$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 344.49$	$- 344.45$	$- 344.45$	$- 344.45$
Number of Iterations	N.A.	$9$	$5$	$4$
(C) $d = 2, K = 2$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 661.30$	$- 659.75$	$- 659.75$	$- 659.75$
Number of Iterations	N.A.	$43$	$13$	$12$
(D) $d = 4, K = 2$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 1323.29$	$- 1323.91$	$- 1323.80$	$- 1324.46$
Number of Iterations	N.A.	$88$	$31$	$30$

Table 2: Example Bundle 2, $N=1000$; the best result based on the log-likelihood values among the 10 runs for each method.

(A) $d = 2, K = 2$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 1656.59$	$- 1654.86$	$- 1654.86$	$- 1655.06$
Number of Iterations	N.A.	$65$	$17$	$21$
(B) $d = 2, K = 3$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 1679.85$	$- 1677.68$	$- 1677.68$	$- 1677.85$
Number of Iterations	N.A.	$75$	$21$	$22$
(C) $d = 3, K = 2$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 2156.11$	$- 2153.60$	$- 2153.73$	$- 2153.74$
Number of Iterations	N.A.	$165$	$29$	$32$
(D) $d = 3, K = 3$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 2274.72$	$- 2272.53$	$- 2272.53$	$- 2272.70$
Number of Iterations	N.A.	$169$	$34$	$35$

Table 3: Example Bundle 3, $N=2000$; the best result based on the log-likelihood values among the 10 runs for each method.

(A) $d = 3, K = 3$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 3888.77$	$- 3887.30$	$- 3887.30$	$- 3887.30$
Number of Iterations	N.A.	$355$	$23$	$28$
(B) $d = 3, K = 4$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 3906.01$	$- 3905.22$	$- 3905.22$	$- 3905.38$
Number of Iterations	N.A.	$464$	$26$	$34$
(C) $d = 4, K = 3$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 5241.99$	$- 5237.39$	$- 5237.37$	$- 5237.41$
Number of Iterations	N.A.	$526$	$51$	$46$
(D) $d = 5, K = 3$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 6437.05$	$- 6431.05$	$- 6431.07$	$- 6431.07$
Number of Iterations	N.A.	$533$	$53$	$48$

Table 4: Example Bundle 4; the best result based on the log-likelihood values among the 10 runs for each method.

(A) $d = 4, K = 4$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 13105.82$	$- 13093.78$	$- 13093.98$	$- 13093.92$
Number of Iterations	N.A.	$837$	$43$	$42$
(B) $d = 4, K = 5$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 13335.47$	$- 13328.15$	$- 13327.94$	$- 13328.02$
Number of Iterations	N.A.	$852$	$48$	$40$
(C) $d = 5, K = 4$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 16336.15$	$- 16325.38$	$- 16325.18$	$- 16326.06$
Number of Iterations	N.A.	$1028$	$76$	$73$
(D) $d = 5, K = 5$
	True Parameters	EM	SQP	Projected QN
Log-Likelihood	$- 16684.59$	$- 16670.12$	$- 16669.97$	$- 16670.17$
Number of Iterations	N.A.	$1038$	$82$	$80$

Table 5: Performance of the three methods based on 10 runs for the application example.

	BayesLCA	EM	SQP	Projected QN
Log-Likelihood	$- 781.8063$	$- 745.7291$	$- 744.9672$	$- 746.8557$
Number of Iterations	N.A.	$302$	$44$	$50$

Table 6: Pairwise root mean squared error (RMSE) of the four methods considered for the application example.

	BayesLCA	EM	SQP	Projected QN
BayesLCA	$0$	$0.237$	$0.241$	$0.233$
EM	$0.237$	$0$	$0.029$	$0.046$
SQP	$0.241$	$0.029$	$0$	$0.045$
Projected QN	$0.233$	$0.046$	$0.045$	$0$

Table 7: Comparison of CPU time (in seconds).

	EM	SQP	Projected QN
CPU Time per Iteration	0.08	0.31	0.39
Number of Iterations	302	44	50
Overall Runtime	24.2	13.6	19.5

Table 8: True Weights and Categorical Parameters for Example Bundle 1, $d=1,K=2$.

	weights	$d = 1$
		$0$	$1$
$K = 1$	$0.5$	$0.4$	$0.6$
$K = 2$	$0.5$	$0.8$	$0.2$

Table 9: True Weights and Categorical Parameters for Example Bundle 1, $d=1,K=3$.

	weights	$d = 1$
		$0$	$1$
$K = 1$	$0.5$	$0.4$	$0.6$
$K = 2$	$0.3$	$0.8$	$0.2$
$K = 3$	$0.2$	$0.1$	$0.9$

Table 10: True Weights and Categorical Parameters for Example Bundle 1, $d=2,K=2$.

	weights	$d = 1$		$d = 2$
		$0$	$1$	$0$	$1$
$K = 1$	$0.5$	$0.4$	$0.6$	$0.1$	$0.9$
$K = 2$	$0.5$	$0.8$	$0.2$	$0.6$	$0.4$

Table 11: True Weights and Categorical Parameters for Example Bundle 1, $d=4,K=2$.

	weights	$d = 1$		$d = 2$		$d = 3$		$d = 4$
		$0$	$1$	$0$	$1$	$0$	$1$	$0$	$1$
$K = 1$	$0.5$	$0.4$	$0.6$	$0.1$	$0.9$	$0.5$	$0.5$	$0.6$	$0.4$
$K = 2$	$0.5$	$0.8$	$0.2$	$0.6$	$0.4$	$0.4$	$0.6$	$0.7$	$0.3$

Table 12: True Weights and Categorical Parameters for Example Bundle 2, $d=2,K=2$.

	weights	$d = 1$		$d = 2$
		$0$	$1$	$0$	$1$	$2$
$K = 1$	$0.4$	$0.1$	$0.9$	$0.8$	$0.1$	$0.1$
$K = 2$	$0.6$	$0.8$	$0.2$	$0.3$	$0.4$	$0.3$

Table 13: True Weights and Categorical Parameters for Example Bundle 2, $d=2,K=3$.

	weights	$d = 1$		$d = 2$
		$0$	$1$	$0$	$1$	$2$
$K = 1$	$0.4$	$0.1$	$0.9$	$0.8$	$0.1$	$0.1$
$K = 2$	$0.4$	$0.8$	$0.2$	$0.3$	$0.4$	$0.3$
$K = 3$	$0.2$	$0.6$	$0.4$	$0.5$	$0.3$	$0.2$

Table 14: True Weights and Categorical Parameters for Example Bundle 2, $d=3,K=2$.

	weights	$d = 1$		$d = 2$			$d = 3$
		$0$	$1$	$0$	$1$	$2$	$0$	$1$
$K = 1$	$0.4$	$0.1$	$0.9$	$0.8$	$0.1$	$0.1$	$0.6$	$0.4$
$K = 2$	$0.6$	$0.8$	$0.2$	$0.3$	$0.4$	$0.3$	$0.9$	$0.1$

Table 15: True Weights and Categorical Parameters for Example Bundle 2, $d=3,K=3$.

	weights	$d = 1$		$d = 2$			$d = 3$
		$0$	$1$	$0$	$1$	$2$	$0$	$1$
$K = 1$	$0.4$	$0.1$	$0.9$	$0.8$	$0.1$	$0.1$	$0.6$	$0.4$
$K = 2$	$0.4$	$0.8$	$0.2$	$0.3$	$0.4$	$0.3$	$0.9$	$0.1$
$K = 3$	$0.2$	$0.6$	$0.4$	$0.6$	$0.3$	$0.1$	$0.2$	$0.8$

Table 16: True Weights and Categorical Parameters for Example Bundle 3, $d=3,K=3$.

	weights	$d = 1$		$d = 2$		$d = 3$
		$0$	$1$	$0$	$1$	$0$	$1$
$K = 1$	$0.3$	$0.9$	$0.1$	$0.3$	$0.7$	$0.1$	$0.9$
$K = 2$	$0.4$	$0.2$	$0.8$	$0.5$	$0.5$	$0.55$	$0.45$
$K = 3$	$0.3$	$0.1$	$0.9$	$0.4$	$0.6$	$0.3$	$0.7$

Table 17: True Weights and Categorical Parameters for Example Bundle 3, $d=3,K=4$.

	weights	$d = 1$		$d = 2$		$d = 3$
		$0$	$1$	$0$	$1$	$0$	$1$
$K = 1$	$0.3$	$0.9$	$0.1$	$0.3$	$0.7$	$0.1$	$0.9$
$K = 2$	$0.2$	$0.2$	$0.8$	$0.5$	$0.5$	$0.55$	$0.45$
$K = 3$	$0.3$	$0.1$	$0.9$	$0.4$	$0.6$	$0.3$	$0.7$
$K = 4$	$0.2$	$0.5$	$0.5$	$0.9$	$0.1$	$0.2$	$0.8$

Table 18: True Weights and Categorical Parameters for Example Bundle 3, $d=4,K=4$.

	weights	$d = 1$		$d = 2$		$d = 3$		$d = 4$
		$0$	$1$	$0$	$1$	$0$	$1$	$0$	$1$
$K = 1$	$0.3$	$0.9$	$0.1$	$0.3$	$0.7$	$0.1$	$0.9$	$0.6$	$0.4$
$K = 2$	$0.2$	$0.2$	$0.8$	$0.5$	$0.5$	$0.55$	$0.45$	$0.5$	$0.5$
$K = 3$	$0.3$	$0.1$	$0.9$	$0.4$	$0.6$	$0.3$	$0.7$	$0.7$	$0.3$
$K = 4$	$0.2$	$0.5$	$0.5$	$0.9$	$0.1$	$0.2$	$0.8$	$0.5$	$0.5$

Table 19: True Weights and Categorical Parameters for Example Bundle 3, $d=5,K=3$.

	weights	$d = 1$		$d = 2$		$d = 3$		$d = 4$		$d = 5$
		$0$	$1$	$0$	$1$	$0$	$1$	$0$	$1$	$0$	$1$
$K = 1$	$0.3$	$0.9$	$0.1$	$0.3$	$0.7$	$0.1$	$0.9$	$0.6$	$0.4$	$0.7$	$0.3$
$K = 2$	$0.4$	$0.2$	$0.8$	$0.5$	$0.5$	$0.55$	$0.45$	$0.5$	$0.5$	$0.3$	$0.7$
$K = 3$	$0.3$	$0.1$	$0.9$	$0.4$	$0.6$	$0.3$	$0.7$	$0.9$	$0.1$	$0.2$	$0.8$

Table 20: True Weights and Categorical Parameters for Example Bundle 4, $d=4,K=4$.

	weights	$d = 1$		$d = 2$		$d = 3$		$d = 4$
		$0$	$1$	$0$	$1$	$0$	$1$	$0$	$1$
$K = 1$	$0.3$	$0.9$	$0.1$	$0.3$	$0.7$	$0.1$	$0.9$	$0.6$	$0.4$
$K = 2$	$0.2$	$0.2$	$0.8$	$0.5$	$0.5$	$0.55$	$0.45$	$0.5$	$0.5$
$K = 3$	$0.3$	$0.1$	$0.9$	$0.4$	$0.6$	$0.3$	$0.7$	$0.7$	$0.3$
$K = 4$	$0.2$	$0.5$	$0.5$	$0.9$	$0.1$	$0.2$	$0.8$	$0.5$	$0.5$

Table 21: True Weights and Categorical Parameters for Example Bundle 4, $d=4,K=5$.

	weights	$d = 1$		$d = 2$		$d = 3$		$d = 4$
		$0$	$1$	$0$	$1$	$0$	$1$	$0$	$1$
$K = 1$	$0.3$	$0.9$	$0.1$	$0.3$	$0.7$	$0.1$	$0.9$	$0.6$	$0.4$
$K = 2$	$0.2$	$0.2$	$0.8$	$0.5$	$0.5$	$0.55$	$0.45$	$0.5$	$0.5$
$K = 3$	$0.3$	$0.1$	$0.9$	$0.4$	$0.6$	$0.3$	$0.7$	$0.7$	$0.3$
$K = 4$	$0.1$	$0.5$	$0.5$	$0.9$	$0.1$	$0.2$	$0.8$	$0.5$	$0.5$
$K = 5$	$0.1$	$0.8$	$0.2$	$0.1$	$0.9$	$0.9$	$0.1$	$0.7$	$0.3$

Table 22: True Weights and Categorical Parameters for Example Bundle 4, $d=5,K=4$.

	weights	$d = 1$		$d = 2$		$d = 3$		$d = 4$		$d = 5$
		$0$	$1$	$0$	$1$	$0$	$1$	$0$	$1$	$0$	$1$
$K = 1$	$0.3$	$0.9$	$0.1$	$0.3$	$0.7$	$0.1$	$0.9$	$0.6$	$0.4$	$0.2$	$0.8$
$K = 2$	$0.2$	$0.2$	$0.8$	$0.5$	$0.5$	$0.55$	$0.45$	$0.5$	$0.5$	$0.8$	$0.2$
$K = 3$	$0.3$	$0.1$	$0.9$	$0.4$	$0.6$	$0.3$	$0.7$	$0.7$	$0.3$	$0.3$	$0.7$
$K = 4$	$0.2$	$0.5$	$0.5$	$0.9$	$0.1$	$0.2$	$0.8$	$0.5$	$0.5$	$0.9$	$0.1$

Table 23: True Weights and Categorical Parameters for Example Bundle 4, $d=5,K=5$.

	weights	$d = 1$		$d = 2$		$d = 3$		$d = 4$		$d = 5$
		$0$	$1$	$0$	$1$	$0$	$1$	$0$	$1$	$0$	$1$
$K = 1$	$0.3$	$0.9$	$0.1$	$0.3$	$0.7$	$0.1$	$0.9$	$0.6$	$0.4$	$0.4$	$0.6$
$K = 2$	$0.2$	$0.2$	$0.8$	$0.5$	$0.5$	$0.55$	$0.45$	$0.5$	$0.5$	$0.7$	$0.3$
$K = 3$	$0.3$	$0.1$	$0.9$	$0.4$	$0.6$	$0.3$	$0.7$	$0.7$	$0.3$	$0.4$	$0.6$
$K = 4$	$0.1$	$0.5$	$0.5$	$0.9$	$0.1$	$0.2$	$0.8$	$0.5$	$0.5$	$0.8$	$0.2$
$K = 5$	$0.1$	$0.8$	$0.2$	$0.1$	$0.9$	$0.9$	$0.1$	$0.7$	$0.3$	$0.9$	$0.1$

Equations

p(y) = \sum_{k=1}^{K} \eta_{k}\, p(y \mid \theta_{k}),

p(y \mid \theta_{k}) = \prod_{j=1}^{d} \prod_{l=1}^{c_{j}} \pi_{k,j,l}^{\mathcal{I}(y_{j}=l)},

\mathcal{I}(\mbox{P}) = \left\{\begin{array}{ll}1 & \mbox{if P is true;}\\ 0 & \mbox{if P is false.}\end{array}\right.

p(y \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \left( \eta_{k} \prod_{j=1}^{d} \prod_{l=1}^{c_{j}} \pi_{k,j,l}^{\mathcal{I}(y_{j}=l)} \right),

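L(\boldsymbol{\theta} \mid Y) = \sum_{i=1}^{N} \log\left( \sum_{k=1}^{K} \eta_{k} \prod_{j=1}^{d} \prod_{l=1}^{c_{j}} \pi_{k,j,l}^{\mathcal{I}(y^{i}_{j}=l)} \right)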

\begin{array}{rl}\displaystyle{\max_{\eta,\pi}}&\displaystyle{\sum_{i=1}^{N}\log\left(\sum_{k=1}^{K}\eta_{k}\prod_{j=1}^{d}\prod_{l=1}^{c_{j}}\pi_{k,j,l}^{\mathcal{I}(y^{i}_{j}=l)}\right)}\\[5pt] \mbox{s.t.}&\displaystyle{\sum_{k=1}^{K}\eta_{k}}\,=\,1,\\[5pt] &\displaystyle{\sum_{l=1}^{c_{j}}\pi_{k,j,l}}\,=\,1,\,\forall\,k=1,\cdots,K,\;j=1,\cdots,d,\\ &\eta_{k}\,\geq\,0,\,\forall\,k=1,\cdots,K,\\ &\pi_{k,j,l}\,\geq\,0,\,\forall\,k=1,\cdots,K,\;j=1,\cdots,d,\;l=1,\cdots,c_{j}.\end{array}

\mathcal{P}^{n} = \left\{ x \in \mathbb{R}^{n}_{+} \;\middle|\; \sum_{i=1}^{n} x_{i} = 1 \right\},

\boldsymbol{\theta}^{(0)} = \left( \eta_{k}^{(0)}, \pi_{k,j,l}^{(0)} \right)_{k=1,\cdots,K;\, j=1,\cdots,d;\, l=1,\cdots,c_{j}}

D_{ik}^{(t)} = \frac{\eta_{k}^{(t-1)} \prod_{j=1}^{d} \prod_{l=1}^{c_{j}} \left( \pi_{k,j,l}^{(t-1)} \right)^{\mathcal{I}(y_{j}^{i}=l)}}{\sum_{k'=1}^{K} \eta_{k'}^{(t-1)} \prod_{j=1}^{d} \prod_{l=1}^{c_{j}} \left( \pi_{k',j,l}^{(t-1)} \right)^{\mathcal{I}(y_{j}^{i}=l)}}.

\hat{\eta}_{k}^{(t)} = \frac{1}{N} \sum_{i=1}^{N} D_{ik}^{(t)}.

\pi_{k,j,l}^{(t)} = \frac{\sum_{i=1}^{N} D_{ik}^{(t)}\, \mathcal{I}(y_{j}^{i}=l)}{\sum_{l'=1}^{c_{j}} \sum_{i=1}^{N} D_{ik}^{(t)}\, \mathcal{I}(y_{j}^{i}=l')}.

\frac{\partial L}{\partial \eta_{k}} = \sum_{i=1}^{N} \frac{f(y^{i} \mid \theta_{k})}{\sum_{k'=1}^{K} \eta_{k'} f(y^{i} \mid \theta_{k'})}, \quad k = 1, \cdots, K.

\frac{\partial f(y^{i} \mid \pi_{k})}{\partial \pi_{k,j,l}} = \mathcal{I}(y_{j}^{i}=l) \prod_{j' \neq j} \prod_{\ell=1}^{c_{j'}} \pi_{k,j',\ell}^{\mathcal{I}(y_{j'}^{i}=\ell)},

\frac{\partial L}{\partial \pi_{k}} = \sum_{i=1}^{N} \frac{\eta_{k}}{\sum_{k'=1}^{K} \eta_{k'} f(y^{i} \mid \pi_{k'})} \frac{\partial f(y^{i} \mid \pi_{k})}{\partial \pi_{k}}.

\begin{array}{rl}\min_{x}&f(x)\\ \mbox{s.t.}&h_{j}(x)\,=\,0,\,\,\forall\,j=1,\cdots,m,\\ &x\,\in\,\mathcal{C}.\end{array}

x^{(t+1)} = x^{(t)} + \alpha_{t} d^{(t)},

q_{t}(x) = f(x^{(t)}) + (x - x^{(t)})^{T} g^{(t)} + \frac{1}{2} (x - x^{(t)})^{T} B^{(t)} (x - x^{(t)}),

\begin{array}{rl}z^{(t)}\,=\,\mbox{argmin}_{x}&q_{t}(x),\\ \mbox{s.t.}&h_{j}(x)\,=\,0,\,\,\forall\,j=1,\cdots,m,\\ &x\,\in\,\mathcal{C}.\end{array}

f(x^{(t)} + \alpha d^{(t)}) \leq f(x^{(t)}) + \nu \alpha (g^{(t)})^{T} d^{(t)},

\begin{array}{ll}\min&f(\boldsymbol{\theta})\\ \mbox{s.t.}&\boldsymbol{\theta}\in\mathcal{F},\end{array}

\begin{array}{rl}\min&\|y-x\|_{2}^{2}\\ \mbox{s.t.}&y\in S.\end{array}

q_{t}(\boldsymbol{\theta}) = f(\boldsymbol{\theta}^{(t)}) + (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)})^{T} g^{(t)} + \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)})^{T} B^{(t)} (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)}),

\begin{array}{rl}\boldsymbol{\vartheta}^{(t)}\,=\,\mbox{argmin}_{\boldsymbol{\theta}}&q_{t}(\boldsymbol{\theta}),\\ \mbox{s.t.}&\boldsymbol{\theta}\in\mathcal{F}.\end{array}

\nabla q_{t}(\boldsymbol{\theta}) = \nabla f(\boldsymbol{\theta}^{(t)}) + (B^{(t)})^{T} (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)}).

B^{(t+1)} = B^{(t)} - \frac{B^{(t)} s^{(t)} (s^{(t)})^{T} B^{(t)}}{(s^{(t)})^{T} B^{(t)} s^{(t)}} + \frac{y^{(t)} (y^{(t)})^{T}}{(y^{(t)})^{T} s^{(t)}},

B^{(t)} = \sigma_{t} I - N^{(t)} (M^{(t)})^{-1} (N^{(t)})^{T},

\hat{\boldsymbol{\theta}} \to N(\boldsymbol{\theta}, -B_{\boldsymbol{\theta}}^{-1}).

\begin{array}{rl}\displaystyle{\min_{x}}&f(x)\\[5pt] \mbox{s.t.}&c_{j}(x)\,=\,0,\;j=1,2,\cdots,m_{e},\\ &c_{j}(x)\,\geq\,0,\;j=m_{e}+1,m_{e}+2,\cdots,m,\\ &x_{l}\,\leq\,x\,\leq\,x_{u},\end{array}

L(x; \lambda) = f(x) - \sum_{j=1}^{m} \lambda_{j} c_{j}(x),

\begin{array}{rl}\displaystyle{\min_{x}}&\displaystyle{\frac{1}{2}}(x-x^{(t)})^{T}B^{(t)}(x-x^{(t)})+\nabla f(x^{(t)})^{T}(x-x^{(t)})\\[5pt] \mbox{s.t.}&(\nabla c_{j}(x^{(t)}))^{T}(x-x^{(t)})+c_{j}(x^{(t)})\,=\,0,\;j=1,2,\cdots,m_{e},\\ &(\nabla c_{j}(x^{(t)}))^{T}(x-x^{(t)})+c_{j}(x^{(t)})\,\geq\,0,\;j=m_{e}+1,m_{e}+2,\cdots,m.\end{array}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Beyond the EM Algorithm: Constrained Optimization Methods for Latent Class Model

Hao Chen (Corresponding Author. We thank the editor and the anonymous reviewer for suggestions that improved the manuscript.)

Research & Development, Precima, Chicago, IL USA 60631

Lanshan Han

Research & Development, Precima, Chicago, IL USA 60631

Alvin Lim

Research & Development, Precima, Chicago, IL USA 60631

Abstract

The latent class model (LCM), a finite mixture of categorical distributions, is one of the most widely used models in statistics and machine learning. Because of its non-continuous nature and flexibility in shape, researchers in areas such as marketing and the social sciences also frequently use the LCM to gain insights from their data. One likelihood-based method, the Expectation-Maximization (EM) algorithm, is often used to obtain the model estimators. However, the EM algorithm is well known for its slow convergence. In this research, we explore alternative likelihood-based methods that can potentially remedy the slow convergence of the EM algorithm. More specifically, we regard the likelihood-based approach as a constrained nonlinear optimization problem and apply quasi-Newton type methods to solve it. We examine two different constrained optimization methods to maximize the log-likelihood function. We present simulation study results to show that the proposed methods not only converge in fewer iterations than the EM algorithm but also produce more accurate model estimators.

(NOTE: the paper has been published online.)

keywords:

Constrained Optimization, Quasi-Newton’s Method, Quadratic Programming, EM Algorithm, Latent Class Model, Finite Mixture Model.

Journal: CSSC. Accepted on Apr 28, 2020.

1 Introduction

The latent class model (LCM) (McCutcheon (1987)) is a model for studying latent (unobserved) categorical variables by examining a group of observed categorical variables, which are regarded as indicators of the underlying latent variables. It can be regarded as a special case of the finite mixture model (FMM) in which the component distributions are categorical distributions. It is widely used to analyze ordered survey data collected from real-world applications. In many applications in econometrics, social sciences, biometrics, and business analytics (see Hagenaars and McCutcheon (2002); Oser et al. (2013) for example), a finite mixture of categorical distributions arises naturally when we sample from a population with heterogeneous subgroups. The LCM is a powerful tool for conducting statistical inference from the collected data in such situations.

We provide a motivating example from White et al. (2014) where an LCM is applied to analyze a dataset of patient symptoms recorded in the Mercer Institute of St. James’ Hospital in Dublin, Ireland (Moran et al. (2004)). The data is a recording of the presence of six symptoms displayed by $240$ patients diagnosed with early onset Alzheimer’s disease. The six symptoms are as follows: hallucination, activity, aggression, agitation, diurnal and affective, and each symptom has two states: either present or absent. White et al. (2014) proposed to divide patients into $K=3$ groups such that patients are homogeneous within each group and heterogeneous between groups. Each group’s characteristics are summarized by the LCM parameters that help doctors prepare more specialized treatments. In this sense, LCM is a typical unsupervised statistical learning method that could “learn” the group labels based on the estimated parameters.

Due to its theoretical importance and practical relevance, many different approaches have been proposed to estimate the unknown parameters in LCMs from the observed data. In general, there are two main paradigms. The first is the frequentist approach of maximum likelihood estimation (MLE), in which one maximizes the log-likelihood as a function of the unknown parameters. The second is the Bayesian approach, in which the unknown parameters are treated as random variables with assumed prior distributions; one then obtains the posterior distributions either analytically or numerically, and statistical inference is carried out based on those posterior distributions.

In recent years, significant progress has been made on Bayesian inference in LCM. White et al. (2014), by assuming the Dirichlet distribution on each unknown parameter, used Gibbs sampling to iteratively draw samples from the posterior distribution and then conduct inference on the LCM using the samples drawn. The authors also provided an implementation of the approach in R. Li et al. (2018) described a similar Bayesian approach to estimate the parameters and they also utilized the Dirichlet distribution as the prior distribution. Asparouhov and Muthén (2011) introduced a similar implementation package of Bayesian LCM in Mplus. However, compared to the fast development of the Bayesian inference via Markov chain Monte Carlo (MCMC), the frequentist’s MLE approach for LCM has largely lagged. As far as we know, researchers still heavily rely on the expectation-maximization (EM) algorithm (Dempster et al. (1977)), even with its notoriously slow convergence (see for instance Meilijson (1989)), to maximize the log-likelihood function. It is known that some authors (Jamshidian and Jennrich (1997)) use Quasi-Newton methods as alternatives for the EM algorithm in Gaussian mixture models. However, the extension to LCM is not straightforward since LCM includes a lot more intrinsic constraints on the parameters than the general Gaussian mixture model when considered as an optimization problem. More sophisticated optimization methods need to be applied when maximizing the log-likelihood function.

This paper primarily focuses on the MLE paradigm. We propose the use of two widely-used constrained optimization methods to maximize the likelihood function, namely, the Projected Quasi-Newton method and the sequential quadratic programming method. Our contributions include not only exploring alternatives beyond the EM algorithm, but also demonstrating that better results could be obtained by using these alternatives. The rest of this paper is organized as follows: in Section 2, we present the preliminaries including the log-likelihood function and the classical EM algorithm. In Section 3, we introduce and discuss the two constrained optimization methods in detail. Some simulation studies and a real world data analysis are presented in Section 4 to compare the performance of the proposed methods with the EM algorithm. We make concluding remarks in Section 6.

2 Latent Class Models and the EM Algorithm

In many applications, a finite mixture distribution arises naturally when we sample from a population with heterogeneous subgroups, indexed by $k$ taking values in $\{1,\cdots,K\}$. Consider a population composed of $K$ subgroups, mixed at random in proportion to the relative group sizes $\eta_{1},\cdots,\eta_{K}$. There is a random feature $y$, heterogeneous across and homogeneous within the subgroups. The feature $y$ obeys a different probability distribution for each subgroup, often from the same parametric family $p(y|\theta)$ with $\theta$ differing across subgroups. Suppose we sample from this population. If it is impossible to record the subgroup label, denoted by $s$, then the density $p(y)$ is:

$$p(y)=\sum_{k=1}^{K}\eta_{k}\,p(y\mid\theta_{k}),$$

which is a finite mixture distribution. In this situation, we often need to estimate the $\theta_{k}$’s as well as the $\eta_{k}$’s based on the random samples of $y$, whether the subgroup label $s$ is known or unknown. Throughout this paper, we assume that $K$ is known.

The LCM is a special case of the FMM. In the LCM, the component densities are multivariate categorical distributions. That is, $\boldsymbol{y}=(\boldsymbol{y}_{1},\cdots,\boldsymbol{y}_{d})$ with each $\boldsymbol{y}_{j}$ being a categorical random variable, taking values from $c_{j}$ categories $\{1,\cdots,c_{j}\}$. It is assumed that the $\boldsymbol{y}_{j}$’s are independent within each subgroup, where the subgroup is identified by an indicator $s$ (the latent variable), a categorical random variable taking values in $\{1,\cdots,K\}$. That is, within each subgroup, the probability density function (PDF) is written as:

$$p(y\mid\theta_{k})=\prod_{j=1}^{d}\prod_{l=1}^{c_{j}}\pi_{k,j,l}^{\mathcal{I}(y_{j}=l)},$$

where $\pi_{k,j,l}=\mbox{Pr}(\boldsymbol{y}_{j}=l|s=k)$ and $\mathcal{I}(\cdot)$ is the Iverson bracket function, i.e.,

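$$\mathcal{I}(\mbox{P})\,=\,\left\{\begin{array}{ll}1&\mbox{if P is true;}\\ 0&\mbox{if P is false.}\end{array}\right.$$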

Overall, the mixture density of latent class models is:

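$$p(y\mid\boldsymbol{\theta})=\sum_{k=1}^{K}\left(\eta_{k}\prod_{j=1}^{d}\prod_{l=1}^{c_{j}}\pi_{k,j,l}^{\mathcal{I}(y_{j}=l)}\right),$$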

where the parameters $\boldsymbol{\theta}$ include both the weight distribution $\eta$ and the $\pi_{k,j,l}$’s that define the categorical distributions.
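To make the mixture density concrete, the following is a minimal NumPy sketch that evaluates $p(y\mid\boldsymbol{\theta})$ and draws samples from an LCM. It is an illustration under stated assumptions, not the authors' implementation: categories are coded $0,\cdots,c_{j}-1$ rather than $1,\cdots,c_{j}$, `pi` is stored as a list with one $K\times c_{j}$ array per variable, and the function names `component_probs`, `mixture_density`, and `sample_lcm` are hypothetical. The parameter values are the true parameters listed for Example Bundle 1 with $d=2,K=2$.

```python
import numpy as np

def component_probs(Y, pi):
    """(N, K) array whose (i, k) entry is prod_j pi[j][k, Y[i, j]]."""
    probs = np.ones((Y.shape[0], pi[0].shape[0]))
    for j, pj in enumerate(pi):
        probs *= pj[:, Y[:, j]].T  # pj[:, Y[:, j]] has shape (K, N)
    return probs

def mixture_density(Y, eta, pi):
    """Mixture density: weighted sum of the K component probabilities."""
    return component_probs(Y, pi) @ eta

def sample_lcm(n_samples, eta, pi, rng):
    """Draw the latent label s first, then each y_j given s (0-based categories)."""
    n_classes = eta.shape[0]
    s = rng.choice(n_classes, size=n_samples, p=eta)
    Y = np.empty((n_samples, len(pi)), dtype=int)
    for j, pj in enumerate(pi):
        for k in range(n_classes):
            idx = np.flatnonzero(s == k)
            Y[idx, j] = rng.choice(pj.shape[1], size=idx.size, p=pj[k])
    return Y, s

# True parameters of Example Bundle 1, d = 2, K = 2.
eta = np.array([0.5, 0.5])
pi = [np.array([[0.4, 0.6], [0.8, 0.2]]),   # variable 1: rows are classes
      np.array([[0.1, 0.9], [0.6, 0.4]])]   # variable 2

grid = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(mixture_density(grid, eta, pi))  # approximately [0.26 0.34 0.09 0.31]
```

The four printed values sum to one, as they must for a density over the four possible outcomes.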

Suppose we have collected $N$ samples drawn from the LCM distribution, denoted by $\{y^{1},\cdots,y^{N}\}$. We write $Y=\left[y^{1},\cdots,y^{N}\right]^{T}\in\mathbb{R}^{N\times d}$ as the data matrix. The log-likelihood function is given by

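$$L(\boldsymbol{\theta}\mid Y)=\sum_{i=1}^{N}\log\left(\sum_{k=1}^{K}\eta_{k}\prod_{j=1}^{d}\prod_{l=1}^{c_{j}}\pi_{k,j,l}^{\mathcal{I}(y^{i}_{j}=l)}\right).\qquad(1)$$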

The maximum likelihood principle is to find a $\boldsymbol{\theta}^{*}$ that maximizes the log-likelihood function (1) and to use it as the estimate of $\boldsymbol{\theta}$. Clearly, we can regard the problem of finding such a $\boldsymbol{\theta}^{*}$ as an optimization problem. At the same time, we notice that the LCM implies several constraints that need to be satisfied when maximizing the log-likelihood function (1). In particular, the $\eta_{k}$’s are all nonnegative and sum to 1. Also, for each $k=1,\cdots,K$ and $j=1,\cdots,d$, the $\pi_{k,j,l}$’s are all nonnegative and sum to 1. Let $\eta=(\eta_{k})_{k=1}^{K}$ be the vector of $\eta_{k}$’s and $\pi=\left(\pi_{k,j,l}\right)_{k=1,\cdots,K;j=1,\cdots,d;l=1,\cdots,c_{j}}$ be the vector of $\pi_{k,j,l}$’s. From an optimization point of view, the MLE in the LCM case is the following optimization problem.

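$$\begin{array}{rl}\displaystyle{\max_{\eta,\pi}}&\displaystyle{\sum_{i=1}^{N}\log\left(\sum_{k=1}^{K}\eta_{k}\prod_{j=1}^{d}\prod_{l=1}^{c_{j}}\pi_{k,j,l}^{\mathcal{I}(y^{i}_{j}=l)}\right)}\\[5pt] \mbox{s.t.}&\displaystyle{\sum_{k=1}^{K}\eta_{k}}\,=\,1,\\[5pt] &\displaystyle{\sum_{l=1}^{c_{j}}\pi_{k,j,l}}\,=\,1,\,\forall\,k=1,\cdots,K,\;j=1,\cdots,d,\\ &\eta_{k}\,\geq\,0,\,\forall\,k=1,\cdots,K,\\ &\pi_{k,j,l}\,\geq\,0,\,\forall\,k=1,\cdots,K,\;j=1,\cdots,d,\;l=1,\cdots,c_{j}.\end{array}\qquad(2)$$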
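As an illustration of the objective and constraints of this problem, the sketch below evaluates the log-likelihood with a log-sum-exp step for numerical stability and checks feasibility. It assumes 0-based category codes, strictly positive parameters (so that the logarithms are finite), and a list-of-arrays layout for `pi`; the names `log_likelihood` and `is_feasible` are hypothetical and not taken from the paper.

```python
import numpy as np

def log_likelihood(Y, eta, pi):
    """Log-likelihood of the LCM; assumes every entry of eta and pi is > 0."""
    # log_joint[i, k] = log eta_k + sum_j log pi[j][k, Y[i, j]]
    log_joint = np.tile(np.log(eta), (Y.shape[0], 1))
    for j, pj in enumerate(pi):
        log_joint += np.log(pj[:, Y[:, j]]).T
    # Log-sum-exp over the K classes, then sum over the N observations.
    m = log_joint.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))))

def is_feasible(eta, pi, atol=1e-9):
    """Check nonnegativity and the K*d + 1 sum-to-one equality constraints."""
    blocks = [eta[None, :]] + list(pi)
    return all(
        bool(np.all(b >= -atol))
        and bool(np.allclose(b.sum(axis=1), 1.0, rtol=0.0, atol=atol))
        for b in blocks
    )

eta = np.array([0.5, 0.5])
pi = [np.array([[0.4, 0.6], [0.8, 0.2]]),
      np.array([[0.1, 0.9], [0.6, 0.4]])]
Y = np.array([[0, 0], [1, 1], [0, 1]])
print(is_feasible(eta, pi), log_likelihood(Y, eta, pi))
```

Working in log space avoids underflow when $d$ is large, since each component probability is a product of $d$ numbers in $[0,1]$.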

As we can see, the optimization problem (2) possesses $K\times d+1$ equality constraints together with nonnegativity constraints on all the individual decision variables. While there is a considerable number of constraints, the feasible region in (2) is simply the Cartesian product of $K\times d+1$ probability simplexes. We recall that a probability simplex in $n$-dimensional space $\mathbb{R}^{n}$ is defined as

$$\mathcal{P}^{n}\,=\,\left\{x\in\mathbb{R}^{n}_{+}\;\middle|\;\sum_{i=1}^{n}x_{i}\,=\,1\right\},$$

where $\mathbb{R}^{n}_{+}$ is the nonnegative orthant of $\mathbb{R}^{n}$. Let $\pi_{k,j}=\left(\pi_{k,j,l}\right)_{l=1}^{c_{j}}$ for all $k=1,\cdots,K$ and $j=1,\cdots,d$. The constraints in (2) can be written as $\eta\in\mathcal{P}^{K}$ and $\pi_{k,j}\in\mathcal{P}^{c_{j}},\,\,\forall\,k=1,\cdots,K;j=1,\cdots,d$.
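One practical consequence of this product-of-simplexes structure is that Euclidean projection onto the feasible region decomposes into independent projections onto each simplex. The sketch below uses a well-known sort-based procedure that runs in $O(n\log n)$ time; it is offered as one possible choice and is not necessarily the projection routine used by the authors. The function names and the list-of-arrays layout for `pi` are assumptions made for illustration.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a 1-D array v onto {x >= 0, sum(x) = 1}."""
    n = v.shape[0]
    u = np.sort(v)[::-1]                 # entries in decreasing order
    css = np.cumsum(u) - 1.0             # shifted cumulative sums
    ks = np.arange(1, n + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]   # last index kept positive
    tau = css[rho] / (rho + 1)           # common shift applied to v
    return np.maximum(v - tau, 0.0)

def project_parameters(eta, pi):
    """Project eta and every row of every pi[j] onto its own simplex."""
    new_pi = [np.apply_along_axis(project_to_simplex, 1, pj) for pj in pi]
    return project_to_simplex(eta), new_pi

print(project_to_simplex(np.array([2.0, 0.0])))       # [1. 0.]
print(project_to_simplex(np.array([0.2, 0.3, 0.5])))  # already feasible, unchanged
```

Because each block is projected separately, the cost of projecting the full parameter vector grows only with the total number of parameters (up to the logarithmic factor from sorting).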

To maximize the log-likelihood function in (1), the EM algorithm is a classical approach. In statistics, the EM algorithm is a generic framework that is commonly used to obtain maximum likelihood estimators. The EM algorithm is popular for finite mixture models because estimation in such models can be viewed as a problem with missing data. More specifically, if we knew the true label of each observation, we could obtain the MLE in a fairly straightforward fashion. On the other hand, if we knew the true model parameters, it would also be trivial to compute the probability of each observation belonging to each class. Therefore, a natural idea is to begin the process with an initial random guess of the parameters and compute the probability of each observation belonging to each class, which is the E(xpectation)-step. With those probabilities we compute the MLE, which is the M(aximization)-step. We iterate between the two steps until a convergence condition is reached. For the LCM in particular, when the EM algorithm is applied, the constraints are implicitly satisfied at every iteration thanks to the way the EM algorithm updates the values of the parameters. This nice property does not necessarily hold when other non-linear optimization algorithms are applied to the optimization problem (2).

In the context of LCM, the details of the EM algorithm are given in Algorithm 1. We make two comments on Algorithm 1. First, Algorithm 1 does not produce standard errors of the MLE as a by-product. In order to conduct statistical inference, one has to compute the observed Fisher information matrix, which could be algebraically tedious or might only apply to special cases. This is one of the criticisms often raised against the EM algorithm as compared to, for example, Bayesian analysis using Gibbs samplers, where independent posterior samples are collected and statistical inference is easy. Second, the convergence of Algorithm 1 is typically slow. Wu (1983) studied the convergence of the EM algorithm and concluded that it is sublinear when the Jacobian matrix of the unknown parameters is singular. Jamshidian and Jennrich (1997) also reported that the EM algorithm could be accelerated by the Quasi-Newton method. In Section 4, we shall also observe empirically that the two constrained optimization methods converge in fewer iterations than the EM algorithm.
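The E- and M-steps described in this section can be sketched as follows. This is a minimal NumPy illustration rather than a reproduction of Algorithm 1: categories are assumed to be coded $0,\cdots,c_{j}-1$, `pi` is a list with one $K\times c_{j}$ array per variable, the stopping rule (log-likelihood improvement below `tol`) is an assumption, and the function name `em_lcm` is hypothetical.

```python
import numpy as np

def em_lcm(Y, eta, pi, max_iter=1000, tol=1e-8):
    """EM iterations for an LCM; returns (eta, pi, log-likelihood history)."""
    N, d = Y.shape
    eta = np.asarray(eta, dtype=float).copy()
    pi = [np.asarray(p, dtype=float).copy() for p in pi]
    history = []  # history[t] is the log-likelihood at the parameters entering iteration t
    for _ in range(max_iter):
        # E-step: joint[i, k] = eta_k * prod_j pi[j][k, Y[i, j]]
        joint = np.tile(eta, (N, 1))
        for j in range(d):
            joint *= pi[j][:, Y[:, j]].T
        marginal = joint.sum(axis=1, keepdims=True)
        history.append(float(np.log(marginal).sum()))
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break  # assumed stopping rule: negligible improvement
        D = joint / marginal  # posterior class probabilities, shape (N, K)
        # M-step: weighted relative frequencies; each row sums to one by construction
        eta = D.mean(axis=0)
        for j in range(d):
            onehot = np.eye(pi[j].shape[1])[Y[:, j]]   # (N, c_j) indicator matrix
            counts = D.T @ onehot                      # (K, c_j) weighted counts
            pi[j] = counts / counts.sum(axis=1, keepdims=True)
    return eta, pi, history
```

Because every update is a normalized weighted count, the iterates stay on the probability simplexes without any explicit projection, which is the implicit constraint satisfaction noted above.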

3 Constrained Optimization Methods

Motivated by the significant progress in constrained non-linear optimization, as well as the constrained nature of the LCM estimation problem, we propose to apply two non-linear optimization approaches to solve the optimization problem (2). We note that the EM algorithm is closely related to a gradient descent method (Wu, 1983), whose convergence rate is at most linear. On the other hand, it is known in optimization theory that if second-order information is utilized in the algorithm, quadratic convergence may be achieved, as in the classical Newton’s method. However, in many applications it is computationally very expensive to obtain the second-order information, i.e., the Hessian matrix. One remedy is to use a computationally cheap approximation of the Hessian matrix. This idea leads to the family of Quasi-Newton methods in the unconstrained case. While the convergence rate is typically only superlinear, the per-iteration cost (both execution time and memory usage) is significantly reduced. In the constrained case, sophisticated methods have been developed to deal with the constraints. Given that it is relatively easy to solve a constrained optimization problem when the objective function is quadratic and the constraints are all linear, one idea in constrained non-linear optimization is to approximate the objective function (or the Lagrangian function) by a quadratic function (via a second-order Taylor expansion at the current solution) and to approximate the constraints by linear constraints (via a first-order Taylor expansion at the current solution). A new solution is obtained by solving this approximation, and a new approximation can then be constructed at the new solution. Analogous to quasi-Newton methods in the unconstrained case, in the constrained case we can also consider an approximate Taylor expansion without having to compute the Hessian matrix exactly.
Once an approximate quadratic program is obtained, one may use different approaches to solve it. For example, one can use an active set method or an interior point method when the quadratic program does not possess any specific structure. When the projection onto the feasible region of the quadratic program is easily computable (typically in strongly polynomial time), a gradient projection method can be applied to solve the quadratic program approximation. As we have seen, the feasible region of optimization problem (2) is the Cartesian product of probability simplexes. It is known that the projection onto a probability simplex is computable in strongly polynomial time. Therefore, it is reasonable to apply a projection method to solve the quadratic program approximation. In the following subsections, the two approaches we propose are discussed in detail. In both approaches, we need to evaluate the gradient of the LCM log-likelihood function. We provide the analytical expressions below. For the $\eta$ part, we have:

[TABLE]

For the $\pi$ part, we have for all $i=1,\cdots,n;k=1,\cdots,K;j=1,\cdots,d;l=1,\cdots,c_{j}$:

[TABLE]

where $\pi_{k}=(\pi_{k,j,l})_{j=1,\cdots,d;l=1,\cdots,c_{j}}$ . And therefore, for all $k=1,\cdots,K$ ,

[TABLE]

3.1 Limited Memory Projected Quasi-Newton Method

We first present the Projected Quasi-Newton method proposed by Schmidt et al. (2009). We augment it with the algorithm proposed by Wang and Carreira-Perpiñán (2013) to project parameters onto a probability simplex in strongly polynomial time. In general, we address the problem of minimizing a differentiable function $f(x)$ over a convex set $\mathcal{C}$ subject to $m$ equality constraints:

[TABLE]

In an iterative algorithm, we update the next iteration as follows:

[TABLE]

where $x^{(t)}$ is the solution at the $t$ -th iteration, $\alpha_{t}$ is the step length and $d^{(t)}$ is the moving direction at iteration $t$ . Different algorithms differ in how $d^{(t)}$ and $\alpha_{t}$ are determined. In the Projected Quasi-Newton method, a quadratic approximation of the objective function around the current iterate $x^{(t)}$ is constructed as follows.

[TABLE]

where $g^{(t)}=\nabla f(x^{(t)})$ and $B^{(t)}$ denotes a positive-definite approximation of the Hessian $\nabla^{2}f(x^{(t)})$. The projected quasi-Newton method then computes a feasible descent direction by minimizing this quadratic approximation subject to the original constraints:

[TABLE]

Then the moving direction is $d^{(t)}=z^{(t)}-x^{(t)}$ .

To determine the step length $\alpha_{t}$, we ensure that a sufficient decrease condition, such as the Armijo condition, is met:

[TABLE]

where $\nu\in(0,1)$ .
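A backtracking search is one common way to enforce such a sufficient decrease condition. The sketch below assumes the standard Armijo form $f(x+\alpha d)\leq f(x)+\nu\alpha\,g^{\top}d$; the function name `armijo_step`, the halving factor and the cap on the number of halvings are illustrative choices, not taken from the paper.

```python
def armijo_step(f, x, d, g, nu=1e-4, alpha=1.0, shrink=0.5, max_halvings=50):
    """Shrink alpha until f(x + alpha*d) <= f(x) + nu * alpha * g^T d."""
    fx = f(x)
    slope = sum(gi * di for gi, di in zip(g, d))  # directional derivative g^T d
    for _ in range(max_halvings):
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        if f(trial) <= fx + nu * alpha * slope:
            return alpha
        alpha *= shrink
    return alpha  # fallback: smallest step tried


# Usage on f(x) = x_1^2 + x_2^2 with the steepest-descent direction d = -g
f = lambda x: sum(xi * xi for xi in x)
step = armijo_step(f, x=[1.0, 1.0], d=[-2.0, -2.0], g=[2.0, 2.0])
```

In this toy case the full step overshoots to $(-1,-1)$ and gives no decrease, so the search halves the step once and lands on the minimizer.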

Although the projected Newton method just summarized has many appealing theoretical properties, several obstacles prevent its efficient implementation in its original form. A major shortcoming is that minimizing (7) could be as difficult as optimizing (5). In Schmidt et al. (2009), the projected Newton method was modified into a more practical version, which uses the limited memory BFGS update to obtain the $B^{(t)}$’s and a Spectral Projected Gradient (SPG) algorithm (Birgin et al., 2000) to solve the quadratic approximation (7).

To apply this Projected Quasi-Newton method to (2), we let $f(\boldsymbol{\theta}):=-L(\boldsymbol{\theta}|Y)$ . As we discussed in the previous section, we rewrite (2) as follows:

[TABLE]

where $\mathcal{F}=\mathcal{P}^{K}\otimes\bigotimes_{k=1}^{K}\bigotimes_{j=1}^{d}\mathcal{P}^{c_{j}}$ is the feasible region, written as the Cartesian product of $K\times d+1$ probability simplexes. This rewriting facilitates the projection operation. We denote by $\Pi_{S}(x)$ the projection of a vector $x\in\mathbb{R}^{n}$ onto a closed convex set $S\subseteq\mathbb{R}^{n}$, i.e., $\Pi_{S}(x)$ is the unique solution of the following quadratic program:

[TABLE]
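For a single probability simplex, this projection does not require a general-purpose quadratic programming solver. The sketch below follows the well-known sort-based procedure associated with Wang and Carreira-Perpiñán (2013); it is not a copy of the paper's Algorithm 4, and the helper names `project_simplex` and `project_blocks` are assumptions. Since the feasible region is a Cartesian product, each block of parameters can be projected independently.

```python
def project_simplex(y):
    """Euclidean projection of y onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(y, reverse=True)
    cumsum, lam = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        shift = (1.0 - cumsum) / j
        if uj + shift > 0:   # holds for j = 1..rho and fails afterwards
            lam = shift      # keep the shift of the largest valid j
    return [max(yi + lam, 0.0) for yi in y]


def project_blocks(theta, sizes):
    """Project consecutive blocks of theta onto their own simplexes."""
    out, start = [], 0
    for size in sizes:
        out.extend(project_simplex(theta[start:start + size]))
        start += size
    return out
```

The dominant cost is the sort, so one projection of a block of length $c$ takes $O(c\log c)$ operations.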

As we can see, in general, a quadratic program needs to be solved to compute the projection onto a closed convex set, which is not computationally cheap. Fortunately, the feasible region in (9) allows for a projection computable in strongly polynomial time according to Wang and Carreira-Perpiñán (2013). This algorithm is presented in Algorithm 4. It is the building block of the SPG algorithm used to solve the quadratic approximation in each iteration. More specifically, in the $t$-th iteration, let

[TABLE]

where $g^{(t)}=\nabla f(\boldsymbol{\theta}^{(t)})$ and $B^{(t)}$ denotes a positive-definite approximation of the Hessian $\nabla^{2}f(\boldsymbol{\theta}^{(t)})$ . The quadratic approximation is now given by

[TABLE]

The gradient of $q_{t}(\boldsymbol{\theta})$ is given by

[TABLE]

In our implementation, $\nabla f(\boldsymbol{\theta}^{(t)})$ is numerically approximated by a symmetric difference quotient with step length $0.05$. Alternatively, $\nabla f(\boldsymbol{\theta}^{(t)})$ can be computed using the analytical expressions (3) and (4).
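As an illustration of the numerical option, a symmetric (central) difference gradient can be sketched as follows. The name `central_diff_grad` is hypothetical, and reading the length $0.05$ as the half-width $h$ of the quotient is an assumption.

```python
def central_diff_grad(f, x, h=0.05):
    """Approximate the gradient of f at x by (f(x + h e_i) - f(x - h e_i)) / (2h)."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        dn = list(x)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2.0 * h))
    return grad


# Usage: the quotient is exact (up to rounding) for quadratic functions
g = central_diff_grad(lambda x: x[0] ** 2 + 3.0 * x[1], [1.0, 2.0])
```

One caveat of this sketch is that a perturbation of size $h$ can push a small probability outside $[0,1]$, where the log-likelihood is undefined, so the analytical gradient is the safer choice near the boundary of the feasible region.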

We update $B^{(t)}$ using the limited memory version of BFGS. The non-limited memory BFGS update of $B$ is given by

[TABLE]

where $s^{(t)}=\boldsymbol{\theta}^{(t+1)}-\boldsymbol{\theta}^{(t)}$ and $y^{(t)}=\nabla f(\boldsymbol{\theta}^{(t+1)})-\nabla f(\boldsymbol{\theta}^{(t)})$ . This will consume significant memory in storing $B^{(t)}$ ’s when the number of features increases dramatically. Therefore, in the proposed Projected Quasi-Newton algorithm we only keep the most recent $m=5$ $Y$ and $S$ arrays (the definitions of $Y$ and $S$ are in Algorithm 2) and update $B^{(t)}$ using its compact representation described by Byrd et al. (1994):

[TABLE]

where $N^{(t)}$ and $M^{(t)}$ are explicitly given in equation (3.5) of Byrd et al. (1994).
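For reference, the full-memory BFGS update of $B$ in its standard textbook form can be sketched as below, together with a bounded buffer that retains only the $m=5$ most recent $(s,y)$ pairs. This is not the compact representation of Byrd et al. (1994); the names `bfgs_update` and `history` are assumptions.

```python
from collections import deque

import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update: B - (B s s^T B) / (s^T B s) + (y y^T) / (y^T s)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)


# A limited-memory variant stores recent pairs instead of the dense matrix.
history = deque(maxlen=5)  # the oldest pair is discarded automatically
s = np.array([1.0, 0.0])
y = np.array([2.0, 1.0])
history.append((s, y))
B_new = bfgs_update(np.eye(2), s, y)
```

The updated matrix satisfies the secant equation $B^{(t+1)}s^{(t)}=y^{(t)}$ and stays positive definite whenever $y^{(t)\top}s^{(t)}>0$.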

In addition, when Algorithm 2 is run until convergence, the $B$ matrix is produced as a by-product. Since $f=-L$, the matrix $-B$ approximates the Hessian of the log-likelihood, so that $B$ approximates the observed Fisher information of the unknown parameters, which enables us to construct asymptotic confidence intervals using the following classical results:

[TABLE]

This makes statistical inference considerably easier than with the EM algorithm. According to Gower and Richtárik (2017), when $f$ is a convex quadratic function with a positive definite Hessian matrix, $B^{(t)}$ from the Quasi-Newton method is expected to converge to the true Hessian matrix of $f$. However, the negative log-likelihood function is clearly not convex, and as far as we know there is no formal theory that guarantees this convergence. Nonetheless, in Section $6$ of Jamshidian and Jennrich (1997), the authors empirically compared the estimates of the standard errors to the true values, and the results are satisfactory.

In our implementation of Algorithm 2, we use $m=5,\epsilon=10^{-4}$ and the default parameters are $\alpha_{\text{min}}=10^{-10}$ , $\alpha_{\text{max}}=10^{10}$ , $h=1$ and $\nu=10^{-4}$ in Algorithm 3.

3.2 Sequential Quadratic Programming

Sequential quadratic programming (SQP) is a generic method for non-linear optimization with constraints. It is known as one of the most efficient computational methods for solving the general nonlinear programming problem in (5) subject to both equality and inequality constraints. There are many variants of this algorithm; we use the version considered in Kraft (1988). We give a brief review of this method and then discuss how it can be applied to optimization problem (2).

Consider the following minimization problem

[TABLE]

where $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is the objective function. SQP is also an iterative method: in each iteration, a quadratic approximation of the original problem is constructed and solved to obtain the moving direction. Compared to the Projected Quasi-Newton method, the quadratic approximations in SQP are typically solved by an active set method or an interior point method rather than a projection-type method. This significantly complicates the algorithm, but also allows it to handle more general non-linear optimization problems, especially when the feasible region is too complex to admit an efficient projection computation. In particular, starting with a given vector of parameters $x^{(0)}$, the moving direction $d^{(t)}$ at iteration $t$ is determined by a quadratic programming problem, which is formulated from a quadratic approximation of the Lagrangian function and a linear approximation of the constraints. Note that, in contrast to the Projected Quasi-Newton method presented in the previous subsection, the SQP algorithm here approximates the Lagrangian function instead of the objective function itself. An advantage is that dual information can be incorporated in the algorithm to ensure better convergence properties. Let

[TABLE]

be the Lagrangian function associated with this optimization problem. This approximation is of the following standard form of quadratic programming:

[TABLE]

with

[TABLE]

as proposed in Wilson (1963). The multiplier $\lambda^{(t)}$ is updated using the multipliers of the constraints in (18).

In terms of the step length $\alpha$, Han (1977) proved that a one-dimensional minimization of the non-differentiable exact penalty function

[TABLE]

with $|c_{j}(x)|_{-}=|\min\left(0;c_{j}(x)\right)|$ , as a merit function $\varphi:\mathbb{R}\rightarrow\mathbb{R}$

[TABLE]

with $x^{(t)}$ and $d^{(t)}$ fixed, leads to a step length $\alpha$ guaranteeing global convergence for values of the penalty parameters $\varrho_{j}$ greater than some lower bounds. Then, Powell (1978) proposed to update the penalty parameters according to

[TABLE]

where $\mu_{j}$ denotes the Lagrange multiplier of the $j$ -th constraint in the quadratic problem and $\varrho_{j}^{-}$ is the $j$ -th penalty parameter of the previous iteration, starting with some $\varrho_{j}^{0}=0$ .

It is important in practical applications to not evaluate $B^{(t)}$ in (19) in every iteration, but to use only first order information to approximate the Hessian matrix of the Lagrange function in (17). Powell (1978) proposed the following modification:

[TABLE]

with

[TABLE]

where

[TABLE]

and $\gamma_{t}$ is chosen as

[TABLE]

which ensures that $B^{(t+1)}$ remains positive definite within the linear manifold defined by the tangent planes to active constraints at $x^{(t+1)}$ .
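The effect of this modification can be illustrated with the damped BFGS update in its standard textbook form. The exact expressions in the equations above may differ in notation: the names `damped_bfgs_update`, `y_lag` (the change in the gradient of the Lagrangian) and `q`, as well as the thresholds $0.2$ and $0.8$, are assumptions based on the usual statement of Powell's rule.

```python
import numpy as np

def damped_bfgs_update(B, s, y_lag):
    """Powell-style damped BFGS update that keeps B positive definite.

    s     : step x^(t+1) - x^(t)
    y_lag : change in the gradient of the Lagrangian over the step
    """
    Bs = B @ s
    sBs = s @ Bs
    sy = s @ y_lag
    # Damping factor: 1 when the curvature s^T y is large enough, else shrink
    gamma = 1.0 if sy >= 0.2 * sBs else 0.8 * sBs / (sBs - sy)
    q = gamma * y_lag + (1.0 - gamma) * Bs  # blend ensures s^T q >= 0.2 s^T B s
    return B + np.outer(q, q) / (s @ q) - np.outer(Bs, Bs) / sBs


# Usage with negative curvature (s^T y < 0), where plain BFGS would break down
B_next = damped_bfgs_update(np.eye(2), np.array([1.0, 0.0]), np.array([-1.0, 0.5]))
```

Blending the gradient change with $B^{(t)}s^{(t)}$ guarantees a positive curvature term in the denominator, which is what preserves positive definiteness.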

In LCM, the problem turns out to be simpler: the quadratic programming problem in (18) is only subject to $m_{e}$ equality constraints. In addition, unlike the Projected Quasi-Newton method, for which we have to build our own solver, SQP has a popular implementation in Python’s SciPy package. The package uses a variant of SQP, Sequential Least SQuares Programming (SLSQP), which replaces the quadratic programming problem in (18) by a linear least squares problem using a stable $LDL^{T}$ factorization of the matrix $B^{(t)}$.
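As a small illustration of how SLSQP is called through SciPy, the sketch below maximizes a multinomial log-likelihood over a single probability simplex. This toy problem stands in for the full LCM likelihood; the counts, the lower bound of $10^{-8}$ and all variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

counts = np.array([30.0, 50.0, 20.0])  # hypothetical category counts

def neg_loglik(p):
    # Negative multinomial log-likelihood (up to an additive constant)
    return -np.sum(counts * np.log(p))

def neg_loglik_grad(p):
    return -counts / p

result = minimize(
    neg_loglik,
    x0=np.full(3, 1.0 / 3.0),
    jac=neg_loglik_grad,
    method="SLSQP",
    bounds=[(1e-8, 1.0)] * 3,  # keep probabilities strictly positive
    constraints=[{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}],
)
p_hat = result.x
```

For the LCM, one equality constraint of this kind would be supplied for each simplex in the feasible region, with the negative log-likelihood as the objective. In this toy problem the maximizer is known in closed form (the relative frequencies), which makes the output easy to check.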

4 Simulation Studies and Real Data Analysis

In this section, we provide four example bundles and one real data analysis to demonstrate the performance of the proposed methods. The model specifications of the four example bundles are as follows:

1. Example Bundle $1$, $N=500$: (A) $d=1,K=2$; (B) $d=1,K=3$; (C) $d=2,K=2$; (D) $d=4,K=2$

2. Example Bundle $2$, $N=1000$: (A) $d=2,K=2$; (B) $d=2,K=3$; (C) $d=3,K=2$; (D) $d=3,K=3$

3. Example Bundle $3$, $N=2000$: (A) $d=3,K=3$; (B) $d=3,K=4$; (C) $d=4,K=4$; (D) $d=5,K=3$

4. Example Bundle $4$, $N=5000$: (A) $d=4,K=4$; (B) $d=4,K=5$; (C) $d=5,K=4$; (D) $d=5,K=5$

One dataset is simulated from a latent class model for each combination. In total, we consider $16$ datasets with different combinations of sample size, dimensionality and number of groups, providing a comprehensive picture of the model performance.

4.1 Example Bundle 1

In this example bundle, we use the following three methods to maximize the log-likelihood function: (1) EM, (2) SQP, and (3) Projected Quasi-Newton (QN). Each method is repeated $10$ times with different initial values across the $10$ runs. At each run, the three methods begin with identical initial values. The true weights and categorical parameters are reported in Tables 8, 9, 10, 11 in the appendix. Side-by-side boxplots of the number of iterations and the log-likelihood values of the $10$ runs are reported in Figure 1 and Figure 2, respectively. For each method, the best result based on the log-likelihood values across the $10$ runs is given in Table 1. Results from the true parameters are also included in Table 1 for comparison.

From Table 1, Figure 1 and Figure 2, we observe that the two proposed optimization methods perform well compared to the traditional EM algorithm: their log-likelihood values are very close to those of EM for all four datasets in this example bundle. Note that the vertical axis scales are different in Figure 1. The numbers of iterations of the two proposed optimization methods are clearly lower than that of EM; for example, SQP and Projected QN both take $12$ iterations compared to $88$ for the EM algorithm. This suggests that the two optimization methods are less likely to get stuck in local maxima.

In addition, there are no substantial differences among the solutions across the $10$ runs. In fact, the final best results are quite close to the results obtained from the other $9$ runs. Using scenario (A) with $d=1,K=2$ in this bundle as an example, we divide the $10$ log-likelihood values into two groups, where the first group contains only the largest log-likelihood value and the second group contains the remaining nine, and then apply a non-parametric two-group Wilcoxon signed-rank test (Bauer, 1972). The p-value is $0.20$, which is clearly larger than the usual $0.05$ threshold. The parametric t-test might not work well here because the group sizes are too small. Moreover, the estimated weights and categorical parameters from the $10$ runs are also close to each other. We repeat the same test on the estimates of each of the weights and categorical parameters, and none of the p-values is smaller than $0.05$.
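The reported p-value can be reproduced by direct enumeration if the two-group comparison is read as an exact two-sided rank-sum test, which is an assumption about how the test was carried out. The function name `exact_rank_sum_pvalue` and the log-likelihood values below are hypothetical; only their ordering matters.

```python
from itertools import combinations

def exact_rank_sum_pvalue(group1, group2):
    """Exact two-sided rank-sum p-value by enumeration (assumes no ties)."""
    pooled = sorted(group1 + group2)
    rank = {v: r for r, v in enumerate(pooled, start=1)}
    w_obs = sum(rank[v] for v in group1)
    all_ranks = range(1, len(pooled) + 1)
    # Null distribution: every subset of ranks of size len(group1) is equally likely
    sums = [sum(c) for c in combinations(all_ranks, len(group1))]
    lower = sum(w <= w_obs for w in sums) / len(sums)
    upper = sum(w >= w_obs for w in sums) / len(sums)
    return min(1.0, 2.0 * min(lower, upper))


best = [-335.25]                                  # largest value, hypothetical
rest = [-335.26 - 0.01 * i for i in range(9)]     # nine smaller values, hypothetical
p_value = exact_rank_sum_pvalue(best, rest)
```

Under this reading, the single largest value always receives rank $10$ out of $10$, so the exact two-sided p-value is $2\times 1/10=0.20$ and depends only on the group sizes of $1$ and $9$.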

4.2 Example Bundle 2

The true weights and categorical parameters are reported in Tables 12, 13, 14, 15 in the appendix. As in Example Bundle 1, each method is repeated $10$ times with different initial values across the $10$ runs. At each run, the three methods begin with identical initial values. The simulation results for this bundle are summarized in Figures 3 and 4 for the number of iterations and the log-likelihood, respectively. Similarly, for each method, the best result based on the log-likelihood values among the $10$ runs is given in Table 2. Results from the true parameters are also included in Table 2 for comparison.

From Table 2, Figure 3 and Figure 4, we observe a pattern similar to that in Example Bundle 1: the log-likelihood values are close to each other, whereas the numbers of iterations of the two optimization methods are smaller than that of EM. This further shows the promise of using the proposed optimization methods as alternatives in practice.

4.3 Example Bundle 3

With exactly the same settings, we report results for Example Bundle 3 in this section. The resulting numbers of iterations and log-likelihood values are reported in Figures 5 and 6, respectively. For each method, the best result based on the log-likelihood values among the $10$ runs is given in Table 3. Results from the true parameters are also included in Table 3 for comparison. The true weights and categorical parameters are reported in Tables 16, 17, 18, 19 in the appendix.

From Table 3, Figure 5 and Figure 6, we observe a pattern similar to that in the previous examples: the numbers of iterations of the two optimization methods are much smaller than that of EM, while the log-likelihood values are quite close to each other for the three methods.

4.4 Example Bundle 4

The resulting numbers of iterations and log-likelihood values of Example Bundle $4$ are reported in Figures 7 and 8, respectively. For each method, the best result based on the log-likelihood values among the $10$ runs is given in Table 4. Results from the true parameters are also included in Table 4 for comparison. The true weights and categorical parameters are reported in Tables 20, 21, 22, 23 in the appendix.

The results in Example Bundle 4 further confirm what we have observed: with the same settings, the two optimization methods converge in fewer iterations than EM, while still yielding log-likelihood values comparable to those of EM. This strengthens the promise of using the two proposed optimization methods as alternatives to EM when estimating a latent class model.

4.5 An Application

We now return to the motivating example discussed in Section 1. The data set is available in the R package BayesLCA. White et al. (2014) fitted a $K=3$ latent class model to the data using a Gibbs sampler. This is an $n=240,d=6$ binary data set. We follow the recommendation of Moran et al. (2004) and fit a $K=3$ latent class model with (1) EM, (2) SQP, and (3) Projected Quasi-Newton methods from $10$ different initial points, and the best result of each method is recorded based on the log-likelihood value. The results are summarized in Table 5. The result from the BayesLCA package (White et al., 2014) is also included. The side-by-side boxplots of the number of iterations and the log-likelihood values of the $10$ runs are reported in Figure 9.

From Table 5, SQP has the best performance in terms of both the log-likelihood value and the number of iterations. The results from EM and Projected Quasi-Newton are very similar, although EM needs far more iterations to converge. This agrees with the previous observations. We also note that all three methods considered attain larger log-likelihood values than BayesLCA, the method proposed by White et al. (2014), which has the smallest log-likelihood value.

In addition, since we do not know the true values, we compute the pairwise root mean squared error (RMSE) based on the estimates, i.e., we compute the RMSE between the estimates of every two methods. Since we have considered four different methods, we obtain six RMSEs, one for each pair of methods. The results are reported in Table 6.

The results in Table 6 are consistent with the observations we have made. Since the log-likelihood values of EM, SQP and Projected QN are closer to one another, the pairwise RMSEs among these three methods are much lower than those involving BayesLCA. For example, the RMSE between SQP and EM is $0.029$, while the RMSE between SQP and BayesLCA is $0.241$, which is over eight times larger.
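The pairwise comparison can be sketched as follows. The estimate vectors are made-up placeholders rather than values from Table 5, the helper names are hypothetical, and the sketch assumes that each method's estimates have been flattened in the same order with class labels already aligned across methods.

```python
from itertools import combinations
from math import sqrt

def rmse(a, b):
    """Root mean squared difference between two equally long estimate vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def pairwise_rmse(estimates):
    """RMSE for every unordered pair of methods."""
    return {(m1, m2): rmse(estimates[m1], estimates[m2])
            for m1, m2 in combinations(estimates, 2)}


estimates = {                      # hypothetical flattened parameter estimates
    "EM": [0.2, 0.8],
    "SQP": [0.2, 0.8],
    "Projected QN": [0.3, 0.7],
    "BayesLCA": [0.5, 0.5],
}
table = pairwise_rmse(estimates)   # four methods give six pairs
```

Aligning class labels before the comparison matters because latent classes are identified only up to a permutation.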

5 Discussion

In the previous section, we have shown that the number of iterations of the proposed methods is smaller than that of the EM algorithm. In this section, we report the comparison of CPU times. Taking the application as an example, the runtimes are reported in Table 7.

From Table 7, we can see that the EM algorithm indeed has the lowest CPU time per iteration. However, when the number of iterations is taken into account, the story is different. Using SQP as an example, the numbers of iterations of EM and SQP are $302$ and $44$, respectively. The number of iterations of the SQP algorithm is thus around $1/7$ of that of the EM algorithm, while its CPU time per iteration is only about four times longer. Therefore, the total computational time of the SQP algorithm is significantly less than that of the EM algorithm. In the application example, the computational times of the SQP and Projected QN methods are respectively $43\%$ and $19\%$ lower than that of the EM algorithm.
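The trade-off between per-iteration cost and iteration count can be checked with a short calculation. The iteration counts are those quoted in the text; treating the per-iteration cost ratio as exactly four is an assumption, since the text only says "about four".

```python
iters_em, iters_sqp = 302, 44   # iteration counts quoted in the text
per_iter_ratio = 4.0            # SQP cost per iteration relative to EM (rounded)

# Total SQP time as a fraction of total EM time
relative_total = (iters_sqp / iters_em) * per_iter_ratio
saving = 1.0 - relative_total   # fraction of EM's total time that is saved
```

With these rounded inputs the saving comes out at roughly $42\%$, in line with the $43\%$ reported from the measured CPU times.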

6 Concluding Remarks

The primary research objective of the paper is to provide alternative methods to learn the unknown parameters of the latent class model. Given the log-likelihood as a function of the parameters, we aim to find estimators that maximize the log-likelihood function. The traditional way is to use the EM algorithm. However, it is observed that the EM algorithm converges slowly. Therefore, in this paper, we propose the use of two constrained optimization methods, namely the Sequential Quadratic Programming and the Projected Quasi-Newton methods, as alternatives. Simulation studies and the real example in Section 4 reveal that the two proposed methods perform well. The advantages we observed are as follows: (1) the two optimization methods produce slightly larger log-likelihood values than the EM algorithm; (2) they converge in significantly fewer iterations than the EM algorithm. That being said, we want to make it clear that the aim is not to completely replace the EM algorithm; rather, we would like to provide alternative ways of achieving the same goal using optimization methods. Inter-disciplinary collaboration between researchers in statistics and mathematical optimization has never been as important as in the big data era.

About the Authors

Hao Chen received his Ph.D. in Statistics from the University of British Columbia and is currently a Senior Data Scientist at Precima. Lanshan Han holds a Ph.D. in Decision Sciences and Engineering Systems from the Rensselaer Polytechnic Institute and is currently a Director of Research and Development at Precima. Alvin Lim received his Ph.D. in Mathematical Sciences from the Johns Hopkins University and is currently Precima’s Chief Scientist and Vice President for Research and Development.

Appendix

The Python source code for EM and Projected Quasi-Newton for LCM is available upon request. The implementation of SQP is available in the Python SciPy package.

The true weights and parameters used in Section 4 are given below.

Example Bundle 1

Example Bundle 2

Example Bundle 3

Example Bundle 4

Bibliography


Asparouhov, T., Muthén, B., 2011. Using Bayesian priors for more flexible latent class analysis, American Statistical Association.
Bauer, D., 1972. Constructing confidence sets using rank statistics. Journal of the American Statistical Association 67, 687–690.
Birgin, E.G., Martínez, J.M., Raydan, M., 2000. Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization 10, 1196–1211.
Byrd, R.H., Nocedal, J., Schnabel, R.B., 1994. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming 63, 129–156.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1–38.
Gower, R.M., Richtárik, P., 2017. Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM Journal on Matrix Analysis and Applications 38, 1380–1409.
Hagenaars, J.A., McCutcheon, A.L., 2002. Applied latent class analysis. 64, Cambridge University Press.
Han, S.P., 1977. A globally convergent method for nonlinear programming. Journal of Optimization Theory and Applications 22, 297–309.