A Proximal Point Dual Newton Algorithm for Solving Group Graphical Lasso Problems

Yangjing Zhang; Ning Zhang; Defeng Sun; Kim-Chuan Toh

arXiv:1906.04647·math.OC·April 23, 2021·SIAM J. Optim.

TL;DR

This paper introduces an efficient proximal point dual Newton algorithm for solving the group graphical Lasso model, enabling simultaneous learning of multiple related graphical models with shared sparsity patterns.

Contribution

The paper proposes a novel PPDNA method for the non-polyhedral group graphical Lasso, achieving superlinear convergence and demonstrating high efficiency and robustness in experiments.

Findings

01

Superlinear convergence of PPDNA for group graphical Lasso.

02

High efficiency and robustness shown in numerical experiments.

03

Effective in learning multiple related graphical models.

Tables (3)

Table 1: Performance of PPDNA (P), ADMM (A), and MGL (M) on university webpages data.

Problem $(p, K)$	$(λ_{1}, λ_{2})$	Density	Iteration P	Iteration A	Iteration M	Time P	Time A	Time M	Error P	Error A	Error M
Webtest (100,4)	(1e-02,1e-03)	0.016	16(24)	501	4	02	04	10	3.2e-07	9.9e-07	2.9e-07
Webtest (100,4)	(5e-03,5e-04)	0.048	16(25)	501	6	02	04	13	3.2e-07	9.9e-07	2.6e-07
Webtest (100,4)	(1e-03,1e-04)	0.225	14(22)	529	32	02	04	56	2.4e-07	9.9e-07	1.0e-06
Webtest (200,4)	(1e-02,1e-03)	0.008	14(24)	850	5	08	25	37	7.9e-07	1.0e-06	6.7e-08
Webtest (200,4)	(5e-03,5e-04)	0.026	14(27)	679	7	10	19	50	7.9e-07	9.8e-07	4.1e-07
Webtest (200,4)	(1e-03,1e-04)	0.163	13(23)	503	77	07	11	05:57	2.7e-07	9.9e-07	1.6e-06
Webtest (300,4)	(5e-03,5e-04)	0.016	14(29)	744	8	28	33	02:39	5.6e-07	9.9e-07	5.3e-08
Webtest (300,4)	(1e-03,1e-04)	0.125	16(32)	487	205	45	22	21:51	5.9e-07	9.9e-07	1.8e-06
Webtest (300,4)	(5e-04,5e-05)	0.256	14(35)	668	1128	55	30	01:37:51	3.9e-07	9.9e-07	2.1e-06
Webtrain (100,4)	(1e-02,1e-03)	0.012	20(34)	1601	3	03	11	08	1.2e-07	1.0e-06	6.0e-06
Webtrain (100,4)	(5e-03,5e-04)	0.033	20(34)	1601	5	03	12	04	1.2e-07	1.0e-06	7.4e-07
Webtrain (100,4)	(1e-03,1e-04)	0.165	20(34)	1601	22	04	13	43	1.2e-07	1.0e-06	7.8e-06
Webtrain (200,4)	(5e-03,5e-04)	0.016	20(39)	1325	7	13	27	01:18	1.3e-07	1.0e-06	1.3e-06
Webtrain (200,4)	(1e-03,1e-04)	0.108	20(37)	1397	31	11	31	02:49	9.9e-08	1.0e-06	5.0e-06
Webtrain (200,4)	(5e-04,5e-05)	0.219	20(39)	1397	88	16	32	05:00	1.1e-07	1.0e-06	5.3e-06
Webtrain (300,4)	(5e-03,5e-04)	0.011	20(62)	1826	10	01:04	01:37	08:35	2.4e-07	9.7e-07	2.7e-06
Webtrain (300,4)	(1e-03,1e-04)	0.080	19(33)	1196	45	21	52	09:55	3.4e-07	1.0e-06	6.0e-06
Webtrain (300,4)	(5e-04,5e-05)	0.177	19(36)	1196	134	36	52	13:00	3.7e-07	1.0e-06	6.0e-06

Table 2: Performance of PPDNA (P), ADMM (A), and MGL (M) on 20 newsgroups data.

Problem $(p, K)$	$(λ_{1}, λ_{2})$	Density	Iteration P	Iteration A	Iteration M	Time P	Time A	Time M	Error P	Error A	Error M
NGcomp test (300,5)	(5e-03,5e-04)	0.021	15(22)	509	31	16	26	37:08	6.5e-08	9.9e-07	1.1e-06
NGcomp test (300,5)	(1e-03,1e-04)	0.099	16(26)	625	510	32	34	01:20:37	7.9e-07	1.0e-06	2.0e-06
NGcomp test (300,5)	(5e-04,5e-05)	0.210	14(24)	494	1481	40	27	03:00:00	7.2e-07	1.0e-06	6.1e-06
NGrec test (300,4)	(5e-03,5e-04)	0.004	21(38)	1331	5	15	49	04:04	8.0e-08	1.0e-06	4.9e-07
NGrec test (300,4)	(1e-03,1e-04)	0.063	21(39)	1331	13	20	58	04:28	8.2e-08	1.0e-06	1.9e-06
NGrec test (300,4)	(5e-04,5e-05)	0.143	20(37)	1331	36	20	58	07:49	3.7e-07	1.0e-06	3.7e-06
NGsci test (300,4)	(5e-03,5e-04)	0.006	17(26)	542	6	14	21	05:25	3.8e-07	1.0e-06	2.1e-06
NGsci test (300,4)	(1e-03,1e-04)	0.075	17(27)	553	21	19	24	11:17	3.9e-07	9.8e-07	2.1e-06
NGsci test (300,4)	(5e-04,5e-05)	0.167	17(31)	550	87	25	24	17:13	5.0e-07	9.9e-07	2.6e-06
NGtalk test (300,3)	(5e-03,5e-04)	0.026	15(25)	482	16	24	26	26:08	9.2e-08	9.6e-07	4.1e-07
NGtalk test (300,3)	(1e-03,1e-04)	0.115	12(23)	278	81	20	13	13:39	1.2e-07	9.9e-07	1.1e-06
NGtalk test (300,3)	(5e-04,5e-05)	0.240	11(22)	286	337	25	13	40:28	9.4e-08	9.7e-07	2.2e-06
NGcomp train (300,5)	(5e-03,5e-04)	0.016	16(31)	1150	13	33	57	14:58	1.2e-07	1.0e-06	5.6e-08
NGcomp train (300,5)	(1e-03,1e-04)	0.080	15(31)	1153	172	35	01:04	40:11	4.6e-07	1.0e-06	1.9e-06
NGcomp train (300,5)	(5e-04,5e-05)	0.153	15(30)	1216	574	33	01:07	01:12:51	4.4e-07	1.0e-06	1.8e-06
NGrec train (300,4)	(5e-03,5e-04)	0.005	19(35)	1519	5	22	52	02:36	1.4e-07	1.0e-06	3.9e-07
NGrec train (300,4)	(1e-03,1e-04)	0.068	18(37)	1500	16	31	01:06	09:45	2.6e-07	1.0e-06	4.8e-06
NGrec train (300,4)	(5e-04,5e-05)	0.124	18(35)	1542	48	28	01:07	09:02	2.9e-07	1.0e-06	5.4e-06
NGsci train (300,4)	(5e-03,5e-04)	0.011	17(30)	1387	10	21	54	10:00	1.7e-07	1.0e-06	3.5e-08
NGsci train (300,4)	(1e-03,1e-04)	0.086	16(32)	1389	40	32	01:01	08:57	5.1e-07	1.0e-06	2.6e-06
NGsci train (300,4)	(5e-04,5e-05)	0.152	16(32)	965	206	37	42	18:10	4.1e-07	9.9e-07	2.9e-06
NGtalk train (300,3)	(5e-03,5e-04)	0.026	18(32)	2445	13	26	01:41	13:24	2.6e-07	1.0e-06	1.9e-06
NGtalk train (300,3)	(1e-03,1e-04)	0.103	17(32)	2448	52	19	01:46	13:26	1.4e-07	1.0e-06	4.4e-06
NGtalk train (300,3)	(5e-04,5e-05)	0.204	17(38)	2385	213	28	01:45	19:43	7.2e-08	1.0e-06	4.9e-06

Table 3: Performance of PPDNA (P), ADMM (A), and MGL (M) on stock price data.

Problem $(p, K)$	$(λ_{1}, λ_{2})$	Density	Iteration P	Iteration A	Iteration M	Time P	Time A	Time M	Error P	Error A	Error M
SPX500 (100,3)	(1e-04,1e-05)	0.039	22(33)	3644	6	04	22	28	1.1e-07	1.0e-06	9.8e-07
SPX500 (100,3)	(5e-05,5e-06)	0.138	22(35)	3646	8	05	23	25	1.5e-07	1.0e-06	8.4e-06
SPX500 (100,3)	(2e-05,2e-06)	0.238	23(43)	2056	18	08	13	08	4.4e-07	1.0e-06	2.5e-05
SPX500 (200,3)	(1e-04,1e-05)	0.025	24(31)	1409	8	12	23	01:20	8.8e-08	1.0e-06	9.4e-06
SPX500 (200,3)	(5e-05,5e-06)	0.084	21(28)	1239	17	14	20	04:23	9.4e-08	1.0e-06	9.0e-06
SPX500 (200,3)	(2e-05,2e-06)	0.150	20(38)	1363	32	18	23	03:28	1.4e-07	9.9e-07	2.0e-05
SPX500 (100,11)	(5e-04,5e-05)	0.030	22(29)	3701	11	12	01:01	02:20	4.3e-08	1.0e-06	5.4e-06
SPX500 (100,11)	(1e-04,1e-05)	0.127	22(30)	3722	105	18	01:21	02:34	9.3e-08	1.0e-06	7.6e-06
SPX500 (100,11)	(5e-05,5e-06)	0.206	22(30)	2925	393	21	01:06	09:12	7.8e-07	1.0e-06	7.8e-06
SPX500 (200,11)	(5e-04,5e-05)	0.018	19(24)	1096	28	31	53	36:07	8.4e-07	1.0e-06	4.1e-06
SPX500 (200,11)	(1e-04,1e-05)	0.082	19(24)	1125	481	49	01:08	01:27:17	7.9e-07	1.0e-06	5.1e-06
SPX500 (200,11)	(5e-05,5e-06)	0.140	19(27)	1101	1258	01:05	01:08	03:00:00	6.1e-07	1.0e-06	1.7e-05

Equations (142)

\min\limits_{\Theta}~{}\sum^{K}_{k=1}\Big{(}-\log\det\,\Theta^{(k)}+\langle S^{(k)},\Theta^{(k)}\rangle\Big{)}+\mathcal{P}(\Theta),

\begin{array}[]{l}\mathcal{P}(\Theta)\displaystyle=\lambda_{1}\sum^{K}_{k=1}\sum_{i\neq j}|\Theta^{(k)}_{ij}|+\lambda_{2}\sum_{i\neq j}\Big{(}\sum^{K}_{k=1}|{\Theta^{(k)}_{ij}}|^{2}\Big{)}^{1/2},\end{array}

\mathcal{P}(\Theta)=\sum_{i\neq j}\varphi(\Theta_{[ij]})\quad\hbox{with}\quad\varphi(x)=\lambda_{1}\|x\|_{1}+\lambda_{2}\|x\|,\quad\forall\,x\in\mathbb{R}^{K},

\Psi_{g}(u):=\min_{u^{\prime}}\Big\{g(u^{\prime})+\frac{1}{2}\|u^{\prime}-u\|^{2}\Big\},\quad\forall\,u\in\mathcal{E}

{\rm Prox}_{g}(u):=\arg\min_{u^{\prime}}\Big\{g(u^{\prime})+\frac{1}{2}\|u^{\prime}-u\|^{2}\Big\},\quad\forall\,u\in\mathcal{E}.

\nabla\Psi_{g}(u)=u-{\rm Prox}_{g}(u),\quad\forall\,u\in\mathcal{E}.

{\rm Prox}_{\sigma g}(u)+\sigma\,{\rm Prox}_{\sigma^{-1}g^{*}}(u/\sigma)=u,\quad\forall\,u\in\mathcal{E},\ \sigma>0,

\mathcal{P}(\Theta)=\sum_{i\neq j}\varphi(\Theta_{[ij]})\quad\hbox{with}\quad\varphi(x)=\lambda_{1}\|x\|_{1}+\lambda_{2}\|x\|,\quad\forall\,x\in\mathbb{R}^{K}.

\begin{array}[]{l}{\rm Prox}_{\mathcal{P}}(X)=\arg\underset{\Theta\in\mathbb{Z}}{\min}\left\{\mathcal{P}(\Theta)+\frac{1}{2}\|\Theta-X\|^{2}\right\}\\[8.53581pt] =\arg\underset{{\Theta\in\mathbb{Z}}}{\min}\left\{\sum_{i\neq j}\left\{\varphi(\Theta_{[ij]})+\frac{1}{2}\|\Theta_{[ij]}-X_{[ij]}\|^{2}\right\}+\frac{1}{2}{\sum_{i}}\|\Theta_{[ii]}-X_{[ii]}\|^{2}\right\}.\end{array}

({\rm Prox}_{\mathcal{P}}(X))_{[ij]}=\left\{\begin{array}[]{ll}{\rm Prox}_{\varphi}(X_{[ij]}),&\hbox{if $i\neq j$},\\[2.84526pt] X_{[ii]},&\hbox{if $i=j$}.\end{array}\right.

\begin{array}[]{l}\widehat{\partial}{\rm Prox}_{\varphi}(u)\\[5.69054pt] =\left\{(I-\Sigma)\Lambda\in\mathbb{S}^{K}\big{|}\,v={\rm Prox}_{\lambda_{1}\|\cdot\|_{1}}(u),\,\Sigma\in\partial\Pi_{\mathbb{B}_{\lambda_{2}}}(v),\,\Lambda\in\partial{\rm Prox}_{\lambda_{1}\|\cdot\|_{1}}(u)\right\},\end{array}

\displaystyle\left\{\begin{array}[]{l}\mathcal{W}\in\widehat{\partial}{\rm Prox}_{\mathcal{P}}(X)\hbox{ if and only if $\exists$\;$M^{(ij)}\in\widehat{\partial}{\rm Prox}_{\varphi}(X_{[ij]})$ $\forall\;i<j$},\\[2.84526pt] \hbox{such that}\ (\mathcal{W}[Y])_{[ij]}=\left\{\begin{array}[]{ll}M^{(ij)}Y_{[ij]},&\hbox{if $i<j$},\\[1.42262pt] Y_{[ii]},&\hbox{if $i=j$},\\[2.27621pt] M^{(ji)}Y_{[ij]},&\hbox{if $j<i$},\end{array}\right.\,\,i,j=1,\ldots,p,\,\,\,\forall\,Y\in\mathbb{Z}.\end{array}\right.

{\rm Prox}_{\mathcal{P}}(Y)-{\rm Prox}_{\mathcal{P}}(X)-\mathcal{W}[Y-X]=O(\|Y-X\|^{2}),\quad\forall\,\mathcal{W}\in\partial{\rm Prox}_{\mathcal{P}}(Y).

h(X):=\left\{\begin{array}[]{ll}-\log\det\,X,&\hbox{if $X\in\mathbb{S}^{p}_{++}$,}\\[2.84526pt] +\infty,&\hbox{otherwise}.\end{array}\right.

\phi^{+}_{\beta}(x):=\big(\sqrt{x^{2}+4\beta}+x\big)/2,\quad\phi^{-}_{\beta}(x):=\big(\sqrt{x^{2}+4\beta}-x\big)/2,\quad\forall\,x\in\mathbb{R}.

\phi^{+}_{\beta}(A):=Q\,{\rm Diag}(\phi^{+}_{\beta}(d_{1}),\dots,\phi^{+}_{\beta}(d_{p}))\,Q^{T},\quad\phi^{-}_{\beta}(A):=Q\,{\rm Diag}(\phi^{-}_{\beta}(d_{1}),\dots,\phi^{-}_{\beta}(d_{p}))\,Q^{T}.

\begin{array}[]{l}\phi^{+}_{\beta}(A)={\rm Prox}_{\beta h}(A)=\underset{B\in\mathbb{S}^{p}_{++}}{\arg\min}\big{\{}h(B)+\frac{1}{2\beta}\|B-A\|^{2}\big{\}},\\[11.38109pt] \Psi_{\beta h}(A)=\underset{B\in\mathbb{S}^{p}_{++}}{\min}\left\{\beta h(B)+\frac{1}{2}\|B-A\|^{2}\right\}=-\beta\log\det(\phi^{+}_{\beta}(A))+\frac{1}{2}\|\phi^{-}_{\beta}(A)\|^{2}.\end{array}

(\phi^{+}_{\beta})^{\prime}(A)[B]=Q\big(\Gamma\odot(Q^{T}BQ)\big)Q^{T},

\Gamma_{ij}=\frac{\phi^{+}_{\beta}(d_{i})+\phi^{+}_{\beta}(d_{j})}{(d_{i}^{2}+4\beta)^{1/2}+(d_{j}^{2}+4\beta)^{1/2}},\quad i,j=1,2,\dots,p.

f(\Theta):=\sum_{k=1}^{K}\Big(h(\Theta^{(k)})+\langle S^{(k)},\Theta^{(k)}\rangle\Big),\quad\Theta\in\mathbb{Z}.

\begin{array}[]{rl}\min\limits_{\Omega,\Theta}&\displaystyle\left\{f(\Omega)+\mathcal{P}(\Theta)\,|\,\Omega-\Theta=0\right\}.\end{array}

\mathcal{L}(\Omega,\Theta,X):=f(\Omega)+\mathcal{P}(\Theta)+\langle\Omega-\Theta,X\rangle,\quad(\Omega,\Theta,X)\in\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}

\max\limits_{X}~{}\sum^{K}_{k=1}\big{(}\log\det(X^{(k)}+S^{(k)})+p\big{)}-\mathcal{P}^{*}(X).

\begin{array}[]{c}X+{\rm Prox}_{f^{*}}(\Omega-X)=0,\,\,-X+{\rm Prox}_{\mathcal{P}^{*}}(\Theta+X)=0,\,\,\Omega-\Theta=0.\end{array}

\begin{array}[]{rl}\min\limits_{\Omega,\Theta}&\displaystyle f(\Omega)+\mathcal{P}(\Theta)-\langle(U,V),(\Omega,\Theta)\rangle\\[5.69054pt] {\rm s.t.}&\Omega-\Theta+W=0.\end{array}

\begin{array}[]{l}\mathcal{T}_{\mathcal{L}}(\Omega,\Theta,X)\\[5.69054pt] :=\{(U,V,W)\in\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}\,|\,(U,V,-W)\in\partial{\mathcal{L}}(\Omega,\Theta,X)\},\,\,\,(\Omega,\Theta,X)\in\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}.\end{array}

\mathcal{S}(U,V,W):=\mathcal{T}_{\mathcal{L}}^{-1}(U,V,W)=\hbox{the set of all KKT points for problem \eqref{model-MGL-P}}.

\mathcal{H}((U,V,W),(\Omega,\Theta,X))=\left(\begin{array}[]{c}X-U+{\rm Prox}_{f^{*}}(\Omega-X+U)\\[2.84526pt] -X-V+{\rm Prox}_{\mathcal{P}^{*}}(\Theta+X+V)\\[2.84526pt] \Omega-\Theta+W\end{array}\right).

\begin{array}[]{l}\|\mathcal{S}(U,V,W)-\mathcal{S}(U^{\prime},V^{\prime},W^{\prime})\|\\[4.2679pt] \leq\kappa\|(U,V,W)-(U^{\prime},V^{\prime},W^{\prime})\|,\,\,\forall\,(U,V,W),(U^{\prime},V^{\prime},W^{\prime})\in\mathcal{N}.\end{array}

{\mathcal{G}}(\Delta\Omega,\Delta\Theta,\Delta X)=\left(\begin{array}[]{c}\Delta X+\mathcal{G}_{f^{*}}(\Delta\Omega-\Delta X)\\[2.84526pt] -\Delta X+\mathcal{G}_{\mathcal{P}^{*}}(\Delta\Theta+\Delta X)\\[2.84526pt] \Delta\Omega-\Delta\Theta\end{array}\right),\,\,\,\forall\,(\Delta\Omega,\Delta\Theta,\Delta X)\in\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}.

Full text

Yangjing Zhang, Department of Mathematics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076.

Ning Zhang (corresponding author), College of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China; Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, China. This author is supported in part by the National Natural Science Foundation of China under Grant 11901083.

Defeng Sun, Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, China. This author is supported in part by the Hong Kong Research Grant Council under Grant PolyU 153014/18P.

Kim-Chuan Toh, Department of Mathematics, and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076. This author is supported in part by the Academic Research Fund of the Ministry of Education of Singapore under Grant R-146-000-257-112.

(July 22, 2020)

Abstract

Undirected graphical models have been especially popular for learning the conditional independence structure among a large number of variables where the observations are drawn independently and identically from the same distribution. However, many modern statistical problems would involve categorical data or time-varying data, which might follow different but related underlying distributions. In order to learn a collection of related graphical models simultaneously, various joint graphical models inducing sparsity in graphs and similarity across graphs have been proposed. In this paper, we aim to propose an implementable proximal point dual Newton algorithm (PPDNA) for solving the group graphical Lasso model, which encourages a shared pattern of sparsity across graphs. Though the group graphical Lasso regularizer is non-polyhedral, the asymptotic superlinear convergence of our proposed method PPDNA can be obtained by leveraging on the local Lipschitz continuity of the Karush-Kuhn-Tucker solution mapping associated with the group graphical Lasso model. A variety of numerical experiments on real data sets illustrates that the PPDNA for solving the group graphical Lasso model can be highly efficient and robust.

Keywords. Group Graphical Lasso, Proximal Point Algorithm, Semismooth Newton Method, Lipschitz Continuity

AMS Subject Classification. 90C22, 90C25, 90C31, 62J10.

1 Introduction

Let $w^{(k)}\in\mathbb{R}^{n_{k}\times p},\,k=1,2,\ldots,K$ be $K$ given data matrices. For each $k=1,2,\ldots,K$ , the rows of $w^{(k)}$ are observations drawn independently from a Gaussian distribution with mean zero, and the empirical covariance matrix for $w^{(k)}$ is given by $S^{(k)}=(1/n_{k})(w^{(k)})^{T}w^{(k)}$ . In this paper, we consider the following joint graphical model:

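$$\min_{\Theta}\ \sum_{k=1}^{K}\Big(-\log\det\Theta^{(k)}+\langle S^{(k)},\Theta^{(k)}\rangle\Big)+\mathcal{P}(\Theta),\qquad(1.1)$$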

where $\Theta=\big{(}\Theta^{(1)},\Theta^{(2)},\dots,\Theta^{(K)}\big{)}\in\mathbb{S}^{p}\times\mathbb{S}^{p}\times\cdots\times\mathbb{S}^{p}$ is the decision variable, and $\mathcal{P}$ is a convex penalty term that can promote certain desired structure in the decision variable $\Theta$ . Throughout this paper, we assume that the solution set to problem (1.1) is nonempty.
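As a concrete illustration of the smooth part of problem (1.1), the sketch below forms the empirical covariance matrices $S^{(k)}$ from zero-mean data and evaluates $\sum_{k}\big(-\log\det\Theta^{(k)}+\langle S^{(k)},\Theta^{(k)}\rangle\big)$. It is a minimal sketch rather than the implementation used in the paper: the function names `empirical_covariance` and `smooth_objective`, the random test data, and the use of a Cholesky factorization to detect positive definiteness are assumptions made for illustration.

```python
import numpy as np

def empirical_covariance(w):
    """S = (1/n) w^T w for a data matrix w (n x p) whose rows have mean zero."""
    n_k = w.shape[0]
    return (w.T @ w) / n_k

def smooth_objective(Theta, S):
    """Evaluate sum_k ( -log det Theta^(k) + <S^(k), Theta^(k)> ).

    Theta and S are sequences of K symmetric p x p matrices.
    Returns +inf if some Theta^(k) is not positive definite.
    """
    total = 0.0
    for Theta_k, S_k in zip(Theta, S):
        try:
            # Cholesky succeeds only for positive definite input
            # (the input is assumed symmetric).
            L = np.linalg.cholesky(Theta_k)
        except np.linalg.LinAlgError:
            return np.inf
        log_det = 2.0 * np.sum(np.log(np.diag(L)))
        total += -log_det + np.sum(S_k * Theta_k)  # <S, Theta> = trace(S Theta)
    return total

# Hypothetical data: K = 3 classes, p = 4 variables, different sample sizes.
rng = np.random.default_rng(0)
data = [rng.standard_normal((n_k, 4)) for n_k in (30, 40, 50)]
S = [empirical_covariance(w) for w in data]
Theta = [np.eye(4) for _ in S]
value = smooth_objective(Theta, S)
```

With every $\Theta^{(k)}$ equal to the identity, the log-determinant terms vanish and the value reduces to $\sum_{k}{\rm trace}(S^{(k)})$, which gives a quick sanity check.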

If $K=1$ and $\mathcal{P}(\cdot)=\lambda\|\cdot\|_{1}$ , problem (1.1) reduces to the well-known sparse Gaussian graphical model which has been studied by various researchers (e.g., [1, 2, 10, 14, 27, 33, 36]). In many applications, a single Gaussian graphical model is typically enough to capture the conditional independence structure of the random variables. However, in some situations it is more reasonable to fit a collection of such models jointly, due to the similarity or heterogeneity of the data involved. These models for estimating multiple precision matrices jointly are referred to as joint graphical models in [8]. A scenario where joint graphical models are more suitable than a single graphical model is when the data comes from several distinct but closely related classes, which share the same collection of variables but differ in terms of the dependency structures. Their dependency graphs can have common edges across a portion of all classes and unique edges restricted to only certain classes. In this case, fitting separate graphical models for distinct classes does not exploit the similarity among the dependency graphs. In contrast, joint estimation of these models could exploit information across different but related classes. In addition to the data from different classes, another scenario that would favor joint graphical models over a single graphical model is when the data contains sequences of multivariate time-stamped observations. Such data might correspond to a series of dependency graphs over time. Next, we give two practical applications of joint graphical models, which will also be used in our numerical experiments:

- The inference of word relationships from webpages or newsgroups: the webpages from the computer science departments of various universities are classified into several classes (Student, Faculty, Course, Project, etc.), and the 20 newsgroups are grouped into various topics.

- The inference of time-varying dependency structures of stocks: the dependency structures among the Standard & Poor’s 500 component stocks might change smoothly over time.

In summary, there are two major applications of the joint graphical models: (i) estimating multiple precision matrices jointly for a collection of variables across distinct classes; (ii) inferring the time-varying networks and finding the change-points.

For solving problem (1.1) with different forms of penalty terms, the alternating direction method of multipliers (ADMM) has been extensively used; see, e.g., [8, 12, 13]. The ADMM can be a fast first order method for finding approximate solutions of low or moderate accuracy. However, to attain superlinear convergence and compute highly accurate solutions, one has to incorporate, at least in part, the second order information of the problem. Yang et al. [34] proposed a proximal Newton-type method, where the subproblem in each iteration can be solved by the nonmonotone spectral projected gradient method [19, 32], and an active set identification scheme was applied to reduce the cost. Another notable contribution of [34] is a screening rule, which can be combined with any method to reduce the computational cost. However, the second order method in [34] is not without drawbacks. Each of its subproblems is a complicated quadratic approximation problem, which generally requires expensive computations. Besides, the inexact proximal Newton-type method proposed in [34] has no guarantee of local linear convergence. It is worth noting that, in a recent paper related to [34], Yue, Zhou, and So [37] studied the local convergence rate of a family of inexact proximal Newton-type methods for solving a class of nonsmooth convex composite optimization problems based on an error bound condition. However, it is not clear to us whether the convergence analysis in [37] can be directly applied to problem (1.1), as the Hessian of the first function in the objective of the problem is not uniformly bounded on its effective domain. More recently, Zhang et al. [38] applied a regularized proximal point algorithm (rPPA) to solve a fused multiple graphical Lasso (FGL) model and heavily exploited the underlying second order information through the semismooth Newton method when solving the subproblems of the rPPA. Due to the polyhedral property of the FGL regularizer, the rPPA for solving the FGL problem is proven to have an arbitrary linear convergence rate in [38].

Our goal in this paper is to design and analyze an efficient second order information based algorithm with economical implementations and a fast convergence rate for solving problem (1.1) with the following non-polyhedral regularizer, which was referred to as the group graphical Lasso (GGL) regularizer in [8]:

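$$\mathcal{P}(\Theta)=\lambda_{1}\sum_{k=1}^{K}\sum_{i\neq j}|\Theta^{(k)}_{ij}|+\lambda_{2}\sum_{i\neq j}\Big(\sum_{k=1}^{K}|\Theta^{(k)}_{ij}|^{2}\Big)^{1/2},\qquad(1.2)$$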

where $\lambda_{1}$ and $\lambda_{2}$ are positive parameters. We refer to model (1.1) with the regularizer (1.2) as the GGL model. In fact, the GGL regularizer acting on a collection of matrices can be viewed as an extension of the sparse group Lasso regularizer [11, 28] acting on a vector. The former can be regarded as the latter if the $(i,j)$ -th elements across all $K$ precision matrices are assigned into one group. For $1\leq i,j\leq p$ , we let $\Theta_{[ij]}:=[\Theta^{(1)}_{ij};\ldots;\Theta^{(K)}_{ij}]\in\mathbb{R}^{K}$ be the column vector obtained by taking out the $(i,j)$ -th elements across all $K$ matrices $\Theta^{(k)},\,k=1,2,\dots,K$ . We can observe that

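$$\mathcal{P}(\Theta)=\sum_{i\neq j}\varphi(\Theta_{[ij]})\quad\hbox{with}\quad\varphi(x)=\lambda_{1}\|x\|_{1}+\lambda_{2}\|x\|,\quad\forall\,x\in\mathbb{R}^{K},\qquad(1.3)$$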

where the function $\varphi$ is actually a special sparse group Lasso regularizer. The first term of the GGL regularizer promotes sparsity in the $K$ estimated precision matrices $\Theta^{(k)}$ ’s. The zeros in these precision matrices tend to occur at the same indices due to the second term of the GGL regularizer. In addition, Figure 1 illustrates the structure of the decision variable $\Theta$ and the vector belonging to one group $\Theta_{[ij]}$ .
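The equivalence between the matrix-wise form (1.2) and the group-wise form (1.3) of the GGL regularizer can be checked numerically. The following sketch stores $\Theta$ as an array of shape $(K,p,p)$ and evaluates the penalty both ways; the function names, the storage layout, and the test values are assumptions chosen for illustration, not part of the paper.

```python
import numpy as np

def ggl_penalty_matrixwise(Theta, lam1, lam2):
    """Form (1.2): l1 term over all off-diagonal entries plus a group term."""
    K, p, _ = Theta.shape
    off = ~np.eye(p, dtype=bool)          # mask of off-diagonal positions (i != j)
    entries = Theta[:, off]               # shape (K, p*(p-1)); column = one group
    l1_part = np.abs(entries).sum()
    group_part = np.sqrt((entries ** 2).sum(axis=0)).sum()
    return lam1 * l1_part + lam2 * group_part

def ggl_penalty_groupwise(Theta, lam1, lam2):
    """Form (1.3): sum over i != j of phi(Theta_[ij])."""
    K, p, _ = Theta.shape
    total = 0.0
    for i in range(p):
        for j in range(p):
            if i != j:
                x = Theta[:, i, j]        # the vector Theta_[ij] in R^K
                total += lam1 * np.abs(x).sum() + lam2 * np.linalg.norm(x)
    return total

# Small hand-checkable case: K = 2, p = 2, so Theta_[12] = Theta_[21] = (3, 4).
small = np.array([[[1.0, 3.0], [3.0, 1.0]],
                  [[1.0, 4.0], [4.0, 1.0]]])
small_value = ggl_penalty_matrixwise(small, 0.1, 0.01)

# Random symmetric case: K = 3, p = 5.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5, 5))
Theta = A + np.transpose(A, (0, 2, 1))
```

In the small case each off-diagonal group equals $(3,4)$, so $\varphi(x)=7\lambda_{1}+5\lambda_{2}$, and the two symmetric positions give $2(0.7+0.05)=1.5$ for $\lambda_{1}=0.1$, $\lambda_{2}=0.01$. The diagonal entries do not contribute.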

Inspired by the impressive numerical performance of the rPPA for solving the FGL model [38], we will design a proximal point dual Newton algorithm (PPDNA) for solving the GGL model. Specifically, a proximal point algorithm (PPA) [24] is applied to the primal formulation of the GGL model, and a superlinearly convergent semismooth Newton method is designed to solve the dual formulations of the PPA subproblems. Thanks to the fact that the GGL regularizer is an extension of the sparse group Lasso regularizer, the generalized Jacobian of the proximal mapping of the GGL regularizer can be characterized based on that of the sparse group Lasso regularizer, where the explicit form was given in [39]. As a result, the former naturally inherits the structured sparsity (referred to as the second order sparsity) of the latter. Consequently, multiplying a sparse Hessian matrix by a vector in the semismooth Newton method is reasonably cheap, and one could expect that the superlinearly convergent semismooth Newton method is numerically efficient for solving the PPA subproblems. In addition to achieving low cost in computing the semismooth Newton directions by exploiting the second order sparsity, we also establish the linear convergence guarantee of the PPDNA.

Though the framework of the PPDNA for solving the GGL model is closely related to the rPPA for solving the FGL model [38] and the semismooth Newton based augmented Lagrangian method (SSNAL) for solving the sparse group Lasso problems [39], both the theoretical analysis and numerical implementation should be further investigated owing to the following difficulties of the GGL model. First, unlike the FGL regularizer, the GGL regularizer is a non-polyhedral function and consequently the Lipschitz continuity of the Karush-Kuhn-Tucker (KKT) solution mapping associated with the GGL model is not as straightforward to establish as in [38]. We should mention here that the Lipschitz continuity of the KKT solution mapping plays an important role in establishing the convergence rate of the PPDNA, just as in the case of rPPA and SSNAL. Second, the subproblem of the PPDNA for solving the GGL model differs from those of the SSNAL and rPPA which are strongly convex. Therefore, the stopping criteria previously used in SSNAL and rPPA are no longer applicable. The main contributions of this paper can be summarized as follows.

1. We prove the Lipschitz continuity of the KKT solution mapping associated with the GGL model, by taking advantage of the strict convexity of the function $-\log\det(\cdot)$ in its effective domain, the nonsingularity of its Jacobian, and Clarke’s implicit function theorem [5, 6]. Consequently, the linear convergence of the iterative sequence generated by the PPDNA can be established based on the classical results in [24]. Moreover, by choosing the penalty parameter to be sufficiently large, the PPDNA can be made to attain any desired linear convergence rate. More generally, the Lipschitz continuity of the KKT solution mapping of the model still holds even if the GGL regularizer is replaced by any other convex positively homogeneous function.

2. We derive a surrogate generalized Jacobian of the proximal mapping of the GGL regularizer. The second order sparsity in the surrogate generalized Jacobian is analyzed in depth and fully exploited in the PPDNA. Therefore, the superlinearly (or even quadratically) convergent semismooth Newton method can solve the PPA subproblems very efficiently since the semismooth Newton directions can be computed cheaply.

3. We introduce fairly easy-to-check stopping criteria (via the duality theory) for computing inexact solutions of the PPA subproblems without sacrificing the global or linear convergence of the PPDNA. In fact, the standard stopping criteria adopted by Rockafellar [24] would involve the unknown optimal values of the subproblems, which are not easy to check unless the objective function is strongly convex with an explicitly given strong convexity parameter.

The remaining parts of the paper are organized as follows. Section 2 presents some definitions and preliminary results, which include the proximal mapping of the GGL regularizer, its generalized Jacobian, the proximal mapping of the log-determinant function and its derivative. We analyze in section 3 the Lipschitz continuity of the KKT solution mapping associated with the GGL model, which is the key property for deriving the linear convergence rate of our proposed algorithm. In section 4, we propose the PPDNA for solving the GGL model and investigate its convergence properties. We report the numerical performance of the PPDNA on categorical text data and time-varying stock prices data in section 5 and conclude the paper in section 6.

Notation. The following notation will be used in the rest of the paper.

• $\mathbb{S}^{p}_{+}$ ($\mathbb{S}^{p}_{++}$) denotes the cone of positive semidefinite (definite) matrices in the space of $p\times p$ real symmetric matrices $\mathbb{S}^{p}$. For any $A,\,B\in\mathbb{S}^{p}$, we write $A\succeq B$ if $A-B\in\mathbb{S}^{p}_{+}$ and $A\succ B$ if $A-B\in\mathbb{S}^{p}_{++}$. In particular, $A\succeq 0$ ($A\succ 0$) indicates that $A\in\mathbb{S}^{p}_{+}$ ($A\in\mathbb{S}^{p}_{++}$).

• We let ${\mathbb{Z}}\;({\mathbb{Z}}_{+},\;{\mathbb{Z}}_{++})$ be the Cartesian product of $K$ copies of $\mathbb{S}^{p}\;(\mathbb{S}^{p}_{+},\,\mathbb{S}^{p}_{++})$.

• For any matrix $A$, $A_{ij}$ denotes the $(i,j)$-th element of $A$.

• For any $X:=(X^{(1)},\dots,X^{(K)})\in\mathbb{Z}$, $X_{[ij]}:=[X^{(1)}_{ij};\ldots;X^{(K)}_{ij}]\in\mathbb{R}^{K}$ denotes the column vector obtained by taking out the $(i,j)$-th elements across all $K$ matrices $X^{(k)},\,k=1,2,\dots,K$.

• $I_{n}$ denotes the $n\times n$ identity matrix, and $I$ denotes an identity matrix or map when the dimension is clear from the context.

• We use $\lambda_{\max}(\mathcal{A})$ to denote the largest eigenvalue of a self-adjoint linear operator $\mathcal{A}$.

• For a given closed convex set $\Omega$ and a vector $x$, we denote the Euclidean projection of $x$ onto $\Omega$ by $\Pi_{\Omega}(x):=\arg\min_{x^{\prime}\in{\Omega}}\{\|x-x^{\prime}\|\}$.

• We denote ${\rm ceil}(x)$ as the smallest integer greater than or equal to $x\in\mathbb{R}$.
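The notation above can be made concrete with a small array-based sketch. It assumes (as an illustration only, not as the authors' implementation) that an element of $\mathbb{Z}$ is stored as a NumPy array of shape `(K, p, p)`; the helper names `entry_vector` and `project_psd` are hypothetical, and the positive semidefinite cone is used merely as one example of a closed convex set for $\Pi_{\Omega}$.

```python
import math
import numpy as np

K, p = 3, 4
rng = np.random.default_rng(0)

# Store X = (X^(1), ..., X^(K)) as one array of shape (K, p, p) (an assumed layout).
X = rng.standard_normal((K, p, p))
X = (X + X.transpose(0, 2, 1)) / 2  # symmetrize so that every X^(k) lies in S^p


def entry_vector(X, i, j):
    """Return X_[ij] in R^K: the (i, j)-th entries of all K matrices (0-based indices)."""
    return X[:, i, j]


def project_psd(A):
    """Euclidean (Frobenius) projection of a symmetric matrix onto the cone S^p_+."""
    d, Q = np.linalg.eigh(A)
    return (Q * np.maximum(d, 0.0)) @ Q.T  # clip negative eigenvalues at zero


v = entry_vector(X, 0, 2)
n_extra = math.ceil(7 / 4)  # ceil(x): smallest integer greater than or equal to x
```

Because each slice is symmetric, `entry_vector(X, i, j)` and `entry_vector(X, j, i)` coincide, which is why only the pairs with $i<j$ need to be processed later.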

2 Preliminaries

In this section, we first recall the definition and some relevant properties of the Moreau-Yosida regularization of a proper and closed convex function, which will play an important role in the subsequent theoretical analysis and algorithmic design. Let $\mathcal{E}$ be a finite dimensional real Hilbert space and $g:\,\mathcal{E}\rightarrow{\mathbb{R}}\cup\{+\infty\}$ be a proper and closed convex function. The Moreau-Yosida regularization [21, 35] of $g$ is defined by

[TABLE]

and the proximal mapping of $g$ , the unique minimizer of (2.1), is given by

[TABLE]

One critical property of the Moreau-Yosida regularization is that $\Psi_{g}(\cdot)$ is a continuously differentiable convex function with the following gradient:

[TABLE]

In addition, the proximal mapping satisfies the following Moreau identity [25, Theorem 31.5]:

[TABLE]

where $g^{*}$ is the conjugate function of $g$ (see e.g., [25] for its definition).
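As a numerical sanity check of these properties, the sketch below uses the hypothetical choice $g=\lambda\|\cdot\|_{1}$, whose proximal mapping is soft-thresholding and whose conjugate is the indicator of a box. It assumes the unit-parameter forms $\Psi_{g}(x)=\min_{z}\{g(z)+\frac{1}{2}\|z-x\|^{2}\}$, $\nabla\Psi_{g}(x)=x-{\rm Prox}_{g}(x)$, and ${\rm Prox}_{g}(x)+{\rm Prox}_{g^{*}}(x)=x$; the displayed equations in the paper may carry an additional scaling parameter.

```python
import numpy as np


def prox_l1(x, lam):
    """Proximal mapping of g = lam * ||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)


def prox_l1_conjugate(x, lam):
    """Proximal mapping of g^*, the indicator of the box {z : ||z||_inf <= lam}."""
    return np.clip(x, -lam, lam)


def moreau_envelope_l1(x, lam):
    """Psi_g(x) = min_z { g(z) + 0.5 * ||z - x||^2 }, evaluated at z = Prox_g(x)."""
    z = prox_l1(x, lam)
    return lam * np.abs(z).sum() + 0.5 * np.sum((z - x) ** 2)


x = np.array([-2.0, -0.3, 0.0, 0.7, 1.5])
lam = 0.5

# Moreau identity (unit-parameter form): Prox_g(x) + Prox_{g^*}(x) - x should vanish.
identity_gap = prox_l1(x, lam) + prox_l1_conjugate(x, lam) - x

# Central finite differences of Psi_g compared with the formula x - Prox_g(x).
h = 1e-6
fd_grad = np.array([
    (moreau_envelope_l1(x + h * e, lam) - moreau_envelope_l1(x - h * e, lam)) / (2 * h)
    for e in np.eye(x.size)
])
grad_gap = fd_grad - (x - prox_l1(x, lam))
```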

2.1 Proximal mapping of the GGL regularizer and its generalized Jacobian

We investigate in this section the proximal mapping of the GGL regularizer $\mathcal{P}$ in (1.2) and its generalized Jacobian. Recall the function in (1.3):

[TABLE]

By definition, the proximal mapping of $\mathcal{P}$ is given as follows: for any $X\in\mathbb{Z}$ ,

[TABLE]

It is obvious that problem (2.3) is separable for each vector $\Theta_{[ij]}\in\mathbb{R}^{K}$ . Therefore, for any $i,j\in\{1,2,\ldots,p\}$ , the vector $({\rm Prox}_{\mathcal{P}}(X))_{[ij]}$ , consisting of all entries of ${\rm Prox}_{\mathcal{P}}(X)$ in the $(i,j)$ -th position, is given explicitly by

[TABLE]

By this equation, one can compute ${\rm Prox}_{\mathcal{P}}$ via performing $p(p-1)/2$ computations of ${\rm Prox}_{\varphi}$ , and this task can be done in parallel. Parts of the second order information of the underlying problem are contained in the generalized Jacobian of ${\rm Prox}_{\mathcal{P}}$ , which can be characterized by the generalized Jacobian of ${\rm Prox}_{\varphi}$ through using the relationship (2.4) between ${\rm Prox}_{\mathcal{P}}$ and ${\rm Prox}_{\varphi}$ . Fortunately, the generalized Jacobian of ${\rm Prox}_{\varphi}$ has been carefully investigated in [39] and has an explicit expression.
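The entrywise decomposition can be sketched as follows. The sketch assumes that $\varphi(x)=\lambda_{1}\|x\|_{1}+\lambda_{2}\|x\|$ on $\mathbb{R}^{K}$ (the usual sparse group Lasso form; equation (1.3) is not reproduced in this section), that only off-diagonal positions are penalized, and that ${\rm Prox}_{\varphi}$ is the composition of soft-thresholding with the group shrinkage $v-\Pi_{\mathbb{B}_{\lambda_{2}}}(v)$. The function names and the `(K, p, p)` storage layout are illustrative.

```python
import numpy as np


def prox_phi(u, lam1, lam2):
    """Prox of phi(x) = lam1*||x||_1 + lam2*||x||_2 on R^K (assumed form of phi)."""
    v = np.sign(u) * np.maximum(np.abs(u) - lam1, 0.0)  # Prox of lam1*||.||_1
    nv = np.linalg.norm(v)
    if nv <= lam2:
        return np.zeros_like(v)          # v lies in the ball B_{lam2}, so v - Pi(v) = 0
    return (1.0 - lam2 / nv) * v         # v - Pi_{B_lam2}(v)


def prox_ggl(X, lam1, lam2):
    """Apply prox_phi to every off-diagonal vector X_[ij]; X has shape (K, p, p)."""
    K, p, _ = X.shape
    out = X.copy()                       # diagonal entries are left unchanged (assumption)
    for i in range(p):
        for j in range(i + 1, p):        # p(p-1)/2 independent small problems
            w = prox_phi(X[:, i, j], lam1, lam2)
            out[:, i, j] = w
            out[:, j, i] = w             # keep every slice symmetric
    return out
```

The double loop is written for clarity; since the $p(p-1)/2$ problems are independent, they can be vectorized or distributed across workers.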

Let the multifunction $\widehat{\partial}{\rm Prox}_{\varphi}:\,\mathbb{R}^{K}\rightrightarrows\mathbb{S}^{K}$ be the generalized Jacobian of ${\rm Prox}_{\varphi}$ . Directly from the formula (10) in [39], the multifunction $\widehat{\partial}{\rm Prox}_{\varphi}$ can be described as follows: for any $u\in\mathbb{R}^{K}$ ,

[TABLE]

where ${\mathbb{B}_{\lambda_{2}}}:=\{v\in{\mathbb{R}^{K}}\,|\,\|v\|\leq\lambda_{2}\}$ , $\partial\Pi_{\mathbb{B}_{\lambda_{2}}}$ and $\partial{\rm Prox}_{\lambda_{1}\|\cdot\|_{1}}$ are the Clarke generalized Jacobians (see [5, Definition 2.6.1] for the definition) of $\Pi_{\mathbb{B}_{\lambda_{2}}}$ and ${\rm Prox}_{\lambda_{1}\|\cdot\|_{1}}$ , respectively. Therefore, the surrogate generalized Jacobian $\widehat{\partial}{\rm Prox}_{\mathcal{P}}(X):\,\mathbb{Z}\rightrightarrows\mathbb{Z}$ of ${\rm Prox}_{\mathcal{P}}$ at any given $X$ can be described as follows:

[TABLE]

The next proposition will explain why $\widehat{\partial}{\rm Prox}_{\mathcal{P}}(X)$ in (2.11) can be treated as the surrogate generalized Jacobian of ${\rm Prox}_{\mathcal{P}}$ at $X$ . Based on [39, Theorem 3.1], one can easily prove the proposition. We omit the details here.

Proposition 2.1.

Let $\mathcal{P}$ be the GGL regularizer defined by (1.2) and $X\in\mathbb{Z}$ be any given element. The surrogate generalized Jacobian $\widehat{\partial}{\rm Prox}_{\mathcal{P}}(\cdot)$ defined in (2.11) is nonempty compact valued and upper semicontinuous. Any element in the set $\widehat{\partial}{\rm Prox}_{\mathcal{P}}(X)$ is a self-adjoint and positive semidefinite operator. Moreover, we have that, for any $Y\to X$ ,

[TABLE]

2.2 Properties of the log-determinant function

In this subsection, we present some properties on the proximal mapping of the following log-determinant function $h$ and its derivative that are mainly adopted from the papers [31, 33]:

[TABLE]

Let $\beta>0$ be given. Define the following scalar functions:

[TABLE]

In addition, for any $A\in\mathbb{S}^{p}$ with eigenvalue decomposition $A=Q{\rm Diag}(d_{1},\ldots,d_{p})Q^{T}$ , we define

[TABLE]

One can observe that $\phi^{+}_{\beta}(A)$ and $\phi^{-}_{\beta}(A)$ are positive definite for any $A\in\mathbb{S}^{p}$ . Using the functions defined above, the following two propositions give the proximal mapping of the log-determinant function $h$ and its derivative.

Proposition 2.2.

[33, Proposition 2.3] Let $h$ be the log-determinant function defined by (2.12) and $\beta$ be a positive scalar. Then, for any $A\in\mathbb{S}^{p}$, it holds that

[TABLE]
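A minimal numerical sketch of this proximal mapping follows. It assumes $h(A)=-\log\det A$ on $\mathbb{S}^{p}_{++}$ and the scalar functions $\phi^{\pm}_{\beta}(x)=(\sqrt{x^{2}+4\beta}\pm x)/2$, which is the form used in the cited literature; the displayed definitions are not reproduced in this section, so these formulas are stated as assumptions. Under them, $P=\phi^{+}_{\beta}(A)$ satisfies the optimality condition $P-\beta P^{-1}=A$ of $\min_{P\succ 0}\{\beta h(P)+\frac{1}{2}\|P-A\|^{2}\}$.

```python
import numpy as np


def phi_plus(A, beta):
    """phi^+_beta(A) = Q Diag((sqrt(d_i^2 + 4*beta) + d_i) / 2) Q^T for symmetric A."""
    d, Q = np.linalg.eigh(A)
    return (Q * ((np.sqrt(d ** 2 + 4.0 * beta) + d) / 2.0)) @ Q.T


def phi_minus(A, beta):
    """phi^-_beta(A) = Q Diag((sqrt(d_i^2 + 4*beta) - d_i) / 2) Q^T for symmetric A."""
    d, Q = np.linalg.eigh(A)
    return (Q * ((np.sqrt(d ** 2 + 4.0 * beta) - d) / 2.0)) @ Q.T


rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = (A + A.T) / 2                    # an arbitrary (possibly indefinite) symmetric matrix
beta = 0.7

P = phi_plus(A, beta)
M = phi_minus(A, beta)
# Residual of the optimality condition P - beta * P^{-1} = A.
optimality_residual = np.linalg.norm(P - beta * np.linalg.inv(P) - A)
```

Both outputs are positive definite even when $A$ is indefinite, and they satisfy $P-M=A$ and $PM=\beta I$, so $M=\beta P^{-1}$ is available without an explicit inverse.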

Proposition 2.3.

[31, Lemma 2.1 (b)] Let $\beta$ be a given positive scalar. The function $\phi^{+}_{\beta}:\,\mathbb{S}^{p}\to\mathbb{S}^{p}$ is continuously differentiable. For any $A\in\mathbb{S}^{p}$ with eigenvalue decomposition $A=Q{\rm Diag}(d_{1},\dots,d_{p})Q^{T}$, the derivative $(\phi^{+}_{\beta})^{\prime}(A)[B]$ at any $B\in\mathbb{S}^{p}$ is given by

[TABLE]

where $\Gamma\in\mathbb{S}^{p}$ is defined by

[TABLE]
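The derivative formula can be checked against finite differences. The sketch below assumes the expressions $(\phi^{+}_{\beta})^{\prime}(A)[B]=Q\,[\Gamma\circ(Q^{T}BQ)]\,Q^{T}$ and $\Gamma_{ij}=\big(\phi^{+}_{\beta}(d_{i})+\phi^{+}_{\beta}(d_{j})\big)/\big(\sqrt{d_{i}^{2}+4\beta}+\sqrt{d_{j}^{2}+4\beta}\big)$, which follow from the divided differences of the scalar function $\phi^{+}_{\beta}(x)=(\sqrt{x^{2}+4\beta}+x)/2$; the paper's displayed formulas are not reproduced here, so this form is an assumption.

```python
import numpy as np


def phi_plus(A, beta):
    """phi^+_beta applied to the eigenvalues of the symmetric matrix A."""
    d, Q = np.linalg.eigh(A)
    return (Q * ((np.sqrt(d ** 2 + 4.0 * beta) + d) / 2.0)) @ Q.T


def phi_plus_derivative(A, B, beta):
    """Directional derivative of phi^+_beta at A along B (assumed Gamma formula)."""
    d, Q = np.linalg.eigh(A)
    s = np.sqrt(d ** 2 + 4.0 * beta)
    f = (s + d) / 2.0                                        # phi^+_beta(d_i)
    Gamma = (f[:, None] + f[None, :]) / (s[:, None] + s[None, :])
    return Q @ (Gamma * (Q.T @ B @ Q)) @ Q.T                 # Hadamard product with Gamma


def finite_difference(A, B, beta, h=1e-6):
    """Central difference approximation of the same directional derivative."""
    return (phi_plus(A + h * B, beta) - phi_plus(A - h * B, beta)) / (2.0 * h)
```

Writing $\Gamma$ with a sum in the denominator avoids the division by $d_{i}-d_{j}$ that a naive divided difference would need, so repeated eigenvalues cause no difficulty.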

3 Lipschitz continuity of the KKT solution mapping

In this section, we will prove that the KKT solution mapping associated with the GGL problem is Lipschitz continuous. More generally, we emphasize that the Lipschitz continuity of the KKT solution mapping still holds even if the GGL regularizer is replaced by any other convex positively homogeneous function, since the key properties we need from the regularizer $\mathcal{P}$ are convexity and positive homogeneity.

The analysis in this section is based on Clarke’s implicit function theorem. For notational convenience, we denote

[TABLE]

Then the GGL problem (1.1) can be equivalently reformulated as follows:

[TABLE]

The Lagrangian function associated with problem (3.2) is given by

[TABLE]

and the dual problem of (3.2) is easily shown to be

[TABLE]

In addition, the KKT system associated with (3.2) and (3.3) is given by

[TABLE]

Since the log-determinant function $h$ is strictly convex and the solution set to problem (1.1) is assumed to be nonempty, problem (3.2) has a unique solution. Furthermore, by using [3, Proposition 4.75] one can easily show that the KKT system (3.4) also has a unique solution, denoted by $(\overline{\Omega},\overline{\Theta},\overline{X})$ .

For any given $(U,V,W)\in\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}$ , we consider the following linearly perturbed form of problem (3.2)

[TABLE]

As in Rockafellar [23], we define the following maximal monotone operator:

[TABLE]

We also define the KKT solution mapping $\mathcal{S}:\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}\to\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}$ as

[TABLE]

Define a mapping $\mathcal{H}:(\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z})\times(\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z})\to\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}$ as follows: for any $(U,V,W)\in\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}$ and $(\Omega,\Theta,X)\in\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}$ ,

[TABLE]

Then it is easy to see that if $\mathcal{S}(U,V,W)$ is nonempty, then it must be a singleton and satisfies $\mathcal{H}((U,V,W),\mathcal{S}(U,V,W))=0.$

Lemma 3.1.

Let $Z\in\mathbb{Z}$ and $f$ be defined by (3.1). Then all $\mathcal{G}_{f}\in\partial{\rm Prox}_{f}(Z)$ and $\mathcal{G}_{f^{*}}\in\partial{\rm Prox}_{f^{*}}(Z)$ are self-adjoint and positive definite with $\lambda_{\max}({\mathcal{G}}_{f})<1$ and $\lambda_{\max}(\mathcal{G}_{f^{*}})<1$ .

Proof.

The proof can be derived from [31, Lemma 2.1]. ∎

Since the GGL regularizer $\mathcal{P}$ defined by (1.2) is positively homogeneous, its conjugate function $\mathcal{P}^{*}$ is an indicator function of a closed convex set [26, Example 11.4(a)]. Therefore, ${\rm Prox}_{\mathcal{P}^{*}}$ is the projection onto a closed convex set. We know further from [30, Theorem 2.3] that for any $Y\in\mathbb{Z}$ , any element in $\partial{\rm Prox}_{\mathcal{P}^{*}}(Y)$ is a self-adjoint operator whose eigenvalues are in the interval $[0,1]$ . Thus, by the proof of [30, Theorem 2.5], we can obtain the following lemma, which will be used in Theorem 3.1 to analyze the Lipschitz continuity of the KKT solution mapping $\mathcal{S}$ defined by (3.6).

Lemma 3.2.

Let $Y\in\mathbb{Z}$ and $\mathcal{B}:\mathbb{Z}\to\mathbb{Z}$ be any self-adjoint positive definite operator. Then, for any chosen $\mathcal{G}_{\mathcal{P}^{*}}\in\partial{\rm Prox}_{\mathcal{P}^{*}}(Y)$ , the linear operator $I-\mathcal{G}_{\mathcal{P}^{*}}+\mathcal{G}_{\mathcal{P}^{*}}\mathcal{B}$ is nonsingular.

The next theorem will play an essential role in establishing the linear rate of convergence of our proposed proximal point dual Newton algorithm (PPDNA) for solving the GGL problems in section 4.3.

Theorem 3.1.

Let $\mathcal{S}:\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}\to\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}$ be the KKT solution mapping defined by (3.6). Then the following hold:

(a) Any element in $\partial_{(\Omega,\Theta,X)}\mathcal{H}((0,0,0),(\overline{\Omega},\overline{\Theta},\overline{X}))$ is nonsingular. Here, we say $\mathcal{G}\in\partial_{(\Omega,\Theta,X)}\mathcal{H}((0,0,0),(\overline{\Omega},\overline{\Theta},\overline{X}))$ if for some linear operator $\mathcal{M}$, it holds that $(\mathcal{M},\mathcal{G})\in\partial\mathcal{H}((0,0,0),(\overline{\Omega},\overline{\Theta},\overline{X}))$.

(b) The mapping $\mathcal{S}$ is Lipschitz continuous near the origin; i.e., there exist a neighborhood $\mathcal{N}$ of the origin and a positive scalar $\kappa$ such that $\mathcal{S}(U,V,W)\neq\emptyset$ for any $(U,V,W)\in\mathcal{N}$ and

[TABLE]

Proof.

Since ${\rm Prox}_{\mathcal{P}}$ is directionally differentiable, we know from the Moreau identity (2.2) that ${\rm Prox}_{\mathcal{P}^{*}}$ is also directionally differentiable. Therefore, it follows from the chain rule presented in [29, Lemma 2.1] that for any $\mathcal{G}\in\partial_{(\Omega,\Theta,X)}\mathcal{H}((0,0,0),(\overline{\Omega},\overline{\Theta},\overline{X}))$ , there exist $\mathcal{G}_{f^{*}}\in\partial{\rm Prox}_{f^{*}}(\overline{\Omega}-\overline{X})$ and $\mathcal{G}_{\mathcal{P}^{*}}\in\partial{\rm Prox}_{\mathcal{P}^{*}}(\overline{\Theta}+\overline{X})$ such that

[TABLE]

Suppose that there exists $(\Delta\Omega,\Delta\Theta,\Delta X)\in\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}$ such that ${\mathcal{G}}(\Delta\Omega,\Delta\Theta,\Delta X)=0$ , i.e.,

[TABLE]

It follows from Lemma 3.1 that both $\mathcal{G}_{f^{*}}$ and $\mathcal{B}:=\mathcal{G}^{-1}_{f^{*}}-I$ are self-adjoint and positive definite. This, together with (3.9), implies that

[TABLE]

where $\mathcal{C}:=I-\mathcal{G}_{\mathcal{P}^{*}}+\mathcal{G}_{\mathcal{P}^{*}}\mathcal{B}.$ We know from Lemma 3.2 that $\mathcal{C}$ is nonsingular. This, together with (3.9) and (3.10), implies that

[TABLE]

Therefore, $\mathcal{G}$ is nonsingular, and consequently the statement $(a)$ holds.

The global Lipschitz continuities of the proximal mappings ${\rm Prox}_{f^{*}}$ and ${\rm Prox}_{\mathcal{P}^{*}}$ imply that the mapping $\mathcal{H}$ defined by (3.7) is Lipschitz continuous. Therefore, the proof of $(b)$ can be obtained by $(a)$ , the fact that for any $(U,V,W)\in\mathbb{Z}\times\mathbb{Z}\times\mathbb{Z}$ , the set $\mathcal{S}(U,V,W)$ must be a singleton if it is nonempty, and Clarke’s implicit function theorem [5, Page 256] or [6, Theorem 3.6]. The proof is completed. ∎

4 Proximal point dual Newton algorithm

We aim to develop an implementable proximal point dual Newton algorithm (PPDNA) for solving the GGL problem (3.2). The PPDNA is essentially a proximal point algorithm (PPA) for solving the primal form of the GGL model, and the PPA subproblems are solved via their corresponding dual problems. The dual of each subproblem is to maximize a concave function whose gradient is a semismooth function and thus can be solved by the semismooth Newton method. We begin this section by introducing the PPA [24], i.e., given $\Omega_{0},\,\Theta_{0}\in\mathbb{Z}_{++}$ and $\sigma_{0}>0$ , the updating scheme is given by

[TABLE]

where

[TABLE]

with $\delta_{\mathbb{F}}$ being the indicator function of the set $\mathbb{F}:=\{(\Omega,\Theta)\in\mathbb{Z}\times\mathbb{Z}\,|\,\Omega-\Theta=0\}$ , i.e., $\delta_{\mathbb{F}}(\Omega,\Theta)=0,$ if $(\Omega,\Theta)\in\mathbb{F}$ , and $\delta_{\mathbb{F}}(\Omega,\Theta)=\infty,$ if $(\Omega,\Theta)\notin\mathbb{F}.$

We allow for inexactness in the updating scheme (4.1) and apply the standard criteria proposed by Rockafellar [24] for controlling the inexactness: given nonnegative summable sequences $\{\varepsilon_{t}\}$ and $\{\gamma_{t}\}$ such that $\gamma_{t}<1$ for all $t\geq 0$ ,

[TABLE]

Since the exact minimizer $P_{t}(\Omega_{t},\Theta_{t})$ is typically unknown in each iteration, we should introduce practically implementable stopping criteria in place of (A) and (B) in the subsequent analysis.
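To illustrate the proximal point iteration itself, here is a one-dimensional toy sketch with exact subproblem solutions. The objective $\lambda|x|+\frac{1}{2}(x-a)^{2}$ and all parameter values are hypothetical and unrelated to the GGL model; the point is only that each step minimizes the objective plus the proximal term $\frac{1}{2\sigma_{t}}(x-x_{t})^{2}$.

```python
def soft_threshold(x, tau):
    """Scalar soft-thresholding: sign(x) * max(|x| - tau, 0)."""
    return max(abs(x) - tau, 0.0) * (1.0 if x > 0 else -1.0)


def ppa_step(x_t, sigma, a=3.0, lam=1.0):
    """Exact minimizer of lam*|x| + 0.5*(x - a)^2 + (x - x_t)^2 / (2*sigma)."""
    c = 1.0 + 1.0 / sigma                  # curvature of the combined quadratic terms
    return soft_threshold((a + x_t / sigma) / c, lam / c)


def run_ppa(x0, sigma0=2.0, growth=1.0, iters=30):
    """Run the PPA with sigma_{t+1} = growth * sigma_t and return all iterates."""
    x, sigma, history = x0, sigma0, [x0]
    for _ in range(iters):
        x = ppa_step(x, sigma)
        sigma *= growth
        history.append(x)
    return history


# The minimizer of the toy objective is soft_threshold(a, lam) = 2.0.
history = run_ppa(10.0)
```

In this toy problem the error contracts by the factor $1/(1+\sigma_{t})$ per step once the iterates are positive, so larger $\sigma_{t}$ gives faster convergence, at the price of a harder subproblem in realistic settings.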

4.1 Dual based semismooth Newton method for solving PPA subproblems

In this section, we aim to design the aforementioned dual based semismooth Newton method for solving the subproblems of the PPA framework (4.1):

[TABLE]

The Lagrangian function associated with problem (4.3) is given by

[TABLE]

and the Lagrangian dual problem of (4.3) is

[TABLE]

Since we aim to solve problem (4.3) via its dual problem (4.4), a natural idea is to find the explicit expression for the dual objective function $\Upsilon_{t}$ first. The explicit expression for $\Upsilon_{t}$ can be obtained from the Moreau–Yosida regularization as follows:

[TABLE]

The last equality is achieved when $\Omega^{(k)}=\phi_{\sigma_{t}}^{+}\big{(}\Omega_{t}^{(k)}-\sigma_{t}(S^{(k)}+X^{(k)})\big{)}$ , for $k=1,2,\ldots,K$ , and $\Theta={\rm Prox}_{\sigma_{t}\mathcal{P}}({\Theta}_{t}+\sigma_{t}X)$ . One can find that $\Upsilon_{t}$ is continuously differentiable and strongly concave, and the unique solution to problem (4.4) can be obtained by solving the following nonsmooth system

[TABLE]

where

[TABLE]

with

[TABLE]

We can see that if $X$ solves problem (4.4), i.e., $X=\arg\max_{X^{\prime}}\Upsilon_{t}(X^{\prime})$, then one has that $\Omega=\Theta$ with $\Omega^{(k)}=\phi_{\sigma_{t}}^{+}\big{(}W^{(k)}_{t}(X)\big{)}$, for $k=1,2,\ldots,K$, and $\Theta={\rm Prox}_{\sigma_{t}\mathcal{P}}(V_{t}(X))$. Recall that $\phi_{\sigma_{t}}^{+}(\cdot)$ is differentiable and its derivative is given by Proposition 2.3. Thus, the surrogate generalized Jacobian $\widehat{\partial}(\nabla\Upsilon_{t})(X){:\,\mathbb{Z}\rightrightarrows\mathbb{Z}}$ of $\nabla\Upsilon_{t}$ at any $X$ can be defined as follows:

[TABLE]

Based on the surrogate generalized Hessian $\widehat{\partial}(\nabla\Upsilon_{t})(\cdot)$ of $\Upsilon_{t}$ , one can apply the following dual based semismooth Newton method (Algorithm 1) for solving problem (4.4) via solving the nonsmooth equation (4.5). To know more about the local and global methods for nonsmooth equations, we refer the readers to [9, Sections 7 & 8] and references therein. The main computational cost of Algorithm 1 lies in Step 1 for finding the Newton direction. Therefore, we carefully analyze the second order sparsity structure in the surrogate generalized Jacobian and fully exploit the structure to reduce the cost. Due to the computation of $\phi_{\sigma_{t}}^{+}(\cdot)$ in $\Upsilon_{t}$ and $\nabla\Upsilon_{t}$ , the $j$ -th iteration of Algorithm 1 requires $Km_{j}$ computations of eigenvalue decompositions.
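The recovery of a primal pair from a dual iterate, as described above, can be sketched numerically. The sketch assumes $W_{t}(X)=\Omega_{t}-\sigma_{t}(S+X)$ and $V_{t}(X)=\Theta_{t}+\sigma_{t}X$ (read off from the formulas for $\Omega^{(k)}$ and $\Theta$ above), the assumed forms of $\phi^{+}_{\beta}$ and ${\rm Prox}_{\varphi}$ from section 2, and a `(K, p, p)` array layout. It is not the authors' Matlab implementation, and the data are synthetic.

```python
import numpy as np


def phi_plus(A, beta):
    """phi^+_beta(A) via one eigenvalue decomposition (assumed scalar formula)."""
    d, Q = np.linalg.eigh(A)
    return (Q * ((np.sqrt(d ** 2 + 4.0 * beta) + d) / 2.0)) @ Q.T


def prox_ggl(X, lam1, lam2):
    """Vectorized prox of the GGL regularizer on off-diagonal entries (assumed form)."""
    V = np.sign(X) * np.maximum(np.abs(X) - lam1, 0.0)   # soft-thresholding
    norms = np.linalg.norm(V, axis=0)                    # ||V_[ij]|| for every (i, j)
    safe = np.where(norms > 0.0, norms, 1.0)             # avoid division by zero
    out = np.maximum(1.0 - lam2 / safe, 0.0) * V         # group shrinkage
    idx = np.arange(X.shape[1])
    out[:, idx, idx] = X[:, idx, idx]                    # diagonal is not penalized
    return out


def primal_from_dual(X, Omega_t, Theta_t, S, sigma, lam1, lam2):
    """Return (Omega, Theta) attaining the inner minimization for a dual point X."""
    W = Omega_t - sigma * (S + X)
    V = Theta_t + sigma * X
    Omega = np.stack([phi_plus(W[k], sigma) for k in range(W.shape[0])])  # K eigendecompositions
    Theta = prox_ggl(V, sigma * lam1, sigma * lam2)      # Prox of sigma * P
    return Omega, Theta


K, p = 2, 4
rng = np.random.default_rng(1)
data = rng.standard_normal((K, 50, p))
S = np.stack([data[k].T @ data[k] / 50 for k in range(K)])   # sample covariance matrices
Omega_t = np.stack([np.eye(p)] * K)
Theta_t = Omega_t.copy()
X = np.zeros((K, p, p))
Omega, Theta = primal_from_dual(X, Omega_t, Theta_t, S, sigma=1.0, lam1=0.05, lam2=0.05)
residual = np.linalg.norm(Omega - Theta)   # vanishes exactly at the dual solution
```

Each call costs $K$ eigenvalue decompositions, which is consistent with the per-iteration cost discussed above; the residual $\|\Omega-\Theta\|$ measures how far $X$ is from solving the nonsmooth system.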

The following proposition states that Algorithm 1 for solving the dual of the PPA subproblem (4.4) is globally convergent and locally superlinearly convergent; the local rate is even quadratic if the parameter $\tau$ is chosen to be $1$.

Proposition 4.1.

Let $\{X_{t,j}\}_{j\geq 0}$ be the infinite sequence generated by Algorithm 1. Then $\{X_{t,j}\}_{j\geq 0}$ converges to the unique optimal solution $\overline{X}_{t}$ of (4.4), and the convergence rate is at least superlinear:

[TABLE]

Proof.

Since ${\rm Prox}_{\mathcal{P}}$ is directionally differentiable, it follows from Proposition 2.1 that ${\rm Prox}_{\mathcal{P}}$ is strongly semismooth with respect to the multifunction $\widehat{\partial}{\rm Prox}_{\mathcal{P}}$ in (2.11) (for its definition, see e.g., [17, Definition 1]). Therefore, the conclusion follows from the strong concavity of $\Upsilon_{t}(\cdot)$ , Proposition 2.3, and [17, Theorem 3]. ∎

4.2 Implementable stopping criteria for PPA subproblems

Due to the lack of explicit forms of the exact solution $P_{t}(\Omega_{t},\Theta_{t})$ , the stopping conditions (A) and (B) need to be replaced by some implementable conditions. Since $\Phi_{\sigma_{t}}$ defined by (4.2) is strongly convex with modulus $1/{2\sigma_{t}}$ , one has the estimate

[TABLE]

which implies that

[TABLE]

The unknown value $\inf\Phi_{\sigma_{t}}(\Omega,\Theta)$ can be replaced by any of its lower bounds converging to it. One choice is to consider the objective value of the dual problem (4.4). In particular, one has that

[TABLE]

and hence

[TABLE]

Therefore, we can terminate Algorithm 1 if $(\Omega_{t+1},\Theta_{t+1},X_{t+1})$ satisfies the following conditions: given nonnegative summable sequences $\{\varepsilon_{t}\}$ and $\{\gamma_{t}\}$ such that $\gamma_{t}<1$ for all $t\geq 0$ ,

[TABLE]

4.3 Linear rate convergence of PPDNA

Now, we are ready to formally present the promised PPDNA for solving problem (3.2).

Along the line of Rockafellar’s works [23, 24], the local linear convergence rate of the primal and dual iterative sequences generated by the PPA can be guaranteed by the Lipschitz continuity of the KKT solution mapping near the origin under proper stopping criteria of the PPA subproblems. However, the Lipschitz property of the KKT solution mapping requires the uniqueness of the KKT point, and this property is not straightforward to establish when the regularizer $\mathcal{P}$ is not a piecewise linear-quadratic function. As the property to ensure the linear convergence rate, especially the uniqueness assumption, is too restrictive to hold, Luque [20] extended the results and proved the local linear convergence of the PPA under the local upper Lipschitz continuity (see e.g., [22, p. 208] for the definition) of the KKT solution mapping at the origin [7, p. 387]. The local upper Lipschitz continuity condition does not make the assumption on the uniqueness of the solution. However, the local upper Lipschitz continuity property may not hold when the KKT solution mapping is not piecewise polyhedral. Fortunately, for our GGL model, the strict convexity of the log-determinant function guarantees the uniqueness of the solution, and we prove in Theorem 3.1 that the KKT solution mapping $\mathcal{S}$ of the GGL model (defined by (3.6)) is Lipschitz continuous near the origin by taking advantage of the nice properties of the log-determinant function and Clarke’s implicit function theorem. Therefore, the local linear convergence rate of the PPDNA can be obtained via the classical results by Rockafellar. The convergence results of Algorithm 2 for solving problem (3.2) are presented below.

Theorem 4.1.

Let $\{(\Omega_{t},\Theta_{t},X_{t})\}_{t\geq 0}$ be an infinite sequence generated by Algorithm 2 under stopping criterion ${\rm(A^{\prime})}$ . Then the sequence $\{(\Omega_{t},\Theta_{t})\}_{t\geq 0}$ converges to the unique solution $(\overline{\Omega},\overline{\Theta})$ of (3.2), and the sequence $\{X_{t}\}_{t\geq 0}$ converges to the unique solution $\overline{X}$ of (3.3). Furthermore, if both criteria ${\rm(A^{\prime})}$ and ${\rm(B^{\prime})}$ are executed in Algorithm 2, then there exists $\bar{t}\geq 0$ such that for all $t\geq\bar{t}$ , one has that

[TABLE]

where the convergence rate is given by

[TABLE]

and the parameter $\kappa$ is from (3.8).

Proof.

The global convergence of Algorithm 2 can be obtained from (4.6), [24, Theorem 1], and the uniqueness of the KKT point. The linear rate of convergence can be derived from (4.6), Theorem 3.1 ${\rm(b)}$ , and [24, Theorem 2]. The proof is completed. ∎

4.4 Extensions of PPDNA

Although the theoretical analysis and the algorithmic design presented in section 3 and section 4 focus on the GGL regularizer, these results can also be applied to the joint graphical model (1.1) with a different regularizer satisfying the following conditions:

(a) the regularizer is convex and positively homogeneous (e.g., a norm function);

(b) the proximal mapping associated with the regularizer can be efficiently computed and its surrogate generalized Jacobian can be explicitly characterized.

For example, we can show that both the pairwise fused graphical Lasso regularizer [8, Equation (5)] and the sequential fused graphical Lasso regularizer [34, Formula (2.2)] satisfy the conditions (a) and (b). More specifically, let $\lambda_{1}$ and $\lambda_{2}$ be positive parameters. The pairwise fused graphical Lasso regularizer and sequential fused graphical Lasso regularizer are given as follows:

[TABLE]

where $\varphi_{1}(x)=\lambda_{1}\|x\|_{1}+\lambda_{2}\sum_{k<k^{\prime}}|x_{k}-x_{k^{\prime}}|,\,\,x\in\mathbb{R}^{K}$ , and

[TABLE]

where $\varphi_{2}(x)=\lambda_{1}\|x\|_{1}+\lambda_{2}\sum^{K}_{k=2}|x_{k}-x_{k-1}|,\,\,x\in\mathbb{R}^{K}.$

By applying the same procedure as in section 2.1 for the GGL regularizer, we can obtain the proximal mappings associated with $\mathcal{P}_{i},\,i=1,2$ and their surrogate generalized Jacobians from that of the clustered Lasso regularizer [18] and the fused lasso regularizer [17], respectively. Therefore, we can apply our PPDNA framework for solving the joint graphical model with a different regularizer given by either (4.7) or (4.8).
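Condition (a) is easy to check numerically for the two fused penalties. The sketch below evaluates $\varphi_{1}$ and $\varphi_{2}$ exactly as defined above and contrasts them with the squared norm, which scales quadratically rather than linearly; the function names and test values are illustrative.

```python
def phi_pairwise(x, lam1, lam2):
    """phi_1(x) = lam1*||x||_1 + lam2 * sum_{k<k'} |x_k - x_k'|."""
    K = len(x)
    l1 = sum(abs(v) for v in x)
    fused = sum(abs(x[k] - x[m]) for k in range(K) for m in range(k + 1, K))
    return lam1 * l1 + lam2 * fused


def phi_sequential(x, lam1, lam2):
    """phi_2(x) = lam1*||x||_1 + lam2 * sum_{k=2}^{K} |x_k - x_{k-1}|."""
    l1 = sum(abs(v) for v in x)
    fused = sum(abs(x[k] - x[k - 1]) for k in range(1, len(x)))
    return lam1 * l1 + lam2 * fused


def squared_norm(x):
    """psi(x) = ||x||^2, which is convex but not positively homogeneous."""
    return sum(v * v for v in x)


x = [1.0, -2.0, 0.5]
x_scaled = [2.0 * v for v in x]
# Positive homogeneity: phi(2x) = 2*phi(x), whereas ||2x||^2 = 4*||x||^2.
ratio_pairwise = phi_pairwise(x_scaled, 0.1, 0.2) / phi_pairwise(x, 0.1, 0.2)
ratio_sequential = phi_sequential(x_scaled, 0.1, 0.2) / phi_sequential(x, 0.1, 0.2)
ratio_squared = squared_norm(x_scaled) / squared_norm(x)
```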

In addition to the direct extensions to the two regularizers above, the PPDNA framework is also applicable to joint graphical models with other regularizers discussed in [13]. More specifically, the regularizers in [13] have the following form:

[TABLE]

where $\mathcal{Q}_{1}(\Theta):=\lambda_{1}\sum^{K}_{k=1}\sum_{i\neq j}|\Theta^{(k)}_{ij}|,\,\,\mathcal{Q}_{2}(\Theta):=\lambda_{2}\sum^{K}_{k=2}\psi(\Theta^{(k)}-\Theta^{(k-1)}).$ All the choices of the penalty function $\psi$ in [13, Section 2.1] can ensure condition (a) except for the Laplacian penalty $\psi(\cdot)=\|\cdot\|^{2}$, which is not positively homogeneous. Therefore, except for the case of the Laplacian penalty, the PPDNA framework can be directly applied once condition (b) holds. For the exceptional case, we may slightly modify our framework. Specifically, each iteration should be modified as follows: given $\Omega_{0},\,\Theta_{0},\,\Lambda_{0}\in\mathbb{Z}_{++}$ and $\sigma_{0}>0$, the updating scheme is given by

[TABLE]

where

[TABLE]

with $\delta_{\mathbb{F}}$ being the indicator function of the set

[TABLE]

Then the resulting modified PPDNA can be obtained by using arguments similar to those in section 4. But we should mention that further investigation will be necessary to overcome the underlying difficulty that the dual of the subproblems of (4.9) may not necessarily be strongly convex. We leave this part as our future research topic.

5 Numerical results

In this section, we evaluate the performance of the PPDNA in comparison with the ADMM and the proximal Newton-type method implemented in the work [34], which is referred to as MGL (the solver is available at http://senyang.info/). All the experiments are performed in Matlab (version 9.7) on a Windows workstation (24-core, Intel Xeon E5-2680 @ 2.50GHz, 128 GB of RAM).

5.1 Implementation of ADMM

In this section, we briefly describe the ADMM for solving the dual problem (3.3), which can be equivalently written as follows:

[TABLE]

Given a parameter $\sigma>0$ , the augmented Lagrangian function associated with (5.1) is defined by

[TABLE]

and the KKT optimality conditions are

[TABLE]

The iterative scheme of the ADMM for problem (5.1) can be described as follows: given $\tau\in(0,(1+\sqrt{5})/2)$ and an initial point $(X_{0},Z_{0},\Theta_{0})\in\mathbb{Z}_{++}\times\mathbb{Z}_{++}\times\mathbb{Z}_{++}$ , the $t$ -th iteration is given by

[TABLE]

where $X_{t+1}$ can be updated by $X_{t+1}=(Z_{t}+{\sigma}^{-1}\Theta_{t}-S)-{\rm Prox}_{\mathcal{P}}(Z_{t}+{\sigma}^{-1}\Theta_{t}-S)$ , and $Z_{t+1}=(Z^{(1)}_{t+1},\dots,Z^{(K)}_{t+1})$ can be updated by $Z^{(k)}_{t+1}=\phi^{+}_{{\sigma}^{-1}}\big{(}X^{(k)}_{t}-\frac{1}{\sigma}\Theta^{(k)}_{t}+S^{(k)}\big{)}$ , $k=1,\ldots,K$ .

In the practical implementation, we tuned the parameter $\sigma$ according to the progress of primal and dual feasibilities (see e.g., [15, Section 4.4]) and used a larger step-length $\tau$ of $1.618$ . These two techniques can empirically accelerate the convergence speed. It is worth noting that the ADMM implemented by Yang et al. [34] used a fixed penalty parameter $\sigma$ and the step-length $\tau=1$ .

5.2 Settings of experiments

The experimental settings are the same as those in [38, Section 4]. We adopt the stopping criteria of PPDNA, ADMM and MGL as below. Let $\epsilon>0$ be a given tolerance. It is set as $10^{-6}$ in the following experiments.

• The PPDNA is terminated if $\eta_{P}\leq\epsilon$, where

[TABLE]

with

[TABLE]

• The ADMM is terminated when $\eta_{A}\leq\epsilon$ or $20000$ iterations are taken, where

[TABLE]

• The MGL is terminated when the relative difference of its objective value with respect to the primal objective value obtained by the PPDNA is smaller than the given tolerance $\epsilon$ or the relative duality gap achieved by the PPDNA, i.e.,

[TABLE]

where ${\rm pobj}_{P}$ and ${\rm pobj}_{M}$ are the primal objective values obtained by the PPDNA and the MGL, respectively, and ${\rm relgap}_{P}$ is the relative duality gap attained by the PPDNA.

It is worth mentioning that we adopt a warm-starting technique in the initial stage of the PPDNA, instead of starting it from scratch. The warm-starting procedure consists of first running the ADMM (with identity matrices as the starting point) for a fixed number of iterations (3000 steps in our experiments) or up to a given tolerance ( $100\epsilon$ in our experiments), and then using the resulting approximate solution as an initial point to warm-start the PPDNA. This idea is greatly motivated by two facts: 1) the ADMM can generate a solution of low to medium accuracy efficiently and might become slow when higher accuracy is required; 2) our algorithm PPDNA has been proven to be locally linearly convergent. Therefore, the warm-starting technique can integrate the advantages of both ADMM and PPDNA.

We set the initial parameter in the stopping criterion ${\rm(A^{\prime})}$ to be $\varepsilon_{0}=0.5$ and decrease it by a ratio $\varsigma>1$, i.e., $\varepsilon_{t+1}=\varepsilon_{t}/\varsigma$. Likewise, the parameter $\gamma_{t}$ in the stopping criterion ${\rm(B^{\prime})}$ is updated in the same fashion as $\varepsilon_{t}$. For the parameters in Algorithm 1, we simply set $\bar{\eta}=0.1$ and $\tau\in[0.1,0.2]$ according to [40]. For the line search step, we set $\mu=10^{-4}$ and $\rho=0.5$.
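The geometric tolerance update described above automatically yields a summable sequence, as the stopping criteria require. A minimal sketch follows; the ratio value `2.0` is a hypothetical choice, since the value of $\varsigma$ is not stated here.

```python
def tolerance_schedule(eps0, ratio, n):
    """Return the first n tolerances, each equal to the previous one divided by ratio."""
    if ratio <= 1.0:
        raise ValueError("ratio must exceed 1 for the sequence to be summable")
    eps = [eps0]
    for _ in range(n - 1):
        eps.append(eps[-1] / ratio)
    return eps


eps = tolerance_schedule(0.5, 2.0, 50)   # ratio 2.0 is an assumed, illustrative value
partial_sum = sum(eps)
geometric_bound = 0.5 * 2.0 / (2.0 - 1.0)   # eps0 * ratio / (ratio - 1)
```

Every partial sum stays below $\varepsilon_{0}\varsigma/(\varsigma-1)$, and starting from a value below one keeps all later terms below one as well.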

5.3 Descriptions of datasets

In this part, we describe the datasets which will be used later. Since these datasets have been discussed in [38], we briefly review them for the ease of reading:

• University webpages data set (http://ana.cachopo.org/datasets-for-single-label-text-categorization): The original data was collected from computer science departments of various universities in 1997, manually classified into seven different classes: Student, Faculty, Course, Project, Staff, Department, and Other. The data we use, consisting of the first four classes, is preprocessed by stemming techniques [4]. Two thirds of the pages were randomly chosen for training (Webtrain) and the remaining third for testing (Webtest).

• 20 newsgroups data set (http://qwone.com/~jason/20Newsgroups/): This data set has 20 topics of newsgroup documents, and some of the topics are closely related to each other, while others are highly unrelated. Four subgroups are named as NGcomp, NGrec, NGsci, and NGtalk accordingly and will be used in our experiments.

• SPX500 component stocks (www.yahoo.com): This data set contains the daily returns of Standard & Poor’s 500 (SPX500) constituents from 2004 to 2014. We also test on extracted data from 2004 to 2006.

5.4 Performance of PPDNA

In this part, we first give an elementary report of the effectiveness of the GGL model on synthetic nearest-neighbor networks generated by the mechanism in [16]. Second, we illustrate numerically the local linear convergence of the PPDNA for solving two representative instances, in correspondence with Theorem 4.1 which shows theoretically the local linear convergence of the PPDNA.

5.4.1 Synthetic data: nearest-neighbor networks

In this example, we choose $p=500$ and $K=3$ . The synthetic precision matrices, denoted as $\Sigma^{(k)},\,k=1,\dots,K$ , are generated as follows. We first generate $p$ points on a unit square randomly, calculate their pairwise distances, and identify $5$ nearest neighbors of each point. The nearest-neighbor network is then obtained by linking any two points that are $5$ nearest neighbors of each other, and we denote the number of its edges as $N$ . Subsequently, we obtain each $\Sigma^{(k)},\,k=1,2,3$ by adding extra edges to the common nearest-neighbor network. For each $k$ , a pair of symmetric zero elements is randomly selected from the nearest-neighbor network and replaced with a value uniformly drawn from the interval $[-1,-0.5]\cup[0.5,1]$ . $\Sigma^{(k)}$ is obtained after this procedure is repeated ceil $(N/4)$ times. We find in our simulation that the true number of edges in the three networks is $3690$ . Given the precision matrices, we draw $10000$ samples from each Gaussian distribution $\mathcal{N}_{p}(0,(\Sigma^{(k)})^{-1})$ to compute the sample covariance matrices. Next we specify the tuning parameters $\lambda_{1}$ and $\lambda_{2}$ . Following [8], we reparameterize $\lambda_{1}$ and $\lambda_{2}$ in order to separate the regularization for “sparsity” and for “similarity” since both parameters contribute to sparsity: $\lambda_{1}$ drives individual network edges to zero whereas $\lambda_{2}$ drives network edges to zero across all $K$ network estimates at the same time. We reparameterize them in terms of $w_{1}=\lambda_{1}+\frac{1}{\sqrt{2}}\lambda_{2},\,\,w_{2}=\frac{1}{\sqrt{2}}\lambda_{2}/(\lambda_{1}+\frac{1}{\sqrt{2}}\lambda_{2}),$ which are found in [8] to reflect the levels of sparsity and similarity regularization and are called the sparsity and similarity control parameters, respectively. In order to show the diversity of sparsity in our experiments, we change $w_{1}$ with $w_{2}$ fixed. 
Figure 2 characterizes the relative abilities of the GGL model to recover the network structures and to detect change-points.
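The first steps of the generation procedure and the reparameterization can be sketched as follows. Only the common edge structure is built (mutual 5-nearest-neighbor links between random points in the unit square); the assignment of edge values and any adjustment needed to obtain positive definite matrices follow the mechanism in [16] and are not reproduced. Function names and the random seed are illustrative.

```python
import math
import numpy as np


def mutual_knn_edges(points, m=5):
    """Boolean adjacency linking two points that are among each other's m nearest neighbors."""
    n = points.shape[0]
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))          # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)                   # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :m]        # indices of the m nearest neighbors
    is_nn = np.zeros((n, n), dtype=bool)
    is_nn[np.repeat(np.arange(n), m), nearest.ravel()] = True
    return is_nn & is_nn.T                           # keep mutual neighbors only


def lambdas_from_weights(w1, w2):
    """Invert w1 = lam1 + lam2/sqrt(2), w2 = (lam2/sqrt(2)) / w1."""
    return w1 * (1.0 - w2), math.sqrt(2.0) * w1 * w2


rng = np.random.default_rng(0)
p = 500
adj = mutual_knn_edges(rng.random((p, 2)))
N = int(np.triu(adj, 1).sum())        # number of edges in the common network
n_extra = math.ceil(N / 4)            # extra edges added to each of the K networks
lam1, lam2 = lambdas_from_weights(0.05, 0.2)   # hypothetical (w1, w2) values
```

Fixing $w_{2}$ and varying $w_{1}$ then moves $(\lambda_{1},\lambda_{2})$ along a ray, which changes the overall sparsity level while keeping the sparsity and similarity balance constant.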

Figure 2(a) displays the number of true positive (TP) edges selected against the number of false positive (FP) edges. We say that an edge $(i,j)$ is selected in the estimate $\overline{\Theta}^{(k)}$ if $\overline{\Theta}^{(k)}_{ij}\neq 0$, and the edge is true if $\Sigma^{(k)}_{ij}\neq 0$ and false if $\Sigma^{(k)}_{ij}=0$. We can see that the model with $w_{2}=0.2$ can recover almost all of the TP edges without FP edges. This suggests that the GGL model is effective for recovering the edges in the nearest-neighbor networks. Figure 2(b) illustrates the sum of squared errors between estimated edge values and true edge values, i.e., $\sum_{k=1}^{K}\sum_{i<j}\big{(}\overline{\Theta}^{(k)}_{ij}-\Sigma^{(k)}_{ij}\big{)}^{2}$. When the total number of selected edges increases (i.e., the sparsity control parameter $w_{1}$ decreases), the error decreases and finally reaches a fairly low value. Figure 2(c) plots the number of TP differential edges against the number of FP differential edges. An edge that differs between networks is called a differential edge and thus corresponds to a change-point. Numerically, we say that the $(i,j)$ edge is estimated to be differential between the $k$-th and the $(k+1)$-th networks if $|\overline{\Theta}^{(k)}_{ij}-\overline{\Theta}^{(k+1)}_{ij}|>10^{-6}$, and we say that it is truly differential if $|\Sigma^{(k)}_{ij}-\Sigma^{(k+1)}_{ij}|>10^{-6}$. The number of differential edges is computed for all successive pairs of networks. One can observe in Figure 2(c) that the results obtained with $w_{2}=0.2$ have approximately 3000 TP differential edges and almost no false ones. This suggests that the GGL model can be a suitable model to use in change-point detection of nearest-neighbor networks.
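The three quantities plotted in Figure 2 can be computed as in the sketch below, which follows the definitions in the text: edges are counted over pairs $i<j$, selection means a nonzero estimated entry, and differential edges use the threshold $10^{-6}$ on successive networks. The `(K, p, p)` array layout and function names are assumptions made for illustration.

```python
import numpy as np


def edge_counts(Theta_hat, Sigma):
    """Total numbers of true positive and false positive selected edges over all K networks."""
    K, p, _ = Theta_hat.shape
    iu = np.triu_indices(p, k=1)                 # pairs (i, j) with i < j
    tp = fp = 0
    for k in range(K):
        selected = Theta_hat[k][iu] != 0
        true = Sigma[k][iu] != 0
        tp += int(np.sum(selected & true))
        fp += int(np.sum(selected & ~true))
    return tp, fp


def sum_squared_error(Theta_hat, Sigma):
    """Sum over k and i < j of (Theta_hat^(k)_ij - Sigma^(k)_ij)^2."""
    iu = np.triu_indices(Theta_hat.shape[1], k=1)
    diff = Theta_hat[:, iu[0], iu[1]] - Sigma[:, iu[0], iu[1]]
    return float(np.sum(diff ** 2))


def differential_edge_counts(Theta_hat, Sigma, tol=1e-6):
    """TP and FP differential edges between successive networks k and k+1."""
    K, p, _ = Theta_hat.shape
    iu = np.triu_indices(p, k=1)
    tp = fp = 0
    for k in range(K - 1):
        est = np.abs(Theta_hat[k][iu] - Theta_hat[k + 1][iu]) > tol
        true = np.abs(Sigma[k][iu] - Sigma[k + 1][iu]) > tol
        tp += int(np.sum(est & true))
        fp += int(np.sum(est & ~true))
    return tp, fp
```

Sweeping $w_{1}$ with $w_{2}$ fixed and recording these counts for each fitted model would trace out curves of the kind shown in the three panels.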

5.4.2 Linear rate convergence

The purpose of this section is to demonstrate numerically the local linear convergence of the PPDNA. Specifically, we conduct experiments on two representative instances: (a) categorical data: Webtrain with $(p,K)=(300,4)$, $(\lambda_{1},\lambda_{2})=(5\text{e-}3,\,5\text{e-}4)$; (b) time-varying data: SPX500 with $(p,K)=(200,11)$, $(\lambda_{1},\lambda_{2})=(5\text{e-}4,\,5\text{e-}5)$. Due to the lack of exact optimal solutions of these instances, we run the PPDNA until the accuracy of $10^{-10}$ is achieved and regard the resulting approximate solution as the true solution $(\overline{\Omega},\overline{\Theta},\overline{X})$. We denote

[TABLE]

In Figure 3, we plot $\log_{10}d_{t}$ against the iteration count $t$ under two different choices of the penalty parameter $\sigma_{t}$: $\sigma_{t}$ is either fixed or increased by a constant ratio. When $\sigma_{t}$ is fixed, the solid blue line in the figure indicates that the convergence rate is almost constant. When $\sigma_{t+1}=1.3\sigma_{t}$, i.e., the penalty parameter is gradually increasing, the dash-dotted red line shows that the convergence becomes increasingly fast. These observations are consistent with Theorem 4.1 and numerically confirm the local linear convergence rate of the PPDNA. We should emphasize that the impressive linear convergence rate depicted by the solid blue curve in Figure 3(a) is attained with $\sigma_{t}$ fixed at a large value of $10^{8}$, whereas the slower initial convergence shown by the dash-dotted red curve is due to slowly increasing the parameter $\sigma_{t}$ from a small initial value of $2\times 10^{4}$. The same remark also applies to Figure 3(b).
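The qualitative difference between the two curves can be reproduced on a toy problem. The sketch below is not the PPDNA and does not use the GGL model: it applies the exact proximal point iteration to the one-dimensional quadratic $f(x)=\frac{a}{2}x^{2}$, for which the proximal step is $x_{t+1}=x_{t}/(1+\sigma_{t}a)$ and the minimizer is $0$. The function name and all parameter values are illustrative assumptions.

```python
import math

def ppa_log_distances(x0, curvature, sigma0, growth, n_iters):
    """log10 distance to the minimizer 0 of (curvature/2) * x**2 under exact PPA."""
    x, sigma = x0, sigma0
    log_dists = []
    for _ in range(n_iters):
        x = x / (1.0 + sigma * curvature)   # closed-form proximal step
        log_dists.append(math.log10(abs(x)))
        sigma *= growth                     # growth = 1.0 keeps sigma fixed
    return log_dists

# Fixed sigma: the log-distance drops by the same amount at every iteration.
fixed = ppa_log_distances(1.0, 1.0, sigma0=1.0, growth=1.0, n_iters=8)
# sigma_{t+1} = 1.3 * sigma_t: the drop per iteration keeps growing.
growing = ppa_log_distances(1.0, 1.0, sigma0=1.0, growth=1.3, n_iters=8)
```

With a fixed parameter the plot of `fixed` against the iteration index is a straight line, while `growing` bends downward, which mirrors the shapes of the solid and dash-dotted curves described above.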

The dependence of the linear rate of convergence on $\sigma_{t}$ also sheds light on the choice of $\sigma_{t}$ in our implementation. Basically we adaptively update $\sigma_{t}$ to strike a good balance in the trade-off between the convergence rate of the PPDNA and the difficulty in computing the Newton directions (via the CG method) in the semismooth Newton method (Step 1 of Algorithm 1). As the condition number of the Newton linear system in Step 1 of Algorithm 1 is proportional to $\sigma_{t}$ , the CG method will converge more slowly for a larger $\sigma_{t}$ . Thus in our experiments, we start from a small $\sigma_{0}$ , e.g., $\sigma_{0}=1$ , and gradually increase $\sigma_{t}$ by some factor $\zeta>1$ , i.e., $\sigma_{t+1}=\zeta\sigma_{t}$ .

5.5 Comparison with ADMM and MGL

In this section, we compare our algorithm PPDNA for solving the GGL model with the ADMM described in (5.2) and the MGL implemented in [34]. For the tuning parameters $\lambda_{1}$ and $\lambda_{2}$, we select three pairs for each instance that produce reasonable sparsity. In the following tables, “P” stands for PPDNA, “A” stands for ADMM, and “M” stands for MGL. In the column under “Iteration”, we report the number of iterations taken by each algorithm. In particular, for the PPDNA, we report the number of PPA iterations taken and the number (within the parentheses) of semismooth Newton linear systems solved. Let “nnz” denote the number of nonzero entries in the solution $\Theta$ obtained by the PPDNA, estimated as ${\rm nnz}:=\min\{k\,|\,\sum_{i=1}^{k}|\widehat{x}_{i}|\geq 0.999\|\widehat{x}\|_{1}\},$ where $\widehat{x}\in\mathbb{R}^{p^{2}K}$ is the vector obtained by sorting all elements of $\Theta$ by magnitude in descending order. In the tables, “density” denotes the quantity ${\rm nnz}/(p^{2}K)$. The time is displayed in the format “hours:minutes:seconds”, and the fastest method is highlighted in red. The error reported in the tables is the relative KKT residual $\eta_{P}$ for the PPDNA, $\eta_{A}$ for the ADMM, and $\Delta_{M}$ for the MGL.
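The estimate of nnz defined above amounts to finding the smallest number of largest-magnitude entries that carry at least 99.9% of the $\ell_{1}$ norm. A minimal sketch follows; the function name `estimate_nnz`, the array layout of shape $(K,p,p)$, and the convention of returning $0$ for an all-zero input are assumptions of this illustration.

```python
import numpy as np

def estimate_nnz(theta, mass=0.999):
    """Smallest k with sum of the k largest |entries| >= mass * ||theta||_1."""
    x = np.sort(np.abs(np.asarray(theta, dtype=float)).ravel())[::-1]
    total = x.sum()
    if total == 0.0:
        return 0                      # convention chosen here for a zero input
    cumulative = np.cumsum(x)
    # First index whose cumulative sum reaches the target, converted to a count.
    k = int(np.searchsorted(cumulative, mass * total, side="left")) + 1
    return min(k, x.size)

# Two 2x2 blocks (K = 2, p = 2); the 1e-9 entry is numerically negligible.
theta = np.array([[[10.0, 0.0], [0.0, 5.0]],
                  [[1e-9, 0.0], [0.0, 5.0]]])
nnz = estimate_nnz(theta)
density = nnz / theta.size            # theta.size equals p^2 * K
```

Thresholding on cumulative mass rather than on exact zeros makes the count insensitive to tiny entries left over by an iterative solver.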

Table 1 shows the comparison of the three methods PPDNA, ADMM, and MGL on the university webpages data sets. The PPDNA successfully solved all instances in Table 1 within about one minute. For a large majority of the tested instances, the PPDNA is faster than the ADMM and the MGL. This suggests that the PPDNA is robust and efficient for solving the GGL model applied to the university webpages data.

Table 2 presents the comparison of PPDNA, ADMM, and MGL on the 20 newsgroups data sets. One can see clearly that the PPDNA outperforms the ADMM and the MGL for most instances in Table 2. It demonstrates that the PPDNA can be efficient for solving the GGL model. For some difficult instances, e.g., NGcomp train with $(\lambda_{1},\lambda_{2})=(5\text{e-}4,\,5\text{e-}5)$, our PPDNA took less than one minute while the MGL took more than one hour. Again, the results show that our PPDNA is robust for solving the GGL model. The superior performance of our PPDNA can primarily be attributed to our ability to extract and exploit the sparsity structure (in $\widehat{\partial}{\rm Prox}_{\mathcal{P}}$) within the semismooth Newton method to solve the PPA subproblems very efficiently.

Table 3 gives the results on the Standard & Poor’s 500 component stock price data set SPX500. The table shows that the PPDNA is faster than both the ADMM and the MGL for all instances. In addition, we find that both the PPDNA and the ADMM succeeded in solving all instances, while the MGL failed to solve one of them within three hours. This might imply that the MGL is not robust for solving the GGL model when applied to the stock price data sets. The numerical results show convincingly that our algorithm PPDNA can solve the GGL problem highly efficiently and robustly.

6 Concluding remarks

In this paper, we have taken advantage of the ideas proposed in [17, 39] and applied a proximal point dual Newton algorithm (PPDNA) to the primal formulation of the group graphical Lasso problems. From a theoretical standpoint, we have shown that the PPDNA is globally convergent and that the sequence of primal and dual iterates is Q-linearly convergent, even though the group graphical Lasso regularizer is non-polyhedral. The robustness and superior numerical efficiency of the PPDNA are demonstrated in various numerical experiments. We therefore conclude that the PPDNA is not only a method with nice theoretical guarantees, but also a numerically efficient method for solving the group graphical Lasso problems with multiple precision matrices.

Bibliography

[1] O. Banerjee, L. E. Ghaoui, and A. d’Aspremont, Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data, Journal of Machine Learning Research, 9 (2008), pp. 485–516.
[2] M. Bollhöfer, A. Eftekhari, S. Scheidegger, and O. Schenk, Large-scale sparse inverse covariance matrix estimation, SIAM Journal on Scientific Computing, 41 (2019), pp. A380–A401.
[3] J. F. Bonnans and A. Shapiro, Perturbation Analysis of Optimization Problems, Springer Science & Business Media, 2000.
[4] A. Cardoso-Cachopo, Improving methods for single-label text categorization, PhD thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, 2007.
[5] F. H. Clarke, Optimization and Nonsmooth Analysis, vol. 5, SIAM, 1990.
[6] F. H. Clarke, Y. S. Ledyaev, R. J. Stern, and P. R. Wolenski, Nonsmooth Analysis and Control Theory, vol. 178, Springer, New York, NY, 1998.
[7] Y. Cui, D. F. Sun, and K.-C. Toh, On the R-superlinear convergence of the KKT residuals generated by the augmented Lagrangian method for convex composite conic programming, Mathematical Programming, 178 (2019), pp. 381–415.
[8] P. Danaher, P. Wang, and D. M. Witten, The joint graphical lasso for inverse covariance estimation across multiple classes, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76 (2014), pp. 373–397.
