A Proximal Point Dual Newton Algorithm for Solving Group Graphical Lasso Problems
Yangjing Zhang, Ning Zhang, Defeng Sun, Kim-Chuan Toh

TL;DR
This paper introduces an efficient proximal point dual Newton algorithm for solving the group graphical Lasso model, enabling simultaneous learning of multiple related graphical models with shared sparsity patterns.
Contribution
The paper proposes a novel PPDNA method for the non-polyhedral group graphical Lasso, achieving superlinear convergence and demonstrating high efficiency and robustness in experiments.
Findings
Superlinear convergence of PPDNA for group graphical Lasso.
High efficiency and robustness shown in numerical experiments.
Effective in learning multiple related graphical models.
Abstract
Undirected graphical models have been especially popular for learning the conditional independence structure among a large number of variables where the observations are drawn independently and identically from the same distribution. However, many modern statistical problems would involve categorical data or time-varying data, which might follow different but related underlying distributions. In order to learn a collection of related graphical models simultaneously, various joint graphical models inducing sparsity in graphs and similarity across graphs have been proposed. In this paper, we aim to propose an implementable proximal point dual Newton algorithm (PPDNA) for solving the group graphical Lasso model, which encourages a shared pattern of sparsity across graphs. Though the group graphical Lasso regularizer is non-polyhedral, the asymptotic superlinear convergence of our proposed…
| Problem | Density | Iteration | Time | Error | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| P | A | M | P | A | M | P | A | M | |||
| (1e-02,1e-03) | 0.016 | 16(24) | 501 | 4 | 02 | 04 | 10 | 3.2e-07 | 9.9e-07 | 2.9e-07 | |
| Webtest | (5e-03,5e-04) | 0.048 | 16(25) | 501 | 6 | 02 | 04 | 13 | 3.2e-07 | 9.9e-07 | 2.6e-07 |
| (100,4) | (1e-03,1e-04) | 0.225 | 14(22) | 529 | 32 | 02 | 04 | 56 | 2.4e-07 | 9.9e-07 | 1.0e-06 |
| (1e-02,1e-03) | 0.008 | 14(24) | 850 | 5 | 08 | 25 | 37 | 7.9e-07 | 1.0e-06 | 6.7e-08 | |
| Webtest | (5e-03,5e-04) | 0.026 | 14(27) | 679 | 7 | 10 | 19 | 50 | 7.9e-07 | 9.8e-07 | 4.1e-07 |
| (200,4) | (1e-03,1e-04) | 0.163 | 13(23) | 503 | 77 | 07 | 11 | 05:57 | 2.7e-07 | 9.9e-07 | 1.6e-06 |
| (5e-03,5e-04) | 0.016 | 14(29) | 744 | 8 | 28 | 33 | 02:39 | 5.6e-07 | 9.9e-07 | 5.3e-08 | |
| Webtest | (1e-03,1e-04) | 0.125 | 16(32) | 487 | 205 | 45 | 22 | 21:51 | 5.9e-07 | 9.9e-07 | 1.8e-06 |
| (300,4) | (5e-04,5e-05) | 0.256 | 14(35) | 668 | 1128 | 55 | 30 | 01:37:51 | 3.9e-07 | 9.9e-07 | 2.1e-06 |
| (1e-02,1e-03) | 0.012 | 20(34) | 1601 | 3 | 03 | 11 | 08 | 1.2e-07 | 1.0e-06 | 6.0e-06 | |
| Webtrain | (5e-03,5e-04) | 0.033 | 20(34) | 1601 | 5 | 03 | 12 | 04 | 1.2e-07 | 1.0e-06 | 7.4e-07 |
| (100,4) | (1e-03,1e-04) | 0.165 | 20(34) | 1601 | 22 | 04 | 13 | 43 | 1.2e-07 | 1.0e-06 | 7.8e-06 |
| (5e-03,5e-04) | 0.016 | 20(39) | 1325 | 7 | 13 | 27 | 01:18 | 1.3e-07 | 1.0e-06 | 1.3e-06 | |
| Webtrain | (1e-03,1e-04) | 0.108 | 20(37) | 1397 | 31 | 11 | 31 | 02:49 | 9.9e-08 | 1.0e-06 | 5.0e-06 |
| (200,4) | (5e-04,5e-05) | 0.219 | 20(39) | 1397 | 88 | 16 | 32 | 05:00 | 1.1e-07 | 1.0e-06 | 5.3e-06 |
| (5e-03,5e-04) | 0.011 | 20(62) | 1826 | 10 | 01:04 | 01:37 | 08:35 | 2.4e-07 | 9.7e-07 | 2.7e-06 | |
| Webtrain | (1e-03,1e-04) | 0.080 | 19(33) | 1196 | 45 | 21 | 52 | 09:55 | 3.4e-07 | 1.0e-06 | 6.0e-06 |
| (300,4) | (5e-04,5e-05) | 0.177 | 19(36) | 1196 | 134 | 36 | 52 | 13:00 | 3.7e-07 | 1.0e-06 | 6.0e-06 |
| Problem | Density | Iteration | Time | Error | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| P | A | M | P | A | M | P | A | M | |||
| NGcomp | (5e-03,5e-04) | 0.021 | 15(22) | 509 | 31 | 16 | 26 | 37:08 | 6.5e-08 | 9.9e-07 | 1.1e-06 |
| test | (1e-03,1e-04) | 0.099 | 16(26) | 625 | 510 | 32 | 34 | 01:20:37 | 7.9e-07 | 1.0e-06 | 2.0e-06 |
| (300,5) | (5e-04,5e-05) | 0.210 | 14(24) | 494 | 1481 | 40 | 27 | 03:00:00 | 7.2e-07 | 1.0e-06 | 6.1e-06 |
| NGrec | (5e-03,5e-04) | 0.004 | 21(38) | 1331 | 5 | 15 | 49 | 04:04 | 8.0e-08 | 1.0e-06 | 4.9e-07 |
| test | (1e-03,1e-04) | 0.063 | 21(39) | 1331 | 13 | 20 | 58 | 04:28 | 8.2e-08 | 1.0e-06 | 1.9e-06 |
| (300,4) | (5e-04,5e-05) | 0.143 | 20(37) | 1331 | 36 | 20 | 58 | 07:49 | 3.7e-07 | 1.0e-06 | 3.7e-06 |
| NGsci | (5e-03,5e-04) | 0.006 | 17(26) | 542 | 6 | 14 | 21 | 05:25 | 3.8e-07 | 1.0e-06 | 2.1e-06 |
| test | (1e-03,1e-04) | 0.075 | 17(27) | 553 | 21 | 19 | 24 | 11:17 | 3.9e-07 | 9.8e-07 | 2.1e-06 |
| (300,4) | (5e-04,5e-05) | 0.167 | 17(31) | 550 | 87 | 25 | 24 | 17:13 | 5.0e-07 | 9.9e-07 | 2.6e-06 |
| NGtalk | (5e-03,5e-04) | 0.026 | 15(25) | 482 | 16 | 24 | 26 | 26:08 | 9.2e-08 | 9.6e-07 | 4.1e-07 |
| test | (1e-03,1e-04) | 0.115 | 12(23) | 278 | 81 | 20 | 13 | 13:39 | 1.2e-07 | 9.9e-07 | 1.1e-06 |
| (300,3) | (5e-04,5e-05) | 0.240 | 11(22) | 286 | 337 | 25 | 13 | 40:28 | 9.4e-08 | 9.7e-07 | 2.2e-06 |
| NGcomp | (5e-03,5e-04) | 0.016 | 16(31) | 1150 | 13 | 33 | 57 | 14:58 | 1.2e-07 | 1.0e-06 | 5.6e-08 |
| train | (1e-03,1e-04) | 0.080 | 15(31) | 1153 | 172 | 35 | 01:04 | 40:11 | 4.6e-07 | 1.0e-06 | 1.9e-06 |
| (300,5) | (5e-04,5e-05) | 0.153 | 15(30) | 1216 | 574 | 33 | 01:07 | 01:12:51 | 4.4e-07 | 1.0e-06 | 1.8e-06 |
| NGrec | (5e-03,5e-04) | 0.005 | 19(35) | 1519 | 5 | 22 | 52 | 02:36 | 1.4e-07 | 1.0e-06 | 3.9e-07 |
| train | (1e-03,1e-04) | 0.068 | 18(37) | 1500 | 16 | 31 | 01:06 | 09:45 | 2.6e-07 | 1.0e-06 | 4.8e-06 |
| (300,4) | (5e-04,5e-05) | 0.124 | 18(35) | 1542 | 48 | 28 | 01:07 | 09:02 | 2.9e-07 | 1.0e-06 | 5.4e-06 |
| NGsci | (5e-03,5e-04) | 0.011 | 17(30) | 1387 | 10 | 21 | 54 | 10:00 | 1.7e-07 | 1.0e-06 | 3.5e-08 |
| train | (1e-03,1e-04) | 0.086 | 16(32) | 1389 | 40 | 32 | 01:01 | 08:57 | 5.1e-07 | 1.0e-06 | 2.6e-06 |
| (300,4) | (5e-04,5e-05) | 0.152 | 16(32) | 965 | 206 | 37 | 42 | 18:10 | 4.1e-07 | 9.9e-07 | 2.9e-06 |
| NGtalk | (5e-03,5e-04) | 0.026 | 18(32) | 2445 | 13 | 26 | 01:41 | 13:24 | 2.6e-07 | 1.0e-06 | 1.9e-06 |
| train | (1e-03,1e-04) | 0.103 | 17(32) | 2448 | 52 | 19 | 01:46 | 13:26 | 1.4e-07 | 1.0e-06 | 4.4e-06 |
| (300,3) | (5e-04,5e-05) | 0.204 | 17(38) | 2385 | 213 | 28 | 01:45 | 19:43 | 7.2e-08 | 1.0e-06 | 4.9e-06 |
| Problem | Density | Iteration | Time | Error | |||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| P | A | M | P | A | M | P | A | M | |||
| (1e-04,1e-05) | 0.039 | 22(33) | 3644 | 6 | 04 | 22 | 28 | 1.1e-07 | 1.0e-06 | 9.8e-07 | |
| SPX500 | (5e-05,5e-06) | 0.138 | 22(35) | 3646 | 8 | 05 | 23 | 25 | 1.5e-07 | 1.0e-06 | 8.4e-06 |
| (100,3) | (2e-05,2e-06) | 0.238 | 23(43) | 2056 | 18 | 08 | 13 | 08 | 4.4e-07 | 1.0e-06 | 2.5e-05 |
| (1e-04,1e-05) | 0.025 | 24(31) | 1409 | 8 | 12 | 23 | 01:20 | 8.8e-08 | 1.0e-06 | 9.4e-06 | |
| SPX500 | (5e-05,5e-06) | 0.084 | 21(28) | 1239 | 17 | 14 | 20 | 04:23 | 9.4e-08 | 1.0e-06 | 9.0e-06 |
| (200,3) | (2e-05,2e-06) | 0.150 | 20(38) | 1363 | 32 | 18 | 23 | 03:28 | 1.4e-07 | 9.9e-07 | 2.0e-05 |
| (5e-04,5e-05) | 0.030 | 22(29) | 3701 | 11 | 12 | 01:01 | 02:20 | 4.3e-08 | 1.0e-06 | 5.4e-06 | |
| SPX500 | (1e-04,1e-05) | 0.127 | 22(30) | 3722 | 105 | 18 | 01:21 | 02:34 | 9.3e-08 | 1.0e-06 | 7.6e-06 |
| (100,11) | (5e-05,5e-06) | 0.206 | 22(30) | 2925 | 393 | 21 | 01:06 | 09:12 | 7.8e-07 | 1.0e-06 | 7.8e-06 |
| (5e-04,5e-05) | 0.018 | 19(24) | 1096 | 28 | 31 | 53 | 36:07 | 8.4e-07 | 1.0e-06 | 4.1e-06 | |
| SPX500 | (1e-04,1e-05) | 0.082 | 19(24) | 1125 | 481 | 49 | 01:08 | 01:27:17 | 7.9e-07 | 1.0e-06 | 5.1e-06 |
| (200,11) | (5e-05,5e-06) | 0.140 | 19(27) | 1101 | 1258 | 01:05 | 01:08 | 03:00:00 | 6.1e-07 | 1.0e-06 | 1.7e-05 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
A Proximal Point Dual Newton Algorithm for Solving Group Graphical Lasso Problems
Yangjing Zhang Department of Mathematics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076 ([email protected]).
Ning Zhang (Corresponding author) College of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China; Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, China ([email protected]). This author is supported in part by the National Natural Science Foundation of China under Grant 11901083
Defeng Sun Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong, China ([email protected]). This author is supported in part by Hong Kong Research Grant Council under Grant PolyU 153014/18P
Kim-Chuan Toh Department of Mathematics, and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076 ([email protected]). This author is supported in part by the Academic Research Fund of the Ministry of Education of Singapore under Grant R-146-000-257-112.
(July, 22, 2020)
Abstract
Undirected graphical models have been especially popular for learning the conditional independence structure among a large number of variables where the observations are drawn independently and identically from the same distribution. However, many modern statistical problems would involve categorical data or time-varying data, which might follow different but related underlying distributions. In order to learn a collection of related graphical models simultaneously, various joint graphical models inducing sparsity in graphs and similarity across graphs have been proposed. In this paper, we aim to propose an implementable proximal point dual Newton algorithm (PPDNA) for solving the group graphical Lasso model, which encourages a shared pattern of sparsity across graphs. Though the group graphical Lasso regularizer is non-polyhedral, the asymptotic superlinear convergence of our proposed method PPDNA can be obtained by leveraging on the local Lipschitz continuity of the Karush-Kuhn-Tucker solution mapping associated with the group graphical Lasso model. A variety of numerical experiments on real data sets illustrates that the PPDNA for solving the group graphical Lasso model can be highly efficient and robust.
Keywords. Group Graphical Lasso, Proximal Point Algorithm, Semismooth Newton Method, Lipschitz Continuity
AMS Subject Classification. 90C22, 90C25, 90C31, 62J10.
1 Introduction
Let be given data matrices. For each , the rows of are observations drawn independently from a Gaussian distribution with mean zero, and the empirical covariance matrix for is given by . In this paper, we consider the following joint graphical model:
[TABLE]
where \Theta=\big{(}\Theta^{(1)},\Theta^{(2)},\dots,\Theta^{(K)}\big{)}\in\mathbb{S}^{p}\times\mathbb{S}^{p}\times\cdots\times\mathbb{S}^{p} is the decision variable, and is a convex penalty term that can promote certain desired structure in the decision variable . Throughout this paper, we assume that the solution set to problem (1.1) is nonempty.
If and , problem (1.1) reduces to the well-known sparse Gaussian graphical model which has been studied by various researchers (e.g., [1, 2, 10, 14, 27, 33, 36]). In many applications, a single Gaussian graphical model is typically enough to capture the conditional independence structure of the random variables. However, in some situations it is more reasonable to fit a collection of such models jointly, due to the similarity or heterogeneity of the data involved. These models for estimating multiple precision matrices jointly are referred to as joint graphical models in [8]. A scenario where joint graphical models are more suitable than a single graphical model is when the data comes from several distinct but closely related classes, which share the same collection of variables but differ in terms of the dependency structures. Their dependency graphs can have common edges across a portion of all classes and unique edges restricted to only certain classes. In this case, fitting separate graphical models for distinct classes does not exploit the similarity among the dependency graphs. In contrast, joint estimation of these models could exploit information across different but related classes. In addition to the data from different classes, another scenario that would favor joint graphical models over a single graphical model is when the data contains sequences of multivariate time-stamped observations. Such data might correspond to a series of dependency graphs over time. Next, we give two practical applications of joint graphical models, which will also be used in our numerical experiments:
The inference of words relationships from webpages or newsgroups: the webpages from the computer science departments of various universities are classified into several classes: Student, Faculty, Course, Project, etc. The 20 newsgroups are grouped into various topics.
- -
The inference of time-varying dependency structures of stocks: the dependency structures among the Standard & Poor’s 500 component stocks might change smoothly over time.
In summary, there are two major applications of the joint graphical models: (i) estimating multiple precision matrices jointly for a collection of variables across distinct classes; (ii) inferring the time-varying networks and finding the change-points.
For solving problem (1.1) with different forms of penalty terms, the alternating direction method of multipliers (ADMM) has been extensively used; see, e.g., [8, 12, 13]. As we know, the ADMM could be a fast first order method for finding approximate solutions of low or moderate accuracy. However, for attaining superlinear convergence to compute highly accurate solutions, one has to incorporate at least in part the second order information of the problem. Yang et al. [34] proposed a proximal Newton-type method, where the subproblem in each iteration can be solved by the nonmonotone spectral projected gradient method [19, 32], and an active set identification scheme was applied to reduce the cost. Another notable contribution is that a screening rule, which can be combined with any method to reduce the computational cost, was proposed in [34]. However, the second order method in [34] is not without drawbacks. Each of its subproblems is a complicated quadratic approximation problem, which generally requires expensive computations. Besides, the inexact proximal Newton-type method proposed in [34] has no guarantee of local linear convergence. It is worth noting that, in a recent paper related to [34], Yue, Zhou, and So [37] studied the local convergence rate of a family of inexact proximal Newton-type methods for solving a class of nonsmooth convex composite optimization problems based on an error bound condition. However, it is not clear to us whether the convergence analysis in [37] can be directly applied to problem (1.1) as the Hessian of the first function in the objective of the problem is not uniformly bounded on its effective domain. More recently, Zhang et al. [38] applied a regularized proximal point algorithm (rPPA) to solve a fused multiple graphical Lasso (FGL) model and heavily exploited the underlying second order information through the semismooth Newton method when solving the subproblems of the rPPA. Due to the polyhedral property of the FGL regularizer, the rPPA for solving the FGL problem is proven to have an arbitrary linear convergence rate in [38].
Our goal in this paper is to design and analyze an efficient second order information based algorithm with economical implementations and a fast convergence rate for solving problem (1.1) with the following non-polyhedral regularizer, which was referred to as the group graphical Lasso (GGL) regularizer in [8]:
[TABLE]
where and are positive parameters. We refer to model (1.1) with the regularizer (1.2) as the GGL model. In fact, the GGL regularizer acting on a collection of matrices can be viewed as an extension of the sparse group Lasso regularizer [11, 28] acting on a vector. The former can be regarded as the latter if the -th elements across all precision matrices are assigned into one group. For , we let be the column vector obtained by taking out the -th elements across all matrices . We can observe that
[TABLE]
where the function is actually a special sparse group Lasso regularizer. The first term of the GGL regularizer promotes sparsity in the estimated precision matrices ’s. The zeros in these precision matrices tend to occur at the same indices due to the second term of the GGL regularizer. In addition, Figure 1 illustrates the structure of the decision variable and the vector belonging to one group .
Inspired by the impressive numerical performance of the rPPA for solving the FGL model [38], we will design a proximal point dual Newton algorithm (PPDNA) for solving the GGL model. Specifically, a proximal point algorithm (PPA) [24] is applied to the primal formulation of the GGL model, and a superlinearly convergent semismooth Newton method is designed to solve the dual formulations of the PPA subproblems. Thanks to the fact that the GGL regularizer is an extension of the sparse group Lasso regularizer, the generalized Jacobian of the proximal mapping of the GGL regularizer can be characterized based on that of the sparse group Lasso regularizer, where the explicit form was given in [39]. As a result, the former naturally inherits the structured sparsity (referred to as the second order sparsity) of the latter. Consequently, multiplying a sparse Hessian matrix by a vector in the semismooth Newton method is reasonably cheap, and one could expect that the superlinearly convergent semismooth Newton method is numerically efficient for solving the PPA subproblems. In addition to achieving low cost in computing the semismooth Newton directions by exploiting the second order sparsity, we also establish the linear convergence guarantee of the PPDNA.
Though the framework of the PPDNA for solving the GGL model is closely related to the rPPA for solving the FGL model [38] and the semismooth Newton based augmented Lagrangian method (SSNAL) for solving the sparse group Lasso problems [39], both the theoretical analysis and numerical implementation should be further investigated owing to the following difficulties of the GGL model. First, unlike the FGL regularizer, the GGL regularizer is a non-polyhedral function and consequently the Lipschitz continuity of the Karush-Kuhn-Tucker (KKT) solution mapping associated with the GGL model is not as straightforward to establish as in [38]. We should mention here that the Lipschitz continuity of the KKT solution mapping plays an important role in establishing the convergence rate of the PPDNA, just as in the case of rPPA and SSNAL. Second, the subproblem of the PPDNA for solving the GGL model differs from those of the SSNAL and rPPA which are strongly convex. Therefore, the stopping criteria previously used in SSNAL and rPPA are no longer applicable. The main contributions of this paper can be summarized as follows.
We prove the Lipschitz continuity of the KKT solution mapping associated with the GGL model, by taking advantage of the strict convexity of the function in its effective domain, the nonsingularity of its Jacobian, and Clarke’s implicit function theorem [5, 6]. Consequently, the linear convergence of the iterative sequence generated by the PPDNA can be established based on the classical results in [24]. Moreover, by choosing the penalty parameter to be sufficiently large, the PPDNA can be made to attain any desired linear convergence rate. More generally, the Lipschitz continuity of the KKT solution mapping of the model still holds even if the GGL regularizer is replaced by any other convex positively homogeneous function.
- 2.
We derive a surrogate generalized Jacobian of the proximal mapping of the GGL regularizer. The second order sparsity in the surrogate generalized Jacobian is analyzed in depth and fully exploited in the PPDNA. Therefore, the superlinearly (or even quadratically) convergent semismooth Newton method can solve the PPA subproblems very efficiently since the semismooth Newton directions can be computed cheaply.
- 3.
We introduce fairly easy-to-check stopping criteria (via the duality theory) for computing inexact solutions of the PPA subproblems without sacrificing the global or linear convergence of the PPDNA. In fact, the standard stopping criteria adopted by Rockafellar [24] would involve the unknown optimal values of the subproblems, which are not easy to check unless the objective function is strongly convex with an explicitly given strong convexity parameter.
The remaining parts of the paper are organized as follows. Section 2 presents some definitions and preliminary results, which include the proximal mapping of the GGL regularizer, its generalized Jacobian, the proximal mapping of the log-determinant function and its derivative. We analyze in section 3 the Lipschitz continuity of the KKT solution mapping associated with the GGL model, which is the key property for deriving the linear convergence rate of our proposed algorithm. In section 4, we propose the PPDNA for solving the GGL model and investigate its convergence properties. We report the numerical performance of the PPDNA on categorical text data and time-varying stock prices data in section 5 and conclude the paper in section 6.
Notation. The following notation will be used in the rest of the paper.
- •
() denotes the cone of positive semidefinite (definite) matrices in the space of real symmetric matrices . For any , we write if and if . In particular, () indicates that ().
- •
We let be the Cartesian product of copies of .
- •
For any matrix , denotes the -th element of .
- •
For any , denotes the column vector obtained by taking out the -th elements across all matrices .
- •
denotes the identity matrix, and denotes an identity matrix or map when the dimension is clear from the context.
- •
We use to denote the largest eigenvalue of a self-adjoint linear operator .
- •
For a given closed convex set and a vector , we denote the Euclidean projection of onto by .
- •
We denote ceil as the smallest integer greater than or equal to .
2 Preliminaries
In this section, we first recall the definition and some relevant properties of the Moreau-Yosida regularization of a proper and closed convex function, which will play an important role in the subsequent theoretical analysis and algorithmic design. Let be a finite dimensional real Hilbert space and be a proper and closed convex function. The Moreau-Yosida regularization [21, 35] of is defined by
[TABLE]
and the proximal mapping of , the unique minimizer of (2.1), is given by
[TABLE]
One critical property of the Moreau-Yosida regularization is that is a continuously differentiable convex function with the following gradient:
[TABLE]
In addition, the proximal mapping satisfies the following Moreau identity [25, Theorem 31.5]:
[TABLE]
where is the conjugate function of (see e.g., [25] for its definition).
2.1 Proximal mapping of the GGL regularizer and its generalized Jacobian
We investigate in this section the proximal mapping of the GGL regularizer in (1.2) and its generalized Jacobian. Recall the function in (1.3):
[TABLE]
By definition, the proximal mapping of is given as follows: for any ,
[TABLE]
It is obvious that problem (2.3) is separable for each vector . Therefore, for any , the vector , consisting of all entries of in the -th position, is given explicitly by
[TABLE]
By this equation, one can compute via performing computations of , and this task can be done in parallel. Parts of the second order information of the underlying problem are contained in the generalized Jacobian of , which can be characterized by the generalized Jacobian of through using the relationship (2.4) between and . Fortunately, the generalized Jacobian of has been carefully investigated in [39] and has an explicit expression.
Let the multifunction be the generalized Jacobian of . Directly from the formula (10) in [39], the multifunction can be described as follows: for any ,
[TABLE]
where , and are the Clarke generalized Jacobians (see [5, Definition 2.6.1] for the definition) of and , respectively. Therefore, the surrogate generalized Jacobian of at any given can be described as follows:
[TABLE]
The next proposition will explain why in (2.11) can be treated as the surrogate generalized Jacobian of at . Based on [39, Theorem 3.1], one can easily prove the proposition. We omit the details here.
Proposition 2.1**.**
Let be the GGL regularizer defined by (1.2) and be any given element. The surrogate generalized Jacobian defined in (2.11) is nonempty compact valued and upper semicontinuous. Any element in the set is a self-adjoint and positive semidefinite operator. Moreover, we have that, for any ,
[TABLE]
2.2 Properties of the log-determinant function
In this subsection, we present some properties on the proximal mapping of the following log-determinant function and its derivative that are mainly adopted from the papers [31, 33]:
[TABLE]
Let be given. Define the following scalar functions:
[TABLE]
In addition, for any with eigenvalue decomposition , we define
[TABLE]
One can observe that and are positive definite for any . Using the functions defined above, the following two propositions give the proximal mapping of the log-determinant function and its derivative.
Proposition 2.2**.**
[33, Proposition 2.3]** Let be the log-determinant function defined by (2.12) and be a positive scalar. Then, for any , it holds that
[TABLE]
Proposition 2.3**.**
[31, Lemma 2.1 (b)]** Let be a given positive scalar. The function is continuously differentiable. For any with eigenvalue decomposition , the derivative at any is given by
[TABLE]
where is defined by
[TABLE]
3 Lipschitz continuity of the KKT solution mapping
In this section, we will prove that the KKT solution mapping associated with the GGL problem is Lipschitz continuous. More generally, we emphasize that the Lipschitz continuity of the KKT solution mapping still holds even if the GGL regularizer is replaced by any other convex positively homogeneous function, since the key properties we need from the regularizer are convexity and positive homogeneity.
The analysis in this section is based on Clarke’s implicit function theorem. For notational convenience, we denote
[TABLE]
Then the GGL problem (1.1) can be equivalently reformulated as follows:
[TABLE]
The Lagrangian function associated with problem (3.2) is given by
[TABLE]
and the dual problem of (3.2) is easily shown to be
[TABLE]
In addition, the KKT system associated with (3.2) and (3.3) is given by
[TABLE]
Since the log-determinant function is strictly convex and the solution set to problem (1.1) is assumed to be nonempty, problem (3.2) has a unique solution. Furthermore, by using [3, Proposition 4.75] one can easily show that the KKT system (3.4) also has a unique solution, denoted by .
For any given , we consider the following linearly perturbed form of problem (3.2)
[TABLE]
As in Rockafellar [23], we define the following maximal monotone operator:
[TABLE]
We also define the KKT solution mapping as
[TABLE]
Define a mapping as follows: for any and ,
[TABLE]
Then it is easy to see that if is nonempty, then it must be a singleton and satisfies
Lemma 3.1**.**
Let and be defined by (3.1). Then all and are self-adjoint and positive definite with and .
Proof.
The proof can be derived from [31, Lemma 2.1]. ∎
Since the GGL regularizer defined by (1.2) is positively homogeneous, its conjugate function is an indicator function of a closed convex set [26, Example 11.4(a)]. Therefore, is the projection onto a closed convex set. We know further from [30, Theorem 2.3] that for any , any element in is a self-adjoint operator whose eigenvalues are in the interval . Thus, by the proof of [30, Theorem 2.5], we can obtain the following lemma, which will be used in Theorem 3.1 to analyze the Lipschitz continuity of the KKT solution mapping defined by (3.6).
Lemma 3.2**.**
Let and be any self-adjoint positive definite operator. Then, for any chosen , the linear operator is nonsingular.
The next theorem will play an essential role in establishing the linear rate of convergence of our proposed proximal point dual Newton algorithm (PPDNA) for solving the GGL problems in section 4.3.
Theorem 3.1**.**
Let be the KKT solution mapping defined by (3.6). Then the following hold:
- (a)
Any element in is nonsingular. Here, we say
* if for some linear operator , it holds that .*
- (b)
The mapping is Lipschitz continuous near the origin; i.e., there exist a neighborhood of the origin and a positive scalar such that for any and
[TABLE]
Proof.
Since is directionally differentiable, we know from the Moreau identity (2.2) that is also directionally differentiable. Therefore, it follows from the chain rule presented in [29, Lemma 2.1] that for any , there exist and such that
[TABLE]
Suppose that there exists such that , i.e.,
[TABLE]
It follows from Lemma 3.1 that both and are self-adjoint and positive definite. This, together with (3.9), implies that
[TABLE]
where We know from Lemma 3.2 that is nonsingular. This, together with (3.9) and (3.10), implies that
[TABLE]
Therefore, is nonsingular, and consequently the statement holds.
The global Lipschitz continuities of the proximal mappings and imply that the mapping defined by (3.7) is Lipschitz continuous. Therefore, the proof of can be obtained by , the fact that for any , the set must be a singleton if it is nonempty, and Clarke’s implicit function theorem [5, Page 256] or [6, Theorem 3.6]. The proof is completed. ∎
4 Proximal point dual Newton algorithm
We aim to develop an implementable proximal point dual Newton algorithm (PPDNA) for solving the GGL problem (3.2). The PPDNA is essentially a proximal point algorithm (PPA) for solving the primal form of the GGL model, and the PPA subproblems are solved via their corresponding dual problems. The dual of each subproblem is to maximize a concave function whose gradient is a semismooth function and thus can be solved by the semismooth Newton method. We begin this section by introducing the PPA [24], i.e., given and , the updating scheme is given by
[TABLE]
where
[TABLE]
with being the indicator function of the set , i.e., if , and if
We allow for inexactness in the updating scheme (4.1) and apply the standard criteria proposed by Rockafellar [24] for controlling the inexactness: given nonnegative summable sequences and such that for all ,
[TABLE]
Since the exact minimizer is typically unknown in each iteration, we should introduce practically implementable stopping criteria in place of (A) and (B) in the subsequent analysis.
4.1 Dual based semismooth Newton method for solving PPA subproblems
In this section, we aim to design the aforementioned dual based semismooth Newton method for solving the subproblems of the PPA framework (4.1):
[TABLE]
The Lagrangian function associated with problem (4.3) is given by
[TABLE]
and the Lagrangian dual problem of (4.3) is
[TABLE]
Since we aim to solve problem (4.3) via its dual problem (4.4), a natural idea is to find the explicit expression for the dual objective function first. The explicit expression for can be obtained from the Moreau–Yosida regularization as follows:
[TABLE]
The last equality is achieved when \Omega^{(k)}=\phi_{\sigma_{t}}^{+}\big{(}\Omega_{t}^{(k)}-\sigma_{t}(S^{(k)}+X^{(k)})\big{)}, for , and . One can find that is continuously differentiable and strongly concave, and the unique solution to problem (4.4) can be obtained by solving the following nonsmooth system
[TABLE]
where
[TABLE]
with
[TABLE]
We can see that if , then one has that with \Omega^{(k)}=\phi_{\sigma_{t}}^{+}\big{(}W^{(k)}_{t}(X)\big{)}, for , . Recall that is differentiable and its derivative is given by Proposition phiprime. Thus, the surrogate generalized Jacobian of at any can be defined as follows:
[TABLE]
Based on the surrogate generalized Hessian of , one can apply the following dual based semismooth Newton method (Algorithm 1) for solving problem (4.4) via solving the nonsmooth equation (4.5). To know more about the local and global methods for nonsmooth equations, we refer the readers to [9, Sections 7 & 8] and references therein. The main computational cost of Algorithm 1 lies in Step 1 for finding the Newton direction. Therefore, we carefully analyze the second order sparsity structure in the surrogate generalized Jacobian and fully exploit the structure to reduce the cost. Due to the computation of in and , the -th iteration of Algorithm 1 requires computations of eigenvalue decompositions.
The following proposition states that Algorithm 1 for solving the dual of the PPA subproblem (4.4) is globally convergent and locally superlinearly or even quadratically convergent if the parameter is chosen to be .
Proposition 4.1**.**
Let be the infinite sequence generated by Algorithm 1. Then converges to the unique optimal solution of (4.4), and the convergence rate is at least superlinear:
[TABLE]
Proof.
Since is directionally differentiable, it follows from Proposition 2.1 that is strongly semismooth with respect to the multifunction in (2.11) (for its definition, see e.g., [17, Definition 1]). Therefore, the conclusion follows from the strong concavity of , Proposition 2.3, and [17, Theorem 3]. ∎
4.2 Implementable stopping criteria for PPA subproblems
Due to the lack of explicit forms of the exact solution , the stopping conditions (A) and (B) need to be replaced by some implementable conditions. Since defined by (4.2) is strongly convex with modulus , one has the estimate
[TABLE]
which implies that
[TABLE]
The unknown value can be replaced by any of its lower bounds converging to it. One choice is to consider the objective value of the dual problem (4.4). In particular, one has that
[TABLE]
and hence
[TABLE]
Therefore, we can terminate Algorithm 1 if satisfies the following conditions: given nonnegative summable sequences and such that for all ,
[TABLE]
4.3 Linear rate convergence of PPDNA
Now, we are ready to formally present the promised PPDNA for solving problem (3.2).
Along the line of Rockafellar’s works [23, 24], the local linear convergence rate of the primal and dual iterative sequences generated by the PPA can be guaranteed by the Lipschitz continuity of the KKT solution mapping near the origin under proper stopping criteria of the PPA subproblems. However, the Lipschitz property of the KKT solution mapping requires the uniqueness of the KKT point, and this property is not straightforward to establish when the regularizer is not a piecewise linear-quadratic function. As the property to ensure the linear convergence rate, especially the uniqueness assumption, is too restrictive to hold, Luque [20] extended the results and proved the local linear convergence of the PPA under the local upper Lipschitz continuity (see e.g., [22, p. 208] for the definition) of the KKT solution mapping at the origin [7, p. 387]. The local upper Lipschitz continuity condition does not make the assumption on the uniqueness of the solution. However, the local upper Lipschitz continuity property may not hold when the KKT solution mapping is not piecewise polyhedral. Fortunately, for our GGL model, the strict convexity of the log-determinant function guarantees the uniqueness of the solution, and we prove in Theorem 3.1 that the KKT solution mapping of the GGL model (defined by (3.6)) is Lipschitz continuous near the origin by taking advantage of the nice properties of the log-determinant function and Clarke’s implicit function theorem. Therefore, the local linear convergence rate of the PPDNA can be obtained via the classical results by Rockafellar. The convergence results of Algorithm 2 for solving problem (3.2) are presented below.
Theorem 4.1**.**
Let be an infinite sequence generated by Algorithm 2 under stopping criterion . Then the sequence converges to the unique solution of (3.2), and the sequence converges to the unique solution of (3.3). Furthermore, if both criteria and are executed in Algorithm 2, then there exists such that for all , one has that
[TABLE]
where the convergence rate is given by
[TABLE]
and the parameter is from (3.8).
Proof.
The global convergence of Algorithm 2 can be obtained from (4.6), [24, Theorem 1], and the uniqueness of the KKT point. The linear rate of convergence can be derived from (4.6), Theorem 3.1, and [24, Theorem 2]. The proof is completed. ∎
4.4 Extensions of PPDNA
Although the theoretical analysis and the algorithmic design presented in section 3 and section 4 focus on the GGL regularizer, these results can also be applied to the joint graphical model (1.1) with a different regularizer satisfying the following conditions:
- (a)
the regularizer is convex and positively homogenous, (e.g., a norm function);
- (b)
the proximal mapping associated with the regularizer can be efficiently computed and its surrogate generalized Jacobian can be explicitly characterized.
For example, we can show that both the pairwise fused graphical Lasso regularizer [8, Equation (5)] and the sequential fused graphical Lasso regularizer [34, Formula (2.2)] satisfy the conditions (a) and (b). More specifically, let and be positive parameters. The pairwise fused graphical Lasso regularizer and sequential fused graphical Lasso regularizer are given as follows:
[TABLE]
where , and
[TABLE]
where
By applying the same procedure as in section 2.1 for the GGL regularizer, we can obtain the proximal mappings associated with and their surrogate generalized Jacobians from that of the clustered Lasso regularizer [18] and the fused lasso regularizer [17], respectively. Therefore, we can apply our PPDNA framework for solving the joint graphical model with a different regularizer given by either (4.7) or (4.8).
In addition to the direct extensions to the two regularizers above, the PPDNA framework is also applicable to joint graphical models with other regularizers discussed in [13]. More specifically, the regularizers in [13] have the following form:
[TABLE]
where All the choices of the penalty function in [13, Section 2.1] can ensure condition (a) except for the Laplcacian penalty . Therefore, except for the case of the Laplacian penalty, the PPDNA framework can be directly applied once condition (b) holds. For the exceptional case, we may slightly modify our framework. Specifically, each iteration should be modified as follows: given and , the updating scheme is given by
[TABLE]
where
[TABLE]
with being the indicator function of the set
[TABLE]
Then the resulting modified PPDNA can be obtained by using arguments similar to those in section 4. But we should mention that further investigation will be necessary to overcome the underlying difficulty that the dual of the subproblems of (4.9) may not necessarily be strongly convex. We leave this part as our future research topic.
5 Numerical results
In this section, we evaluate the performance of the PPDNA in comparison with the ADMM and the proximal Newton-type method implemented in the work [34], which is referred to as MGL111The solver is available at http://senyang.info/.. All the experiments are performed in Matlab (version 9.7) on a Windows workstation (24-core, Intel Xeon E5-2680 @ 2.50GHz, 128 GB of RAM).
5.1 Implementation of ADMM
In this section, we briefly describe the ADMM for solving the dual problem (3.3), which can be equivalently written as follows:
[TABLE]
Given a parameter , the augmented Lagrangian function associated with (5.1) is defined by
[TABLE]
and the KKT optimality conditions are
[TABLE]
The iterative scheme of the ADMM for problem (5.1) can be described as follows: given and an initial point , the -th iteration is given by
[TABLE]
where can be updated by , and can be updated by Z^{(k)}_{t+1}=\phi^{+}_{{\sigma}^{-1}}\big{(}X^{(k)}_{t}-\frac{1}{\sigma}\Theta^{(k)}_{t}+S^{(k)}\big{)}, .
In the practical implementation, we tuned the parameter according to the progress of primal and dual feasibilities (see e.g., [15, Section 4.4]) and used a larger step-length of . These two techniques can empirically accelerate the convergence speed. It is worth noting that the ADMM implemented by Yang et al. [34] used a fixed penalty parameter and the step-length .
5.2 Settings of experiments
The experimental settings are the same as those in [38, Section 4]. We adopt the stopping criteria of PPDNA, ADMM and MGL as below. Let be a given tolerance. It is set as in the following experiments.
- •
The PPDNA is terminated if , where
[TABLE]
with
[TABLE]
- •
The ADMM is terminated when or iterations are taken, where
[TABLE]
- •
The MGL is terminated when the relative difference of its objective value with respect to the primal objective value obtained by the PPDNA is smaller than the given tolerance or the relative duality gap achieved by the PPDNA, i.e.,
[TABLE]
where , , and are the objective values obtained by the PPDNA, the MGL, and the relative duality gap attained by the PPDNA.
It is worth mentioning that we adopt a warm-starting technique in the initial stage of the PPDNA, instead of starting it from scratch. The warm-starting procedure consists of first running the ADMM (with identity matrices as the starting point) for a fixed number of iterations (3000 steps in our experiments) or up to a given tolerance ( in our experiments), and then using the resulting approximate solution as an initial point to warm-start the PPDNA. This idea is greatly motivated by two facts: 1) the ADMM can generate a solution of low to medium accuracy efficiently and might become slow when higher accuracy is required; 2) our algorithm PPDNA has been proven to be locally linearly convergent. Therefore, the warm-starting technique can integrate the advantages of both ADMM and PPDNA.
We set the initial parameter in the stopping criterion to be , and decrease it by a ratio , i.e., . Likewise, the parameter in the stopping criterion is updated in the same fashion as that for . For the parameters in Algorithm 1, we simply set and according to [40]. For the step on line search, we set and .
5.3 Descriptions of datasets
In this part, we describe the datasets which will be used later. Since these datasets have been discussed in [38], we briefly review them for the ease of reading:
- •
University webpages data set222http://ana.cachopo.org/datasets-for-single-label-text-categorization: The original data was collected from computer science departments of various universities in 1997, manually classified into seven different classes: Student, Faculty, Course, Project, Staff, Department, and Other. The data we use, consisting of the first four classes, is preprocessed by stemming techniques [4]. Two thirds of the pages were randomly chosen for training (Webtrain) and the remaining third for testing (Webtest).
- •
20 newsgroups data set333http://qwone.com/$\sim$jason/20Newsgroups/: This data set has 20 topics of newsgroup documents, and some of the topics are closely related to each other, while others are highly unrelated. Four subgroups are named as NGcomp, NGrec, NGsci, and NGtalk accordingly and will be used in our experiments.
- •
SPX500 component stocks444www.yahoo.com: This data set contains the daily returns of Standard & Poor’s 500 (SPX500) constituents from 2004 to 2014. We also test on extracted data from 2004 to 2006.
5.4 Performance of PPDNA
In this part, we first give an elementary report of the effectiveness of the GGL model on synthetic nearest-neighbor networks generated by the mechanism in [16]. Second, we illustrate numerically the local linear convergence of the PPDNA for solving two representative instances, in correspondence with Theorem 4.1 which shows theoretically the local linear convergence of the PPDNA.
5.4.1 Synthetic data: nearest-neighbor networks
In this example, we choose and . The synthetic precision matrices, denoted as , are generated as follows. We first generate points on a unit square randomly, calculate their pairwise distances, and identify nearest neighbors of each point. The nearest-neighbor network is then obtained by linking any two points that are nearest neighbors of each other, and we denote the number of its edges as . Subsequently, we obtain each by adding extra edges to the common nearest-neighbor network. For each , a pair of symmetric zero elements is randomly selected from the nearest-neighbor network and replaced with a value uniformly drawn from the interval . is obtained after this procedure is repeated ceil times. We find in our simulation that the true number of edges in the three networks is . Given the precision matrices, we draw samples from each Gaussian distribution to compute the sample covariance matrices. Next we specify the tuning parameters and . Following [8], we reparameterize and in order to separate the regularization for “sparsity” and for “similarity” since both parameters contribute to sparsity: drives individual network edges to zero whereas drives network edges to zero across all network estimates at the same time. We reparameterize them in terms of which are found in [8] to reflect the levels of sparsity and similarity regularization and are called the sparsity and similarity control parameters, respectively. In order to show the diversity of sparsity in our experiments, we change with fixed. Figure 2 characterizes the relative abilities of the GGL model to recover the network structures and to detect change-points.
Figure 2(a) displays the number of true positive (TP) edges selected against the number of false positive (FP) edges. We say that an edge is selected in the estimate if , and the edge is true if and false if . We can see that the model with can recover almost all of the TP edges without FP edges. This suggests that the GGL model is effective for recovering the edges in the nearest-neighbor networks. Figure 2(b) illustrates the sum of squared errors between estimated edge values and true edge values, i.e., \sum_{k=1}^{K}\sum_{i<j}\big{(}\overline{\Theta}^{(k)}_{ij}-\Sigma^{(k)}_{ij}\big{)}^{2}. When the number of the total edges selected increases (i.e., the sparsity control parameter decreases), the error decreases and finally reaches a fairly low value. Figure 2(c) plots the number of TP differential edges against FP differential. An edge that differs between networks is called a differential edge and thus corresponds to a change-point. Numerically, we say that the edge is estimated to be differential between the -th and the -th networks if , and we say that it is truly differential if . The number of differential edges is computed for all successive pairs of networks. One can observe in Figure 2(c) that the results obtained with have approximately 3000 TP differential edges and almost no false ones. This suggests that the GGL model can be a suitable model to use in change-point detection of nearest-neighbor networks.
5.4.2 Linear rate convergence
The purpose of this section is to demonstrate numerically the local linear convergence of the PPDNA. Specifically, we conduct experiments on two representative instances: (a) categorical data: Webtrain with , e-e-; (b) time-varying data: SPX500 with , e-e-. Due to the lack of exact optimal solutions of these instances, we run the PPDNA until the accuracy of is achieved and regard the resulting approximate solution as the true solution . We denote
[TABLE]
In Figure 3, we plot against the iteration count under two different choices of the penalty parameter : is fixed or increased by a ratio. When is fixed, the solid blue line in the figure indicates that the convergence rate is almost constant. When , i.e., the penalty parameter is gradually increasing, the dash-dotted red line shows that the convergence rate is increasingly fast. The observation is consistent with Theorem 4.1, which demonstrates numerically the local linear convergence rate of the PPDNA. We should emphasize that the impressive linear convergence rate depicted in the solid blue curve in Figure 3(a) is attained with fixed at a large value of , whereas the slower initial convergence shown in the dash-dotted red curve is due to slowly increasing the parameter from a small initial value of . The same remark is also applicable to Figure 3(b).
The dependence of the linear rate of convergence on also sheds light on the choice of in our implementation. Basically we adaptively update to strike a good balance in the trade-off between the convergence rate of the PPDNA and the difficulty in computing the Newton directions (via the CG method) in the semismooth Newton method (Step 1 of Algorithm 1). As the condition number of the Newton linear system in Step 1 of Algorithm 1 is proportional to , the CG method will converge more slowly for a larger . Thus in our experiments, we start from a small , e.g., , and gradually increase by some factor , i.e., .
5.5 Comparison with ADMM and MGL
In this section, we compare our algorithm PPDNA for solving the GGL model with the ADMM described in (5.2) and the MGL implemented in [34]. For the tuning parameters and , we select three pairs for each instance that produce reasonable sparsity. In the following tables, “P” stands for PPDNA; “A” stands for ADMM; “M” stands for MGL. In the column under “Iteration”, we report the number of iterations taken by various algorithms. In particular, for the PPDNA, we report the number of the PPA iterations taken and the number (within the parentheses) of semismooth Newton linear systems solved. Let “nnz” denote the number of nonzero entries in the solution obtained by the PPDNA using the following estimation: where is the vector obtained by sorting all elements of by magnitude in a descending order. In the tables, “density” denotes the quantity nnz. The time is displayed in the format of “hours:minutes:seconds”, and the fastest method is highlighted in red. The error reported for the PPDNA in the tables is the relative KKT residual . That of the ADMM is ; while the error for the MGL is .
Table 1 shows the comparison of three methods PPDNA, ADMM, and MGL on the university webpages data sets. The PPDNA successfully solved all instances in Table 1 within about one minute. For a large majority of tested instances, the PPDNA is faster than the ADMM and the MGL. It suggests that the PPDNA is robust and efficient for solving the GGL model applied to the university webpages data.
Table 2 presents the comparison of PPDNA, ADMM, and MGL on the 20 newsgroups data sets. One can see clearly that the PPDNA outperforms the ADMM and the MGL for most instances in Table 2. It demonstrates that the PPDNA can be efficient for solving the GGL model. For some difficult instances, e.g., NGcomp train e-e-, our PPDNA took less than one minute while the MGL took more than one hour. Again, the results show that our PPDNA is robust for solving the GGL model. The superior performance of our PPDNA can primarily be attributed to our ability to extract and exploit the sparsity structure (in ) within the semismooth Newton method to solve the PPA subproblems very efficiently.
Table 3 gives the results on the Standard & Poor’s 500 component stock price data set SPX500. The table shows that the PPDNA is faster than both the ADMM and the MGL for all instances. In addition, we find that both the PPDNA and the ADMM succeeded in solving all instances, while the MGL failed to solve one of them within three hours. This might imply that the MGL is not robust for solving the GGL model when applied to the stock price data sets. The numerical results show convincingly that our algorithm PPDNA can solve the GGL problem highly efficiently and robustly.
6 Concluding remarks
In this paper, we have taken advantage of the ideas proposed in [17, 39] and implemented a proximal point dual Newton algorithm (PPDNA) to the primal formulation of the group graphical Lasso problems. From a theoretical standpoint, we have shown that the PPDNA is globally convergent and the sequence of primal and dual iterates is Q-linearly convergent, although the group graphical Lasso regularizer is non-polyhedral. The robustness and superior numerical efficiency of the PPDNA are convincingly demonstrated in various numerical experiments. Therefore, we can firmly conclude that the PPDNA is not only a fast method with nice theoretical guarantees, but also a numerically efficient method for solving the group graphical Lasso problems with multiple precision matrices.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] O. Banerjee, L. E. Ghaoui, and A. d’Aspremont , Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data , Journal of Machine Learning Research, 9 (2008), pp. 485–516.
- 2[2] M. Bollh o ¨ ¨ 𝑜 \ddot{o} fer, A. Eftekhari, S. Scheidegger, and O. Schenk , Large-scale sparse inverse covariance matrix estimation , SIAM Journal on Scientific Computing, 41 (2019), pp. A 380–A 401.
- 3[3] J. F. Bonnans and A. Shapiro , Perturbation Analysis of Optimization Problems , Springer Science & Business Media, 2000.
- 4[4] A. Cardoso-Cachopo , Improving methods for single-label text categorization . Ph D Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, 2007.
- 5[5] F. H. Clarke , Optimization and Nonsmooth Analysis , vol. 5, SIAM, 1990.
- 6[6] F. H. Clarke, Y. S. Ledyaev, R. J. Stern, and P. R. Wolenski , Nonsmooth Analysis and Control Theory , vol. 178, Springer, New York, NY, 1998.
- 7[7] Y. Cui, D. F. Sun, and K.-C. Toh , On the R-superlinear convergence of the KKT residuals generated by the augmented Lagrangian method for convex composite conic programming , Mathematical Programming, 178 (2019), pp. 381–415.
- 8[8] P. Danaher, P. Wang, and D. M. Witten , The joint graphical lasso for inverse covariance estimation across multiple classes , Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76 (2014), pp. 373–397.
