A New Homotopy Proximal Variable-Metric Framework for Composite Convex Minimization
Quoc Tran-Dinh, Liang Ling, and Kim-Chuan Toh

TL;DR
This paper introduces a novel homotopy proximal variable-metric framework for composite convex minimization, achieving finite iteration bounds and computational efficiency, especially in covariance estimation problems without matrix inversion.
Contribution
It develops a new parameterization for optimality conditions enabling homotopy proximal variable-metric methods and introduces a primal-dual-primal framework for proximal-Newton methods.
Findings
Achieves finite global iteration-complexity bounds under certain conditions.
Develops a primal-dual-primal framework that improves computational features.
Applies the method to covariance estimation without matrix inversion.
Abstract
This paper suggests two novel ideas to develop new proximal variable-metric methods for solving a class of composite convex optimization problems. The first idea is a new parameterization of the optimality condition which allows us to develop a class of homotopy proximal variable-metric methods. We show that under appropriate assumptions such as strong convexity-type and smoothness, or self-concordance, our new schemes can achieve finite global iteration-complexity bounds. Our second idea is a primal-dual-primal framework for proximal-Newton methods which can lead to some useful computational features for a subclass of nonsmooth composite convex optimization problems. Starting from the primal problem, we formulate its dual problem, and use our homotopy proximal Newton method to solve this dual problem. Instead of solving the subproblem directly in the dual space, we suggest to dualize…
| Dataset | a1a | a9a | covtype | news20 | mnist17 | mnist38 | rcv1 | real-sim | w1a | w8a |
|---|---|---|---|---|---|---|---|---|---|---|
| #samples[] | 30956 | 16281 | 581012 | 19996 | 317402 | 292363 | 677399 | 72309 | 47272 | 14951 |
| #features[] | 123 | 122 | 54 | 1355191 | 784 | 784 | 47236 | 20958 | 300 | 300 |
| Datasets | HomoPN | HomoQuasiPN | PG | Ls-Rs-APG | Sparsity | | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| iter | time[s] | rgap | iter | time[s] | rgap | iter | time[s] | rgap | iter | time[s] | rgap | |||
| a1a | 7 | 0.247 | 0.3 | 13 | 0.190 | 0.3 | 1049 | 12.264 | 10.0 | 1088 | 16.272 | 7.6 | 11.38% | 1e-2 |
| a9a | 6 | 0.153 | 0.4 | 13 | 0.169 | 1.1 | 1101 | 6.849 | 10.0 | 651 | 4.959 | 3.5 | 12.30% | 1e-2 |
| covtype | 12 | 3.621 | 2.1 | 121 | 5.938 | 7.0 | — | — | — | 10508 | 3997.766 | 9.8 | 9.26% | 1e-5 |
| mnist17 | 6 | 101.088 | 0.3 | 24 | 25.030 | 6.2 | — | — | — | 912 | 2030.214 | 8.0 | 10.08% | 2.5e-3 |
| mnist38 | 4 | 115.304 | 6.1 | 14 | 36.109 | 9.7 | — | — | — | 13392 | 4151.542 | 10.0 | 11.86% | 5.5e-3 |
| news20 | 5 | 9.149 | 0.1 | 13 | 104.030 | 1.2 | 1307 | 71.579 | 10.0 | 90 | 57.225 | 3.4 | 11.04% | 2.5e-6 |
| rcv1 | 11 | 244.589 | 7.7 | 41 | 782.247 | 3.8 | 10291 | 3254.890 | 10.0 | 409 | 1222.426 | 5.0 | 10.14% | 2.5e-6 |
| real-sim | 6 | 3.966 | 7.7 | 22 | 24.055 | 6.5 | 730 | 14.182 | 10.0 | 103 | 14.255 | 9.0 | 11.25% | 2e-5 |
| w1a | 7 | 0.620 | 1.6 | 19 | 0.426 | 6.7 | 910 | 31.084 | 10.0 | 666 | 13.710 | 7.6 | 10.00% | 1e-3 |
| w8a | 7 | 0.245 | 1.4 | 19 | 0.380 | 2.1 | 891 | 9.172 | 10.0 | 297 | 1.856 | 6.6 | 10.00% | 1e-3 |
| Problem | (#Iterations) Time[s] | Objective value | |||||||
|---|---|---|---|---|---|---|---|---|---|
| HomoPN | MUL[24] | IP[39] | SDPT3 | HomoPN | MUL[24] | IP[39] | SDPT3 | ||
| 1 | 10000 | (7)0.088 | (2509)0.407 | (127)0.262 | 0.399 | 20.51196 | 20.51254 | 20.51195 | 20.51241 |
| 1 | 50000 | (7)0.368 | (2510)1.050 | (122)0.998 | 1.127 | 20.50907 | 20.50981 | 20.50907 | 20.50908 |
| 1 | 100000 | (7)0.956 | (2855)3.073 | (120)1.962 | 2.008 | 20.50871 | 20.50943 | 20.50872 | 20.50883 |
| 2 | 10000 | (7)0.057 | (2493)0.571 | (102)0.211 | 0.368 | 0.410236 | 0.410745 | 0.410221 | 0.410220 |
| 2 | 50000 | (6)0.266 | (3278)3.122 | (101)0.867 | 1.280 | 0.409263 | 0.409964 | 0.409267 | 0.409260 |
| 2 | 100000 | (5)0.525 | (3910)9.361 | (100)1.935 | 2.701 | 0.409143 | 0.409795 | 0.409154 | 0.409145 |
| 3 | 10000 | (5)0.048 | (1619)0.428 | (97)0.236 | 0.415 | 5.142670 | 5.142919 | 5.142671 | 5.142670 |
| 3 | 40000 | (5)0.214 | (2208)2.378 | (97)0.784 | 1.578 | 5.082114 | 5.082363 | 5.082119 | 5.082113 |
| 3 | 90000 | (5)0.535 | (2407)7.178 | (95)1.774 | 3.827 | 5.062011 | 5.062261 | 5.062024 | 5.062011 |
| 4 | 10000 | (6)0.069 | (2512)0.488 | (135)0.271 | 0.325 | 7.251897 | 7.252565 | 7.251890 | 7.251888 |
| 4 | 50000 | (6)0.402 | (3413)2.870 | (130)1.045 | 1.064 | 7.251892 | 7.252527 | 7.251895 | 7.251889 |
| 4 | 100000 | (6)1.032 | (4013)8.480 | (128)2.554 | 2.294 | 7.251891 | 7.252461 | 7.251902 | 7.251888 |
| Problem | Time [s] | Objective value | |||||
|---|---|---|---|---|---|---|---|
| HomoPN | MUL[24] | IP[39] | HomoPN | MUL[24] | IP[39] | ||
| 1 | 10000 | 0.203 | 9.172 | 57.830 | 92.552013 | 92.552097 | 96.377758 |
| 1 | 50000 | 1.134 | 48.288 | 264.658 | 92.541322 | 92.541593 | 96.373413 |
| 1 | 100000 | 2.589 | 97.433 | 468.258 | 92.539162 | 92.540231 | 96.372813 |
| 2 | 10000 | 0.303 | 6.950 | 71.1327 | 18.424093 | 18.424563 | 22.387641 |
| 2 | 50000 | 1.555 | 43.340 | 229.223 | 18.416971 | 18.417293 | 22.388450 |
| 2 | 100000 | 2.862 | 84.389 | 543.588 | 18.416223 | 18.416356 | 22.389739 |
| 3 | 10000 | 0.188 | 17.657 | 2.140 | 30.158301 | 30.158329 | 30.158325 |
| 3 | 50000 | 0.822 | 72.969 | 240.569 | 29.956255 | 29.956513 | 37.689778 |
| 3 | 100000 | 2.217 | 165.105 | 556.789 | 29.889216 | 29.889565 | 37.692640 |
| Dataset | vegas | games | news | 911 | facbook | wine |
|---|---|---|---|---|---|---|
| # samples | 504 | 4940 | 39644 | 33059 | 99003 | 150930 |
| # features | 90 | 156 | 59 | 21 | 254 | 1154 |
| Datasets | Time [s] | Number of iterations | |||||
| HomoPN | BFGS | L-BFGS-LS-R | HomoPN | BFGS | L-BFGS-LS-R | ||
| vegas | 0.04 | 0.23 | 0.21 | 5 | 11 | 11 | 4.2695e+00 |
| games | 0.12 | 1.34 | 0.99 | 6 | 22 | 22 | 4.5610e+00 |
| news | 1.82 | 52.94 | 48.57 | 9 | 220 | 215 | 1.1447e+02 |
| 911 | 0.10 | 0.87 | 0.55 | 6 | 15 | 15 | 8.2290e+00 |
| 1.54 | 62.36 | 64.68 | 8 | 39 | 48 | 4.6651e+01 | |
| wine | 2.89 | 455.28 | 459.81 | 6 | 40 | 40 | 1.9819e+01 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
\headers
A New Homotopy Proximal Variable-Metric FrameworkQ. Tran-Dinh, Liang Ling, and Kim-Chuan Toh
A New Homotopy Proximal Variable-Metric Framework for Composite Convex Minimization
Quoc Tran-Dinh Department of Statistics and Operations Research, University of North Carolina at Chapel Hill (UNC), 333-Hanes Hall, Chapel Hill, NC27599-3260, USA.
Email: [email protected].
Liang Ling*†*
Kim-Chuan Toh Department of Mathematics, and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076.
Email:{liang.ling, mattohkc}@nus.edu.sg)
Abstract
This paper suggests two novel ideas to develop new proximal variable-metric methods for solving a class of composite convex optimization problems. The first idea is a new parameterization of the optimality condition which allows us to develop a class of homotopy proximal variable-metric methods. We show that under appropriate assumptions such as strong convexity-type and smoothness, or self-concordance, our new schemes can achieve finite global iteration-complexity bounds. Our second idea is a primal-dual-primal framework for proximal-Newton methods which can lead to some useful computational features for a subclass of nonsmooth composite convex optimization problems. Starting from the primal problem, we formulate its dual problem, and use our homotopy proximal Newton method to solve this dual problem. Instead of solving the subproblem directly in the dual space, we suggest to dualize this subproblem to go back to the primal space. The resulting subproblem shares some similarity promoted by the regularizer of the original problem and leads to some computational advantages. As a byproduct, we specialize the proposed algorithm to solve covariance estimation problems. Surprisingly, our new algorithm does not require any matrix inversion or Cholesky factorization, and function evaluation, while it works in the primal space with sparsity structures that are promoted by the regularizer. Numerical examples on several applications are given to illustrate our theoretical development and to compare with state-of-the-arts.
Keywords: Homotopy method; proximal variable-metric algorithm; global convergence rate; finite iteration-complexity; primal-dual-primal framework; composite convex minimization.
{AMS}
90C25, 90C06, 90-08
1 Introduction
Problem statement
We are interested in the following composite convex minimization template that covers various of applications in different fields including statistics, machine learning, image and signal processing, and engineering [3, 4, 8, 9, 15, 49]:
[TABLE]
where and are proper, closed, and convex functions. Here, often represents a loss function or a data fidelity term, while is considered as a regularizer or a penalty to promote some desired structures of the final solutions.
Motivation
This paper aims at addressing two questions arisen from numerical methods for solving (1). The first question concerns the global iteration-complexity of second-order-type methods. It is well-known that second-order methods such as Newton-type algorithms have fast local convergence rates under certain assumptions. In particular, the classical Newton method can achieve a local quadratic convergence rate under the local Lipschitz continuity of the Hessian around an optimal solution and the regularity of such an optimal solution [14]. However, global convergence behaviors as well as global convergence rates and iteration-complexity estimates of second-order-type methods have not yet been well understood. Recent attempts to address the aforementioned issues have been made for Newton-type methods [42, 44, 45], but they are still limited to some subclasses of problems such as self-concordant and global Lipschitz Hessian functions. In the first part of this paper, we address the following question.
- •
When can we design second-order-type methods that achieve global iteration-complexity?
Unfortunately, we do not have a complete answer for this question. However, we identify three different subclasses of (1) where we can develop new proximal variable-metric methods to achieve a global iteration-complexity. Our algorithms can solve nonsmooth instances of (1), but require to be smooth and satisfy other additional mild conditions that are different from existing methods.
In the second part of this paper, we address another situation of (1). We observe that existing methods for solving nonsmooth instances of (1) can be classified into the following categories:
- (a)
If is smooth with the Lipschitz gradient and is nonsmooth but proximally tractable as defined in Subsection 2.2, then accelerated proximal-gradient methods achieve optimal convergence rate of , where is the iteration counter. If is twice differentiable, and its Hessian is Lipschitz continuous [34] or self-concordant [41], then we can apply proximal-Newton methods [34, 59, 61] to efficiently solve (1). 2. (b)
If both and are proximally tractable, then operator splitting schemes such as Douglas-Rachford’s methods can be used to efficiently solve (1) but with a sublinear rate [3, 12]. 3. (c)
If for a given linear operator , and both and are proximally tractable, then primal-dual methods such as Chambolle-Pock’s and primal-dual hybrid gradient methods, and alternating direction methods of multipliers (ADMM) can be applied to (1). These methods also achieve a sublinear rate in general.
We instead consider the following subclass of (1), where
- (d)
is self-concordant as defined in Definition A.1; and is given by , where is a linear operator, and is nonsmooth and convex, but proximally tractable.
Under this setting, existing methods such as proximal-gradient-type schemes are often not efficient for solving (1) due to the expensive evaluation of the proximal operator of . We address the following research question:
- •
What is an appropriate solution method to solve (1) under the conditions stated in the subclass ?
This question may have multiple answers. One can apply some primal-dual methods to solve it. However, these methods only have a sublinear convergence rate. We instead propose a primal-dual-primal approach to solve (1) which consists of the following steps:
Construct the Fenchel dual problem of (1) when . 2. 2.
Apply our homotopy proximal-Newton method in the first part to solve the dual problem. 3. 3.
Instead of solving the dual subproblem, we dualize it to go back to the primal space. 4. 4.
Construct an approximate primal solution of (1) from its dual approximate solution.
The idea of using primal-dual approach is classical, but our primal-dual-primal method has various computational advantages as well as a linear convergence rate when it is applied to the subclass (d) of (1). As a motivating example, we will show in Section 6.6 that this approach is very suitable for covariance estimation problem (2) below.
Examples
Apart from two research questions above, our paper is also motivated by several prominent applications. Let us recall a few concrete examples of (1):
*Covariance estimation models: * If in (1), where is a given symmetric matrix, then (1) covers both covariance and inverse covariance estimation problems in the literature depending on the choice of [2, 18, 32]:
[TABLE] 2. 2.
Poisson log-likelihood models: If we choose , where is a given dataset, then (1) covers Poisson log-likelihood models used in medical imaging, see, e.g., [35]. 3. 3.
Regularized logistic regression: If we choose , where is a given dataset, and is a regularization parameter, then (1) covers the well-known logistic models including both sparse and group sparse settings under an appropriate choice of . 4. 4.
Poisson regression: If , where is given, then we obtain a Poisson regression problem as studied in [27, 29]. 5. 5.
Distance-weighted discrimination (DWD): If , for some fixed order , then this model can be considered as a slight modification of the distance-weighted discrimination (DWD) for binary classification studied in [33, 40].
Many other applications of (1) that fit our assumptions can be found, for example in [11, 48, 61].
Literature review
Problem (1) is well studied in the literature under different assumptions on and . Hitherto, several methods have been proposed for solving (1). Such methods include disciplined convex programming [22, 64], proximal gradient and accelerated proximal gradient [4, 41, 43], proximal Newton-type [6, 34, 61], splitting and alternating optimization [8, 15, 20, 68], primal-dual [9, 58], coordinate descent [16, 50, 46, 65], conditional gradient [23, 28], stochastic gradient-type methods [1, 13, 47, 53, 66], and incremental proximal gradient schemes [7].
Existing first-order methods for solving (1) heavily rely on the assumption that has Lipschitz gradient [41] and the proximal tractability of [3, 49] as defined in Subsection 2.2. Another common subclass of (1) is that for a given linear operator , and both and are proximally tractable. Under this setting, operator splitting and primal-dual approaches can be applied to solve (1). Notable works in this direction include primal-dual hybrid gradient schemes, Chambolle-Pock’s methods, Douglas-Rachford and Vu-Condat splitting algorithms, and alternating direction methods of multipliers [3, 8, 9, 15, 21]. While first-order methods offer a low per-iteration computational complexity, they often require a large number of iterations and have a sublinear convergence rate. In addition, their efficiency also depends sensitively on the scaling and conditioning of the problem [19].
Proximal second-order methods such as proximal quasi-Newton [6, 30, 54] and proximal-Newton methods [34, 61] often achieve a high accuracy solution and have good local convergence rate but they usually have high per-iteration computational complexity. In proximal second-order-type methods, the trade-off between iteration-complexity and per-iteration computational complexity is crucial to obtain a good performance. Some existing works such as [6, 25, 26, 30, 59, 61] have provided evidence showing that second-order methods outperform first-order methods for some important subclasses of (1). The recent work [31] also studied the global linear convergence of Newton methods, but using a different concept called “-Hessian stable”. Nevertheless, it is completely different from our approach.
Our approach
Our approach here relies on a combination of different ideas. The first idea is the homotopy method, which has been used in interior-point methods [41] and recently in path-following proximal Newton algorithms [60], where the main iterations rely on a scaled proximal Newton scheme [60]. The second idea is a new parameterization of the optimality condition of (1) as presented in Subsection 3.1. Our third idea is inspired by the generalized self-concordance concept introduced in [56]. The last one is a primal-dual-primal framework that we have mentioned above.
Our contribution
Our contribution can be summarized as follows.
We suggest a new parameterization for the optimality condition of (1) as a framework to study homotopy proximal variable-metric methods for solving different subclasses of (1). This framework covers homotopy proximal-gradient, proximal quasi-Newton, and proximal-Newton methods, and their inexact variants as special cases.
We propose a homotopy proximal variable-metric scheme, Algorithm 1, to solve (1) based on our new parameterization strategy. We show that this scheme achieves a global linear convergence rate under the strong convexity and Lipschitz gradient assumptions w.r.t. a local norm, and the Lipschitz continuity of . We also propose an inexact homotopy proximal-Newton method to solve (1). Under the self-concordant property of , and either the Lipschitz continuity of or the barrier property of , our algorithm can also achieve a finite global iteration-complexity estimate. With an appropriate choice of initial points or suitable assumptions on and/or , our method can achieve a linear convergence rate.
We propose a primal-dual-primal approach for a subclass of (1) where is self-concordant. This approach produces a new homotopy primal-dual proximal-Newton algorithm which can also achieve a linear convergence rate under given assumptions.
We specialize our algorithm to solve a special case (2) of (1) known as a regularized covariance estimation problems studied in [2, 18, 25, 26, 32, 59, 61]. This algorithmic variant possesses the following new features compared to existing works [25, 26, 59, 61]. First, it is applicable to any regularizer instead of just the -norm as in [25, 26]. Second, it deals with the dual form of (2), while allowing one to reconstruct an approximate primal solution for (2). Third, it does not require any Cholesky factorization or matrix inversion as in [59] by working on the dual form. Fourth, the subproblem for computing proximal Newton directions is in the primal space of which has some special structures as promoted by the regularizer instead of in the dual space where the dual variable has structures that are correspondingly promoted by the conjugate . The last point is important computationally when promotes the sparsity or low-rankness of the solutions.
Let us emphasize the following aspects of our contribution. Firstly, our new parameterization strategy can potentially be used to develop new numerical methods for different subclasses of (1) instead of only the four cases studied in this paper. Secondly, our path-following scheme for finding an appropriate initial point of Algorithm 1 is independent of a starting point as shown in Theorem 5.2. Thirdly, even for a strongly convex and Lipschitz gradient function , our homotopy scheme has advantages in sparse optimization as discussed in Section 4.1. Fourthly, (1) is different from the barrier formulation considered in [60], where we do not use any penalty parameter for in (1) as compared to [60]. In addition, [60] is aimed at solving constrained convex optimization problems where the barrier is induced from the feasible set. Finally, for the covariance estimation problem (2), our method shares some similarity with [25, 26, 59, 60], but it is still fundamentally different. While [25, 26] focused on the sparse instance of (2), we consider a more general form in (1) that covers this example as a special case. Our algorithm and its convergence guarantee are completely different and rely on a different approach compared to [25, 26]. It has an iteration-complexity analysis for a genera , while the analysis in [25, 26] rely critically on the special structure of the -norm for .
Paper organization
In Section 2, we recall some preliminary results used in this paper. Section 3 presents a new parameterization for the optimality condition of (1) and a conceptual three-stage proximal variable-metric framework, Algorithm 1, for solving (1). Section 4 analyzes the convergence of Algorithm 1 under three sets of assumptions. Section 5 proposes some procedures to find an appropriate starting point for Algorithm 1. Section 6 proposes a primal-dual-primal method for solving a nonsmooth subclass of (1), and its application to the covariance estimation problem (2). Section 7 provides several numerical experiments to illustrate our theoretical results. All technical proofs are deferred to the appendices.
2 Preliminaries: Scaled proximal operators and optimality condition
In this section, we recall some basic concepts which will be used in the sequel.
2.1 Basic notation and concepts
We work on the vector space equipped with the standard inner product and the corresponding Euclidean norm . We use to denote the set of all symmetric positive definite matrices in . For a given , we use to denote the weighted norm. The corresponding dual norm is .
For a subset , denotes the interior of , and denotes its boundary. Let be a convex function. As usual, denotes the effective domain of , and denotes its subdifferential [51]. If is twice differentiable, then and denote its gradient and Hessian, respectively. For a given twice differentiable convex function , if such that , we define a local norm and its dual norm associated with as in [44]:
[TABLE]
for any . Clearly, . This is the weighted norm with . For a real number , we use to denote the integer less than or equal to . We use for any real number .
Given a nonempty convex set in , and a point , the distance from to corresponding to the weighted-norm is defined as . For a given convex function , we say that is -strongly convex if remains convex, where is called the strong convexity parameter of . We say that is -smooth (i.e., Lipschitz gradient continuous) if is differentiable on and is Lipschitz continuous with a Lipschitz constant , i.e., for all . We denote the class of -strongly convex and -smooth functions by . A convex function is Lipschitz continuous on with a Lipschitz constant if for all .
2.2 Scaled proximal operators
Let be a proper, closed, and convex function, and . We define the following scaled proximal operator [17] of :
[TABLE]
The optimality condition of this minimization problem is , which can be written as , or . When , where and is the identity matrix, becomes a classical proximal operator [3, 49], and is usually denoted by . An important property of is its nonexpansiveness
[TABLE]
for any in . We say that is proximally tractable if can be efficiently evaluated, e.g., in a closed form or by a low-order polynomial time algorithm (e.g., ). Computational methods for evaluating this scaled proximal operator and its classical forms can be easily found in the literature including [17, 49].
2.3 Lipschitz continuity w.r.t. local norm
Let and its dual norm be defined by a strictly smooth convex function , and be a proper, closed, and convex function.
Definition 2.1*.*
We say that is -Lipschitz continuous w.r.t. with a Lipschitz constant , if for any , we have .
As a concrete example, assuming that is a strongly convex quadratic function, then is Lipschitz continuous in -norm if and only if is Lipschitz continuous w.r.t. the local norm defined by .
Lemma 2.2*.*
A proper, closed, and convex function is -Lipschitz continuous w.r.t. with a Lipschitz constant on if and only if for any and .
In particular, if is strongly convex with a strong convexity parameter and is Lipschitz continuous in -norm with a Lipschitz constant i.e. for any y,z\in\mathrm{dom}(f)\cap\mathrm{dom}(g)$$), then is -Lipschitz continuous w.r.t. on with the Lipschitz constant . However, the converse statement does not hold in general.
Proof 2.3*.*
For any and , we have
[TABLE]
Conversely, by convexity of , we have . By exchanging and , we finally get .
If is strongly convex with a strong convexity parameter , then we have for any . Therefore, we have . This shows that . Hence, is -Lipschitz continuous w.r.t. with .
As an example, if is contained in an affine subspace defined by , and is uniformly positive definite on , then is -Lipschitz and Lemma 2.2 still holds, but is still non-strongly convex. In this case, we say that is restricted strongly convex. Lemma 2.2 shows that the Lipschitz continuity w.r.t. the local norm of is weaker than the global Lipschitz continuity of since we only require the condition to hold on .
2.4 Fundamental assumption and optimality condition
Throughout this paper, we rely on the following fundamental assumption:
Assumption 1**.**
*. The solution set of (1) is nonempty. *
Assumption 1 is a standard one that is required in any solution method. Throughout this paper, we assume that Assumption 1 holds without recalling it.
The optimality condition associated with (1) becomes
[TABLE]
This condition is necessary and sufficient for to be an optimal solution of (1). For any , we can reformulate this optimality condition as a fixed-point condition:
[TABLE]
This formulation shows that is a fixed point of .
3 A Conceptual Homotopy Proximal Variable-Metric Framework
In this section, we introduce a novel parameterization of the optimality condition (6) and propose a conceptual framework for designing homotopy proximal variable-metric methods for solving (1).
3.1 Parametrization of the optimality condition
Given , we compute a subgradient . Then, we parameterize as follows:
[TABLE]
where . Clearly, , , and . In addition, for any . Note that if we can choose such that , then in (7) reduces to .
Next, we consider the following composite convex optimization problem derived from (1):
[TABLE]
This problem is similar to (1) and can be considered as a parametric perturbation instance of (1). The optimality condition of this problem is given by
[TABLE]
which is necessary and sufficient for to be an optimal solution of (8). We call this condition a parametric optimality condition of (1).
From the optimality condition (9), we can show that
- •
If , then (9) becomes , which is exactly the original optimality condition (6) of (1). Hence, is an exact optimal solution of (1).
- •
If , then (9) reduces to . Therefore, we can choose , the initial point, as an optimal solution of (8) at .
Our main idea is to start from a small value and follow a homotopy path on to find an approximate solution of at . As we will show later, we do not start from , but from a sufficiently small value .
Note that, when , we can write (9) as
[TABLE]
If , an -regularizer, for a given regularization parameter , then when is close to zero, the weight on is large and the solution of (8) is expected to be very sparse. This potentially can reduce the computational complexity of the underlying optimization method by working on sparse vectors or matrices. This property of the regularizer is also expected in other applications such as low-rank and group sparsity models.
Remark 3.1*.*
The formulation (9) is new and it does not reduce to any existing homotopy formulation including [41, Formula 4.2.26] to the best of our knowledge. This formulation is expected to lead to more efficient homotopy-type algorithms for solving sparse and low-rank convex optimization as explained above.
3.2 A fixed-point interpretation of the parametric optimality condition
Recall the optimality condition (9) given as . By using the scaled proximal operator , we can reformulate (9) into a fixed-point problem:
[TABLE]
for any . Let us define the following mapping for any :
[TABLE]
Clearly, (10) is equivalent to . We call the scaled generalized gradient mapping of the parametric problem (8). The most common case is as mentioned above for some . Then, reduces to the standard generalized gradient mapping [41].
3.3 Conceptual framework of homotopy proximal variable-metric methods
We first describe our conceptual three-stage proximal variable-metric algorithm as in Algorithm 1.
We will provide the details of each stage in the sequel based on some appropriate assumptions for (1). The main step of Algorithm 1 is (12), where we need to evaluate the scaled proximal operator . Depending on the choice of the variable matrix , we obtain different methods:
- •
If is diagonal, then we obtain a homotopy proximal gradient method.
- •
If approximates , then we obtain a homotopy proximal quasi-Newton method.
- •
If , then we obtain a homotopy proximal Newton method.
The choice of an initial point , the initial value of the parameter , the update rule of , and the approximation rule of (12) in Algorithm 1 will be specified in the sequel.
3.4 Inexact proximal Newton scheme
Let . The exact evaluation of the scaled proximal operator in (12) is equivalent to solving the following convex subproblem:
[TABLE]
where \nabla{f}_{k+1}(x^{k}):=f(x^{k})-\big{(}\tfrac{1}{\tau_{k+1}}-1\big{)}\xi^{0}. In this case, we can write as
[TABLE]
the exact solution of (13).
When is nontrivial (e.g., not a linear function), we can only approximate the true solution of (13) by an approximation such that
[TABLE]
Here the approximation “” is defined explicitly next in Definition 3.2.
Definition 3.2*.*
Let be the exact solution of (13), and be a given accuracy. We say that is a -approximate solution to , denoted by as in (14), if
[TABLE]
Using this definition, we have the following result, see [60, Lemma 3.2.].
Fact 1*.*
. Consequently, combining this inequality and (15), we can show that if (15) holds, then
[TABLE]
The condition (15) can be guaranteed by using several optimization methods in the literature such as accelerated proximal-gradient [4, 41, 52], ADMM [8], or semi-smooth Newton-CG augmented Lagrangian methods [67]. In Subsection 6.3, we approximately compute (13) via solving its dual.
We define the following local distances to measure the distance of approximations and to the true solution of the parameterized problem (9):
[TABLE]
These metrics will be used to analyze the convergence of our methods.
4 Convergence and iteration-complexity analysis
We analyze the convergence and iteration-complexity of Algorithm 1 for solving (1) under three different subclasses of and .
4.1 Linear convergence for the smooth and strongly convex case
The first class of models is when and in (1) satisfies the following assumption.
Assumption 2**.**
*Assume that is -strongly convex and -smooth. The function is -Lipschitz continuous on . *
Under Assumption 2, Algorithm 1 only has Stage 2 and we do not need to perform Stage 1 and Stage 3 of Algorithm 1. We can start from any starting point . We show that Algorithm 1 achieves a global linear convergence rate. This result is stated in the following theorem, whose proof can be found in Apppendix B.1.
Theorem 4.1*.*
Under Assumption 2, let and be two constants such that and . For any given and , we define
[TABLE]
Let be the sequence generated by the exact scheme (12) in Algorithm 1, where is chosen such that and is updated by
[TABLE]
Then, for any , we have and . Consequently, both sequences and globally converge to zero at a linear rate.
The sequence also satisfies , where . Hence, globally converges to a solution of (1) at a linear rate.
Let us make some remarks on the result of Theorem 4.1. First, the condition is equivalent to . If we choose , then we have which leads to . In this case, if we define , then (12) becomes
[TABLE]
which reduces to a homotopy proximal gradient method.
To optimize the contraction factor, we need to minimize over . This gives us showing that . Hence, we must choose , and we obtain . Another simple choice of is H_{k}=\big{(}\frac{L+m}{2}\big{)}\mathbb{I}.
Next, note that the convergence rate of in (12) is slower than in the standard proximal variable-metric method. Its contraction factor is defined in (18). However, (12) possesses some computational advantages that the standard proximal variable-metric method does not have as we will discuss in Section 7.
The linear convergence rate under Assumption 2 is known from the literature for both gradient and Newton-type methods. Nevertheless, our method is new, which works on the parameterized function instead of . Our method also allows us to flexibly choose the variable matrix as long as it satisfies the condition of Theorem 4.1. Another appropriate choice of is a rank-one update as proposed in [6].
Remark 4.2* (Nesterov’s accelerated variant).*
Note that we can develop Nesterov’s accelerated variant for (12) under Assumption 2. In this case, the convergence factor in Theorem 4.1 will be improved from to . However, we skip this modification in this paper.
4.2 Linear convergence for self-concordant function without barrier
We consider the second case where and g satisfy the following assumption.
Assumption 3**.**
*The function in (1) is standard self-concordant see Definition A.1. The function is -Lipschitz continuous w.r.t. the local norm defined by with a Lipschitz constant in . *
Note that, in Assumption 3, we only require to be self-concordant, but not necessary a self-concordant barrier. The class of self-concordant functions is much larger than the class of self-concordant barriers. As indicated in Proposition A.2, any generalized self-concordant and strongly convex function is self-concordant. In particular, it covers a few representative applications presented in the introduction. For other examples, we refer the reader to [56, 61].
Under Assumption 3, Algorithm 1 has two stages: Stage 1 finds an initial point, and Stage 2 performs a homotopy scheme. We can skip Stage 3. For any initial value of , let us choose that solves the following inequation:
[TABLE]
Note that there always exists that solves (20). Theorem 4.3 below states the convergence of (14) under the Assumption 3. Its proof can be found in Appendix B.2.
Theorem 4.3*.*
Suppose that Assumption 3 holds for (1). Let and be two constants satisfying (20) and be the sequence generated by (14). Moreover, and are chosen such that with . Let us choose , and update the parameter as
[TABLE]
Then, for and . Therefore, the sequences and both globally converge to zero at a linear rate.
Moreover, there exists such that for all . Hence, the sequence also globally converges to zero at a linear rate.
The following result is a direct consequence of Theorem 4.3 and Lemma 2.2 when is -Lipschitz continuous in -norm, and is strongly convex and generalized self-concordant.
Corollary 4.4*.*
Assume that is generalized self-concordant as defined in Definition A.1 and is Lipschitz continuous with a Lipschitz constant in -norm instead of being -Lipschitz continuous w.r.t. . Assume additionally that is strongly convex with a strong convexity parameter . Then, the conclusion of Theorem 4.3 still holds with .
Remark 4.5*.*
(a) The -Lipschitz continuity of w.r.t. a local norm in Assumption 3 can be replaced by assuming that for some for any and . By Lemma 2.2, we can easily see that this condition is weaker than the -Lipschitz continuity of w.r.t. . For example, if , and , then for all . In this case, the conclusions of Theorem 4.3 still hold.
(b) Observe that in (21), if the rate of change from to is slow so that , then the rate of increment from to will become faster when increases.
4.3 Linear convergence under the self-concordant barrier of
When is a self-concordant barrier, we use a different analysis, and no longer require to be Lipschitz continuous as stated in the following assumption:
Assumption 4**.**
*The function is a -self-concordant barrier as defined in Definition A.1, and is proper, closed, and convex. For a given , either the analytic center of defined by (50) on the interior of the level set exists or . *
Under Assumption 4, Algorithm 1 also requires Stage 1 and Stage 2, while we can skip Stage 3. For any , we choose such that
[TABLE]
Here, ( is defined in Appendix A) if defined by (50) exists, and , otherwise. These two constants and always exist, and . The following theorem states the convergence of Algorithm 1 under the self-concordant barrier assumption on , whose proof can be found in Appendix B.3.
Theorem 4.6*.*
Let us choose such that (22) holds. Let be the sequence generated by Algorithm 1 using (14) with such that with . Let us choose , and update as
[TABLE]
Then, and . Therefore, the sequences and both globally converge to zero at a linear rate.
Moreover, if we choose such that (22) holds and always exist, then . Hence, the sequence converges to an optimal solution of (1) at a linear rate.
Theorem 4.6 shows the global linear convergence of our inexact proximal-Newton method for solving (1) under Assumption 4. Here, the constants and balance between the contraction factor and the step-size of the homotopy parameter . The choice of from (22) is conservative due to several rough estimates in our proof. In practice, can be chosen to be much smaller than one as observed in our numerical experiments.
5 Stage 1: Finding an appropriate initial point
While the variant of Algorithm 1 in Theorem 4.1 can start from any initial point , the variants in Theorem 4.3 and Theorem 4.6 require an appropriate initial point . More precisely, we need to choose an initial value and such that for a given . We consider two cases: is -strongly convex and is non-strongly convex.
5.1 Inexact damped-step proximal-Newton scheme
We can apply the following inexact damped-step proximal-Newton scheme proposed in [60] to find . Let us start from any initial point , compute a subgradient , and update:
[TABLE]
Here, is the accuracy level defined as in Definition 3.2 and we use the “hat” notation for the iterates to distinguish this procedure from Algorithm 1. The following proposition provides an estimation on the number of iterations needed to find the initial point , whose proof can be found in [60, Lemma 4.3.].
Proposition 5.1*.*
Let be generated by (24) with , then after at most
[TABLE]
iterations, we obtain such that , where , and .
Proposition 5.1 suggests that we can perform a finite number of damped-step proximal-Newton scheme (24) to find such that . Hence, is an initial point that satisfies the conditions of Theorem 4.3 and Theorem 4.6.
5.2 Strong convexity of
If is strongly convex with a strong convexity parameter , and is not an optimal solution of (1), then we can choose as
[TABLE]
Here, is the maximum eigenvalue of . We will show in Appendix B.4 that satisfies . Hence, Algorithm 1 can start from an arbitrary point .
5.3 Non-strong convexity of and strong convexity of
We adopt our recent idea in [62] to develop a homotopy scheme to find this initial point in a finite number of iterations.
Starting from any , we consider the following auxiliary optimality condition depending on a new homotopy parameter and a fixed value :
[TABLE]
for any . Clearly, when , (26) reduces to (9) at and . When , (26) becomes , which shows that is a solution of (26). By applying the homotopy method starting from , and decreases to zero, we obtain an approximation to . The main step of this scheme is given as follows:
[TABLE]
where the approximation “” is defined as in Definition 3.2, and is a starting value of . We also use the “hat” notation for the iterates to distinguish this procedure from Algorithm 1.
This scheme is slightly different from (14) with the additional term . The following theorem shows us how to choose and update to guarantee , whose proof is given in Appendix B.5.
Theorem 5.2*.*
Assume that is self-concordant and -strongly convex with . For any given , we defined Let be an arbitrary starting point, , and be chosen such that
[TABLE]
Let be the sequence generated by (27) starting from this and . Support further that is updated by t_{j+1}:=\big{[}t_{j}-\frac{\Theta}{L_{g}(1+\Theta)}\big{]}_{+} and satisfies . Then, after at most iterations with , we have , and .
Theorem 5.2 shows that to find an initial point for Algorithm 1 such that , we only need a finite number of iterations as defined in Theorem 5.2. Moreover, in this case, we can take in Algorithm 1.
5.4 Implementation remarks for Algorithm 1
Theoretically, the variants of Algorithm 1 stated in Theorem 4.3 and Theorem 4.6 require a good starting point such that . To find this point, we can use either (24) or (27). However, since we know that when , , in practice we can choose to be sufficiently small such that , and skip Stage 1.
Practically, we only perform two stages as follows:
- •
Skip Stage 1 and choose sufficiently small such that is small.
- •
In Stage 2, we choose to guarantee that instead of . Then we update from to .
- •
In Stage 3, we fix and perform a couple of iterations to reach .
We only perform Stage 3 if we choose . In this case, we only have . To achieve , we need to perform a few proximal-Newton iterations with fixed .
6 Primal-Dual-Primal Method
Our second idea is a primal-dual-primal approach to solve (1). We propose a primal-dual-primal method which consists of the following steps:
- •
Construct the Fenchel dual problem (29) of (1).
- •
Apply Algorithm 1 to solve the dual problem (29).
- •
Instead of solving the dual subproblem (13), we dualize it to go back to the primal space.
- •
Construct an approximate primal solution of (1) from its dual approximate solution.
We will show in Section 6.6 that this approach is useful for the well-known model (2). Now, we present this method in detail as follows.
6.1 The dual problem
We assume that , where is a proper, closed, and convex function from , and is a linear operator such that . The dual problem of (1) in this case becomes
[TABLE]
where and are the Fenchel conjugates of and , respectively.
Let us define . Then, we can compute the gradient and Hessian of as
[TABLE]
We impose the following assumption.
Assumption 5**.**
*The function in (1) is self-concordant as defined in Definition A.1, and , where is a proper, closed, and convex function and is a linear operator such that . In addition, has full-row rank. *
Under Assumption 5, the function is still a self-concordant function as stated in [44, Theorem 2.4.1] for defined as
[TABLE]
We define the local norm with respect to as and its dual norm . The optimality condition of the dual problem (29) becomes
[TABLE]
which is necessary and sufficient for to be an optimal solution of (29) if .
Let be an optimal solution of (29). Then, from (31), if we define
[TABLE]
then , which leads to . On the other hand, we have , which leads to . Combining both expressions, we have . Therefore, given by (32) is an exact solution of the primal problem (1).
6.2 The homotopy proximal Newton method methods for the dual problem
To fulfill the assumptions of Theorems 4.3 and 4.6, we assume that one of the following conditions holds:
- •
satisfied Assumption 5 and is -Lipschitz continuous w.r.t. defined by .
- •
satisfied Assumption 5 and is -self-concordant barrier.
One can show that is -Lipschitz continuous if is bounded w.r.t. the local norm defined by , i.e. there exists such that for any . Since the dual problem (29) has the same property as the primal one (1) under the above assumptions, let us apply Algorithm 1 with to solve this problem, which leads to
[TABLE]
where \nabla{\varphi_{\tau_{k+1}}}(y^{k}):=\nabla{\varphi}(y^{k})-\big{(}\frac{1}{\tau_{k+1}}-1\big{)}\xi^{0} with . Here, is an approximation to the true solution as defined in Definition 3.2, where is given by
[TABLE]
and both the gradient mapping and the Hessian mapping of are given in (30), respectively. This problem in general does not have a closed form solution. But observe that (34) is a convex composite quadratic programming problem for which highly advanced algorithms such as the semismooth Newton augmented Lagrangian method developed in [37, 69] can be designed to solve it efficiently, as we shall demonstrate later in the numerical experiments.
6.3 The dualization of the subproblem (34)
Instead of solving the dual subproblem (34) directly, we dualize it to obtain the following subproblem in the primal space of :
[TABLE]
where
[TABLE]
Clearly, this problem is again a composite strongly convex quadratic program of the same form as (34), but in the primal space of . Specially, if , the identity matrix, then (35) is in the primal space of as in (13).
6.4 Solution reconstruction for (34)
Recall that denotes the exact solution of (35), then we can construct
[TABLE]
as an exact solution of (34).
Assume that we can only solve (35) up to a given accuracy . In this case, we say that is a -approximate solution to of (35) if for any such that , we have
[TABLE]
To guarantee (37), we can apply inexact first-order methods to solve (35), see, e.g., in [52, 63].
If satisfies (37), then we can construct an approximate solution to as
[TABLE]
The following lemma shows a relation between of (35) and the approximate solution of (34), whose proof is given in Appendix B.6.
Lemma 6.1*.*
Let be a -approximate solution to of (35) in the sense of (37). Then, constructed by (38) is also a -approximate solution to the true solution of (34) such that .
6.5 Primal solution recovery
Finally, we show how to recover an approximate primal solution of the original problem (1) from its dual approximate solution . Based on (32), we show below that for an approximate solution to , the following point
[TABLE]
is an approximate solution to the true solution of (1) as stated in the following theorem whose proof is given in Appendix B.7. In particular, if is an invertible matrix, then one can show that we can construct an approximate solution to of (1) from .
Theorem 6.2*.*
Let be an exact solution of the dual problem (29). Then
constructed by (32) is an exact solution of (1).
Let be computed by (38) and be given by (39) such that . Then
[TABLE]
Consequently, under the conditions of Theorem 4.3 or Theorem 4.6, the sequence converges linearly to the optimal solution of (1).
Let be an approximate solution of (35). If , then
[TABLE]
Assume that we apply Algorithm 1 to solve the dual problem (29) under the assumptions of Theorem 4.3 or Theorem 4.6 and the choice . If, in addition, is invertible, then is an approximate solution to of (1). Moreover, \big{\{}\|x^{k+1}-x^{\star}\|^{\ast}_{y^{k}}\big{\}} converges linearly to zero.
From Theorem 6.2, we can see that if is invertible, then we can directly use to approximate the solution of (1). Otherwise, we can construct an approximate solution to by using (39), which requires one evaluation of .
6.6 Applications to covariance estimation
In this section, we apply Algorithm 1 and the primal-dual-primal method in Section 6 to solve the regularized covariance estimation problem (2) as in [18] and its least-squares extension in [32].
We recall the primal regularized covariance estimation problem given in (2). Associated with (2), we can also consider its dual form:
[TABLE]
Here, is the Fenchel conjugate of . This problem again has the same form as (1). Instead of solving the primal problem (2), we apply Algorithm 1 to solve the dual problem (42) and reconstruct a solution of (2) from its dual.
6.6.1 The main steps of the algorithm
Given such that , we define . The main step of the algorithm is to solve the following subproblem
[TABLE]
where \widehat{X}_{k}:=X_{k}-\Xi_{k}\equiv X_{k}-\big{(}\tfrac{1}{\tau_{k+1}}-1\big{)}\Xi_{0} for a fixed . As discussed in Section 6, instead of solving (43), we look at its dual form
[TABLE]
where . Once is computed from (44), we can reconstruct as follows:
[TABLE]
and compute an inexact Newton decrement
[TABLE]
Finally, when an -solution of (42) is computed (i.e. ), we can reconstruct an approximate solution of the primal problem (2) by taking . This computation requires the inverse of a symmetric positive definite matrix, which can be done efficiently by Cholesky decomposition. However, as shown in Theorem 6.2, we can use computed by (44) to approximate the true solution . This allows us to avoid the matrix inversion .
6.6.2 The algorithm
Putting together these steps, we obtain a new algorithmic variant for solving (2) as presented in Algorithm 2.
Let us highlight some new features of Algorithm 2 as compared to existing methods in the literature, e.g., [18, 25, 26, 59, 61].
- (a)
Firstly, Algorithm 2 deals with a general regularizer compared to [18, 25, 26]. When is the -norm regularizer, we can apply coordinate descent methods as in [18, 25, 26] for solving (44) to improve its practical performance.
- (b)
Secondly, Algorithm 2 relies on Algorithm 1 to solve the dual problem (42) instead of standard proximal-Newton methods. It has a linear convergence rate compared to the damped-step scheme which only has a sublinear convergence rate as shown in [59, 61].
- (c)
Thirdly, it does not require any linesearch or any additional assumption in our analysis to achieve a linear convergence rate.
- (d)
Fourthly, the whole algorithm does not require any matrix inversion or Cholesky decomposition as long as we can solve the subproblem (44) with a first order method. This is an important feature for designing parallel and distributed variants of Algorithm 2 as compared to [26].
- (e)
Finally, the subproblem (44) works on the original regularizer instead of the dual problem as in [59], which preserves the structure such as sparsity on the iterates as promoted by the regularizer .
7 Numerical experiments
We provide some numerical experiments to illustrate our theoretical development. Our experiments are implemented in Matlab 2018a running on a Dell Optiplex 9010, 3.4 GHz Intel Core i7-3770 with 16GB 1600 MHz DDR3 memory.
7.1 Lipschitz gradient and strongly convex models
Now we evaluate the performance of the homotopy proximal-Newton scheme (12) by applying it to solve the following logistic regression problem with an elastic-net regularizer:
[TABLE]
where and are two regularization parameters, and , is a given dataset. As shown in [70], the elastic-net regularizer helps to remove variable limitation with more freedom than the classical LASSO model, and it can also carter for groups of nonzero variables. Clearly, f(x):=\frac{1}{n}\sum_{i=1}^{n}\log\big{(}1+\exp(-y_{i}a_{i}^{\top}x)\big{)}+{\frac{\mu_{f}}{2}}\|x\|^{2} is -strongly convex, and -Lipschitz gradient continuous with , where Moreover, the function is -Lipschitz continuous with . Hence, Assumption 2 of Theorem 4.1 is satisfied.
We implement Algorithm 1 to solve (47) and compare it with homotopy quasi-Newton variant, standard proximal-gradient scheme [4], and the accelerated proximal-gradient method with line-search and restart [4, 5, 55]. These methods are abbreviated as “HomoPN”, “HomoQuasiPN”, “PG”, and “Ls-Rs-APG”, respectively. We test these algorithms on several binary classification datasets a1a, a9a, w1a, w8a, covtype.binary, news20.binary, rcv1.binary and real-sim from [10], and mnist17 and mnist38 from the mnist dataset where we choose the digits and , and the digits 3 and 8. The details of these dataset set is given in Table 1.
Following [13], we set . The parameter for -regularization is selected to produce about percent of nonzeros coefficients. We should mention here that the subpropblem in the model (47) is an elastic-net regularized least-squares problem. For Algorithm 1 to be numerically efficient, it is crucial for us to solve those subproblems efficiently. Fortunately we can adapt the highly efficient semismooth Newton augmented Lagrangian method in [37] to solve the subproblems (12). We terminate the experiments when the relative gaps are less than a given tolerance , based on the KKT system of (47). Moreover, for PG and Ls-Rs-APG methods, we set the maximum number of iterations at . If any method does not achieve our desired accuracy after at most iterations, we use ”—” to represent the result. Our final results are reported in Table 2, where iter is the number of iterations, time[s] is the computational time in second, and rgap is the relative gap times . We highlight that (47) can be solved effectively by a stochastic method to a modest level of accuracy when the number of data points is large. However, in our experiments, we only focus on relatively moderate datasets as we are interested in solving the problems accurately to evaluate the performance of Algorithm 1, and ignore the comparison with stochastic methods.
Table 2 shows that since is strongly convex and has continuous Lipschitz gradient, the standard proximal-gradient method PG can solve some of the problems slightly better than the homotopy quasi-Newton method HomoQuasiPN, particularly for some of the datasets where the number of features is large (e.g., news20 and real-sim). Note that since the problem (47) is strongly convex, both PG and Ls-Rs-APG have linear convergence rate. Ls-Rs-APG outperforms PG in terms of iteration numbers as stated by the theory. But since we adopt the line search scheme, we found that the times taken by these two methods in solving some of the problems are roughly at the same level. However, HomoQuasiPN is promising when the number of features is moderate but the number of observations is large, for instance the datasets mnist17 and mnist38. The reason is that when the number of observations is large, the cost of computing the for HomoPN will be more expensive than that in HomoQuasiPN. Hence, with warm start, HomoQuasiPN will only need a few more iterations to converge but with lower computing cost in each iteration. In other situations, our HomoPN outperforms all other methods in terms of iteration numbers and computation time. Furthermore, our homotopy methods can often solve the problems more accurately. To conclude, our homotopy methods are highly efficient for solving a varieties of large-scale datasets in logistic regression.
7.2 Self-concordant barrier models
We illustrate the variant of Algorithm 1 in Theorem 4.6 for solving the following constrained convex problem with a self-concordant barrier function arising from D-optimal experimental design, see, e.g. [24, 39]:
[TABLE]
where for are symmetric positive semidefinite matrices. If we define f(x):=-\log\det\Big{(}\sum_{i=1}^{p}x_{i}A_{i}\Big{)}, and , where is the standard simplex in , then we can reformulate (48) into (1), and satisfies the assumptions of Theorem 4.6.
We implement Algorithm 1 to solve (48), and compare it with the interior-point method in [39], where their code is available online at http://www.mypolyuweb.hk/ tkpong/OD_final_codes/. To approximately solve the proximal-Newton subproblem (44), we adapt the semi-smooth Newton-CG augmented Lagrangian method in [37, 69] to solve the composite convex QP problem where the nonsmooth term is given by . We also compare our method with the multiplicative method proposed in [24] and the interior-point method implemented in SDPT3-v.4.0 [57]. We abbreviate these solvers by HomoPN, MUL, IP, and SDPT3, respectively. Unlike other well-established IP solvers, SDPT3 allows us to handle directly the log-determinant functions without reformulation or approximation.
We follow the same procedure as in [39] to generate the data and use the implementation of MUL from [39]. More precisely, we consider the following four design spaces:
[TABLE]
where and .
For each design space, we set for , with for , and for . In this case, the problem dimension is in , and and in . The performance of these four methods on problems of different sizes are reported in Table 3, where (#Iterations) and Time[s] denote the number of iterations and computational time taken, respectively, and is the approximate optimal objective value attained for (48).
In Table 3, the objective value is rounded off to seven significant digits. We can see that our homotopy method, HomoPN outperforms the multiplicative algorithm MUL in terms of computational time, and achieving much smaller objective values in all the instances. Our HomoPN also outperforms the interior point method (IP) and SDPT3 in terms of time and also gives slightly better objective values in most instances.
To see the performance of our method compared to MUL and IP, in the following test, we extend the dimension of datasets in , , and as follows:
[TABLE]
We first run MUL up to iterations, and check whether the HomoPN and IP methods can solve the problem in terms of the function value. We terminate the algorithms when the objective value is less than the value procedured by MUL. When the IP method cannot achieve this objective value, we terminate it and record the time and objective value after at most iterations (this is a large number of iterations for interior point methods). The computation results are presented in Table 4. For these larger dimensional problems, SDPT3 fails to solve them to the required accuracy, and we do not add this method to our experiments.
From Table 4, we can see that our algorithm outperforms both the MUL and IP methods in terms of computation time, while the IP method cannot achieve the desired accuracy when the dimension of the matrix variable is increased. It also illustrates the ability of our method to solve the problems to an intermediate accuracy. The reason why the IP method cannot solve the problem is that when the dimension is large, the Newton system is very ill-condition, and the solver cannot handle the ill-conditioning difficultly satisfactorily.
7.3 Non-Lipschitz gradient models
Next, we illustrate the variant of Algorithm 1 stated in Theorem 4.3 under Assumption 3 through a non-Lipschitz gradient model. We consider the following problem from Poisson regression [27, 29]:
[TABLE]
where , are given, and and are given regularization parameters.
Let . Note that if , then by Proposition A.2 in Appendix A, is -self-concordant with . Moreover, it is also -strongly convex, and is -Lipschitz continuous with . Hence, problem (49) satisfies Assumption 3. Consequently, the results of Theorem 4.3 hold for this problem.
We implement Algorithm 1 to solve (49) and compare it with a quasi-Newton method using BFGS with linesearch scheme and a limited-memory quasi-Newton method (L-BFGS) with both linesearch and restarting scheme. We name these three schemes HomoPN, BFGS, and L-BFGS-LS-R, respectively. Similar to the above example, we adapt the semi-smooth Newton-CG augmented Lagrangian method in [37, 69] to approximately solve the convex composite quadratic programming subproblem (13) in HomoPN.
Note that since the gradient is not Lipschitz continuous, first-order methods such as proximal gradient-type methods are not applicable. Our experiment reveals that first-order methods such as accelerated Barzilai-Borwein step-size proximal gradient algorithms often failed to converge due to the explosion of the estimated Lipschitz constant of .
We test three algorithms on datasets downloaded from the UCI dataset repository [38] and Kaggle (https://www.kaggle.com/datasets). Such datasets are used to predict the number of communications for a social post (news and facebook), the review score of a hotel or wine (vegas and wine), the number of goals scored in a game season or the number of 911 calls in a period of time. The sizes of these datasets are given in Table 5. Since these are raw datasets, to use them in our model (49), we perform a sequence of data-preprocessing routines to make our feature matrices well scaled by using methods such as one-hot encoding and min-max scaling.
We test three algorithms: HomoPN, BFGS, and L-BFGS-LS-R on the datasets in Table 5. The computation results are reported in Table 6, where is the objective value of (49). We can see from the results that while all methods we have tested can solve the problem using real datasets (they all reached almost the same optimal function values), the performance of the two quasi-Newton methods are similar and our homotopy method outperforms them both in terms of computational times and number of iterations. Moreover, based on our observation, when the problem is hard (i.e. the linear system is very ill-conditioned), the number of iterations in the quasi-Newton-type methods increases rapidly. However, our homotopy method is relatively stable in terms of the number of iterations, as was indicated in Section 7.4. To conclude, our method is highly efficient for real datasets.
7.4 The initial point independence of Algorithm 2
Our goal in this example is to observe the independence of performance w.r.t. to the initial point in our inexact primal-dual-primal homotopy PN scheme, i.e. Algorithm 2, compared to standard proximal-Newton methods for sparse optimization. We demonstrate this observation on the sparse inverse covariance estimation problem covered by (2) which perfectly fits our assumptions.
As mentioned, in theory, we have shown that Algorithm 2 can achieve a global linear convergence rate. In practice, however, this rate may still be slow. We instead update the homotopy parameter in Algorithm 1 by a longer step based on a linesearch such that the new iterate remains in . This trick allows us to go faster from to within a few iterations instead of using the worst-case factor. Then, we fix the homotopy parameter and apply a few full-step proximal-Newton iterations to compute the desired solution (Stage 3 of Algorithm 1).
We implement Algorithm 2 and compare it with the primal proximal-Newton method in [61]. We abbreviate these methods by HomoPN and Primal PN, respectively. We use a restarting accelerated proximal-gradient algorithm to approximate the proximal Newton direction. We generate two problem instances using the same procedure as in [36] with and , respectively. In order to obtain a desired sparse solution of (2), we choose . To see the affect of the initial point on the methods we generate two different initial points .
- •
Case 1 (dense): , where is the pseudo-inverse of .
- •
Case 2 (sparse): a diagonal matrix.
Figure 1 shows the convergence of HomoPN and Primal PN for these two cases: the sparse initial point and the dense initial point .
Figure 1 shows that our new algorithmic variant (HomoPN) depends weakly on the initial point , while the primal proximal Newton method (Primal PN) in [61] strongly depends on the choice of . The convergence rate of HomoPN is divided into two parts: the homotopy part with a linear rate, and the refining part with a quadratic rate. Figure 1 shows the quadratic convergence on the relative objective residual of both algorithms when the iterates approach the optimal solution, but Primal PN (dense) requires a large number of iterations to reach its quadratic convergence region.
Acknowledgments
This work is supported in part by the NSF grant, no. DMS-16-2044 (USA).
Appendix A Self-concordant and generalized self-concordant functions
Our central concept is the self-concordance and self-concordant barrier introduced by Nesterov and Nemirovskii [41, 44], and extensions to generalized self-concordance studied in [56].
For a three-time continuously differentiable and convex function , let denote the third-order derivative along the direction . We denote this class of functions by .
Definition A.1*.*
For in , we say that:
is -self-concordant ($$M_{f}\geq 0$$) if for any and , we have
[TABLE]
If , then we say that is standard self-concordant.
is -generalized self-concordant if for any and , we have
[TABLE]
where if or , then we use the convention .
is a -self-concordant barrier if is standard self-concordant with , tends to as approaches the boundary of , and
[TABLE]
is a -self-concordant logarithmically homogeneous barrier function of if it is self-concordant barrier and for all and .
It is clear that affine and convex quadratic functions are standard self-concordant but not self-concordant barrier. The logistic, Poisson, and DWD models discussed in the introduction are generalized self-concordant with , but not self-concordant. Self-concordant barriers are often associated with convex sets such as cones and convex bodies. Several simple sets are equipped with a self-concordant barrier. For instance, is a -self-concordant barrier of , is a -self-concordant barrier of , and is a -self-concordant barrier of the Lorentz cone . The relation between self-concordant and generalized self-concordant is stated in the following proposition.
Proposition A.2*.*
[56] Let be an -generalized self-concordant and as defined in Definition A.1. Then, if is strongly convex with a strong convexity parameter , then is -self-concordant with .
Proposition A.2 shows that the regularized logistic, Poisson, and DWD models in the introduction become self-concordant due to the aid of a regularization term .
Let be a -self-concordant barrier of . Then
[TABLE]
is referred to as the analytical center of . If is bounded, then exists and is unique. Otherwise, we often add an artificial bound or consider on a sublevel set of in (1) to guarantee the existence of . Let be defined for a general self-concordant barrier, and be defined for a self-concordant logarithmically homogeneous barrier. Then, and for and if exists.
Let be a proper, closed, pointed, and convex cone. If is endowed with a -self-concordant logarithmically homogeneous barrier function , then its Legendre transformation [44]:
[TABLE]
is also a -self-concordant logarithmically homogeneous barrier of the anti-dual cone of . For instance, if , then (self-dual cone). A barrier function of is . Hence, is a barrier function of .
Appendix B Convergence and iteration-complexity analysis of Algorithm 1
We break up our analysis into several lemmas and theorems. The following lemmas provide several key estimates for our proofs.
Lemma B.1*.*
Let be given, , and be the Lipschitz constant of . Let and be two solutions of (9) at and , respectively. Then the following results hold.
If is -strongly convex, then we have
[TABLE]
If is self-concordant, then we have
[TABLE]
Let and be two solutions of (26) at and , respectively. If is self-concordant, then
[TABLE]
Let be a solution of either (9) or (26). If is self-concordant, and and are chosen such that , then .
If is -strongly convex, and is a solution of (9), then
[TABLE]
Proof B.2*.*
(a) From (9), we have . Using the monotonicity of and , we have , which implies
[TABLE]
If is -strongly convex, then using the Cauchy-Schwarz inequality, the last inequality leads to
[TABLE]
which can be simplified as . This is exactly the first estimate of (51).
Using (9) again with we have . Combining this expression and , and using the monotonicity of , we have
[TABLE]
Rearranging this inequality, we obtain
[TABLE]
If is -strongly convex, then we have . Combining this estimate with the last inequality, and then using the Cauchy-Schwarz inequality, we obtain the first inequality of the second line of (51).
Using again (9), we have . Since is -Lipschitz continuous, the last expression leads to , which implies (51).
(b) If is self-concordant, then we have
[TABLE]
With a similar proof as in (a), we have
[TABLE]
which is the first line of (52). The second line of (52) is proved similarly as that of (51) and using the fact that .
(c) The proof of (53) is very similar to the proof of (52) and we omit it here.
(d) Note that if is self-concordant, then we have due to [41, Theorem 4.1.5] as long as . If , then the last inequality implies that , which proves (d).
(e) By the choice of , we have . From (9), we also have
[TABLE]
Using the -strong monotinicity of and the monotonicity of , we have
[TABLE]
By the Cauchy-Schwarz inequality, we obtain , which is the desired result in (54).
Lemma B.3*.*
Let and be two given constants. Let be a given sequence such that for all . Then
[TABLE]
as long as the denominators are well-defined.
Proof B.4*.*
Define . Then, is equivalent to . By induction, we have . Hence, we can show that . This estimate leads to
[TABLE]
which proves (55).
Lemma B.5*.*
Assume that is -strongly convex and -smooth. Suppose are given parameters such that . Let be the sequence generated by (12) using such that . Then
[TABLE]
Proof B.6*.*
Using , , and in (10), we get x^{\ast}_{\tau_{k+1}}=\mathrm{prox}_{\frac{1}{\tau_{k+1}}g}^{H_{k}}\big{(}x^{\ast}_{\tau_{k+1}}-H_{k}^{-1}\nabla{f_{\tau_{k+1}}}(x^{\ast}_{\tau_{k+1}})\big{)}. Combining this expression and (12), then using the non-expansiveness of in (5), we can derive that
[TABLE]
Next, from the fact that and the strong convexity and smoothness of , we have
[TABLE]
Substituting these estimates into (57), we get
[TABLE]
Using again , the last inequality leads to
[TABLE]
This estimate can be simplified as in (56) with \omega:=\frac{1}{m}\big{(}(L-2\mu_{f})m+L_{f}^{2}\big{)}^{1/2}<1.
Lemma B.7*.*
Assume that is self-concordant. Let be generated by the inexact scheme (14) up to a given accuracy . Let and be defined by (17). If , then we have
[TABLE]
Proof B.8*.*
Note that . Using (10) with and , we have
[TABLE]
Define and . Then, we can rewrite (13) as
[TABLE]
Using (5), we can derive from (59) and (60) that
[TABLE]
First, we estimate the following term of (B.8) using definition of and as
[TABLE]
By the self-concordance of , with the same proof as in [41, Theorem 4.1.14], we can derive that
[TABLE]
Next, using [41, Corollary 4.1.4], we estimate as follows:
[TABLE]
provided that . Combining these two estimates, we can derive from (B.8) that
[TABLE]
Note that by the triangle inequality. Using this in the last inequality, and rearranging the result, we obtain
[TABLE]
Moreover, using the self-concordance of in [41, Theorem 4.1.5], we can also derive that
[TABLE]
Combining the last two inequalities and using the definition of , , and , we obtain
[TABLE]
which is exactly (58). Here, the right-hand side is well-defined since .
B.1 Proof of Theorem 4.1: Global linear convergence for smooth and strongly convex function
From (56) of Lemma B.5, we can write
[TABLE]
Using the estimate (51) of Lemma B.1 with and , we obtain
[TABLE]
Combining these two estimates and using the triangle inequality, we can derive that
[TABLE]
where , and . In addition, by using (51) of Lemma B.1 with , we have
[TABLE]
Now, let us assume that for some and is given in (63). In order to get , from (62), we impose the following condition:
[TABLE]
Since and , this inequality is equivalent to
[TABLE]
Define . Then, the last inequality leads to
[TABLE]
Using Lemma B.3 with and , we get
[TABLE]
Let us choose and such that . Using the formula of above, this condition is equivalent to
[TABLE]
Let us fix . Then (65) shows that we can choose as shown in (18) to guarantee (65), where .
With the choice of and , we have and . If we update as shown in (19), then the condition (64) holds. Moreover, , where the right-hand side converges to zero as tends to .
Finally, since , by the triangle inequality, we have
[TABLE]
Hence, we obtain with . This is exactly the last conclusion of Theorem 4.1.
B.2 Proof of Theorem 4.3: Linear convergence under Assumption 3
Let and be defined by (17) and let be the local distance between the true solutions and of (9). We divide the proof of Theorem 4.3 into the following steps:
Step 1: Upper bound on : Recall (58) from Lemma B.7 as follows:
[TABLE]
By the triangle inequality and [41, Theorem 4.1.5], we can also derive that
[TABLE]
Now, consider the function on . It is straightforward to numerically check that . Hence, we can overestimate (66) as follows:
[TABLE]
Using the fact that above, we overestimate again this inequality as
[TABLE]
where we require .
Now, if we impose , then implies that . Therefore, if we choose , then we can further simplify the above inequality to get
[TABLE]
as long as and .
Step 2: Upper bound on : Let us assume that for some . In order to guarantee , using the last inequality (67), we have to impose the following condition
[TABLE]
Provided that , the previous condition is equivalent to
[TABLE]
where .
Step 3: Update rule of : Using (52) with and , we obtain
[TABLE]
If we update as (21), then we can see that
[TABLE]
This guarantees (68).
Step 4: Conditions on and : Define and choose such that
[TABLE]
Note that the above condition holds if
[TABLE]
In turn, the condition (72) holds if we choose such that . Using the explicit expression of , we obtain that
[TABLE]
which is exactly the condition (20). It is not hard to show that this inequality always has a solution . Moreover, since , from (68), we also require for all . It is easy to check numerically that this condition is always satisfied for .
Step 5: Bound on : Next, using (71), the definition of , and Lemma B.3 with and , similar to the proof of Theorem 4.1, we can show that
[TABLE]
Hence 0\leq 1-\tau_{k}\leq\big{(}\frac{1-\tau_{0}}{\tau_{0}}\big{)}\sqrt{\sigma}^{k}.
Step 6: Linear convergence rate of : Finally, note that and , we can show from (69) and (73) that
[TABLE]
Since in (72) we have chosen such that , this inequality implies that
[TABLE]
where . Using the triangle inequality, the self-concordance of , and (74), we can derive that
[TABLE]
where . This inequality shows that converges to a solution of (1) at a linear rate.
B.3 Proof of Theorem 4.6: Linear convergence
when is a self-concordant barrier
Let as defined in Theorem 4.1. Using (52) with and , we get
[TABLE]
Since and due to the self-concordant barrier property of , the last inequality leads to
[TABLE]
Since we want to guarantee for and as in Theorem 4.3, similar to the proof of Theorem 4.3, we can choose
[TABLE]
where . Here, for if .
Now, from the update rule (23) of Theorem 4.6, we have
[TABLE]
This inequality shows that (76) automatically holds.
Let . It is easily to show that , where . By induction, we get . Using an elementary inequality , we have
[TABLE]
Hence, for any , if we choose such that \sigma\geq\big{(}1-\frac{\tau_{0}C_{0}}{(1-\tau_{0})(1+C_{0})(\sqrt{\nu_{f}}+\bar{c}_{0})}\big{)}^{2}, then we have .
Finally, note that , we can show from (75) that
[TABLE]
This shows that
[TABLE]
where . Next, using the triangle inequality, the self-concordance of , and (77), we can derive
[TABLE]
This inequality shows that converges to a solution of (1) at a linear rate.
B.4 Proof of (25): The choice of
From (54), we have . Let be the maximum eigenvalue of . Hence, if we assume that
[TABLE]
then we have
[TABLE]
Since
[TABLE]
as long as , to guarantee that , we impose . The last condition is equivalent to , which shows that
[TABLE]
This condition is exactly (25).
B.5 Proof of Theorem 5.2: Finding an initial point
Let and Assume that , from (67), to guarantee , we impose the condition
[TABLE]
This condition holds if . Since , we have . Using (53) with and , we obtain that
[TABLE]
Since is -strongly convex, we have . Hence, we get .
To guarantee this condition, we impose , where . This allows us to update to . If we start from , then we have . Hence, if . This shows that after iterations, we obtain such that .
Finally, we show how to choose such that . Using (53) with , if , then we have
[TABLE]
By using Lemma B.1(e), if , then we have . To guarantee that , we require . Combing this estimate and (78), we can show that if . Therefore, we can choose
[TABLE]
to guarantee that .
B.6 Proof of Lemma 6.1: Approximate solution of the dual problem
Define the quadratic function
[TABLE]
It is easy to show that
[TABLE]
By using (37) and (38), we have , which implies . Hence, by the convexity of , we have \frac{1}{\tau_{k+1}}\big{(}\psi^{\ast}(y^{k+1})-\psi^{\ast}(\bar{y}^{k+1})\big{)}\leq\langle z^{k+1},y^{k+1}-\bar{y}^{k+1}\rangle. Using this inequality and (80), if we define , then we have
[TABLE]
Next, using (38), we have for . Substituting this expression into the last inequality, we obtain
[TABLE]
The last inequality implies that if , then . This completes the proof.
B.7 Proof of Theorem 6.2: An approximate solution for of (1)
The statement (a) is trivial. We now prove (b) and (c) as follows.
(b) From (32), (39), and the definition of in (30), we can show that
[TABLE]
Let . By the mean-value theorem and [41, Corollary 4.1.4], we can derive from the above expression that
[TABLE]
Here, we note that . This inequality leads to (40) provided that .
Since we apply (33) to solve the dual problem (29), by Theorem 4.3 or Theorem 4.6, the sequence satisfies for given constants and . Combining this relation and (40), we can show that converges linearly to an optimal solution of (1).
(c) First, since \nabla{\varphi}_{\tau}(y)=D\nabla{f^{\ast}}(-D^{\top}y)-\big{(}\frac{1}{\tau}-1\big{)}\xi^{0}, using (32) we can write
[TABLE]
Next, from (38) we can write
[TABLE]
Combining these expressions, we can further estimate
[TABLE]
provided that . Here, the last inequality follows from the self-concordance of with similar proof as [62, Theorem 1]. This proves (41).
Since we apply (33) to solve the dual problem (29), by Theorem 4.3 or Theorem 4.6, we have and for some constants and . In addition, for some constant . Using these bounds and in (41), we can show that
[TABLE]
Under the conditions of Theorem 4.3 or Theorem 4.6, we have for some and for , then we can easily show that
[TABLE]
provided that . If is invertible, then we can show that is an approximate solution to of (1). Moreover, converges to zero at a linear rate.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Z. Allen-Zhu , Katyusha: The first direct acceleration of stochastic gradient methods , Proceedings of The ACM SIGACT Symposium on Theory of Computing (STOC), (2017), pp. 1200–1205. Montreal, Canada.
- 2[2] O. Banerjee, L. El Ghaoui, and A. d’Aspremont , Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data , J. Mach. Learn. Res., 9 (2008), pp. 485–516.
- 3[3] H. H. Bauschke and P. Combettes , Convex analysis and monotone operators theory in Hilbert spaces , Springer-Verlag, 2nd ed., 2017.
- 4[4] A. Beck and M. Teboulle , A fast iterative shrinkage-thresholding agorithm for linear inverse problems , SIAM J. Imaging Sci., 2 (2009), pp. 183–202.
- 5[5] S. Becker, E. J. Candès, and M. Grant , Templates for convex cone problems with applications to sparse signal recovery , Math. Program. Compt., 3 (2011), pp. 165–218.
- 6[6] S. Becker and M. Fadili , A quasi-Newton proximal splitting method , Proceedings of Neutral Information Processing Systems Foundation (NIPS), 2012.
- 7[7] D. Bertsekas , Incremental gradient, subgradient, and proximal methods for convex optimization: A survey , Optimization for Machine Learning, (2011), pp. 1–38.
- 8[8] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein , Distributed optimization and statistical learning via the alternating direction method of multipliers , Foundations and Trends in Machine Learning, 3 (2011), pp. 1–122.
