A subgradient method with constant step-size for $\ell_1$-composite optimization
Alessandro Scagliotti, Piero Colli Franzone

Technical University of Munich (TUM) & Munich Center for Machine Learning (MCML), Germany
Dipartimento di Matematica, Università di Pavia, Italy
Abstract.
Subgradient methods are the natural extension to the non-smooth case of the classical gradient descent for regular convex optimization problems. However, in general, they are characterized by slow convergence rates, and they require decreasing step-sizes to converge. In this paper we propose a subgradient method with constant step-size for composite convex objectives with $\ell_1$-regularization. If the smooth term is strongly convex, we can establish a linear convergence result for the function values. This fact relies on an accurate choice of the element of the subdifferential used for the update, and on proper actions adopted when non-differentiability regions are crossed. Then, we propose an accelerated version of the algorithm, based on conservative inertial dynamics and on an adaptive restart strategy, that is guaranteed to achieve a linear convergence rate in the strongly convex case. Finally, we test the performance of our algorithms on some strongly and non-strongly convex examples.
Keywords: convex optimization, $\ell_1$-regularization, subgradient method, inertial acceleration, restart strategies.
Introduction
In this paper we deal with convex composite optimization, i.e., we consider objective functions of the form
[TABLE]
where is -regular with Lipschitz-continuous gradient, and is a non-smooth convex function. We recall that the concept of composite function was introduced by Nesterov in [14], and it usually denotes the splitting (0.1) in the case that the non-regular term is simple. In this framework, possible examples of simple functions include, e.g., the indicator of a closed convex set, or the supremum of a finite family of linear functions. The problem of minimizing such composite functions can be effectively addressed by means of forward-backward methods (see, e.g., [7]), and their accelerated versions [4]. In this regard, we report the recent contribution [20], where it is considered an accelerated method that achieves linear convergence when in (0.1) are strongly convex.
The aim of this paper is to develop a convergent subgradient method with constant step-size for the minimization of particular instances of (0.1). The subgradient method was first introduced in [24] and, given an initial guess , the algorithm produces a sequence with update rule
[TABLE]
where , i.e., it is an element taken from the subdifferential of the objective at the point , and denotes the step-size. If we set , we can equivalently rephrase (0.2) as
[TABLE]
where represents the step-length at the -th iteration. It is possible to deduce the convergence as soon as satisfies and (see [25, Chapter 2]). In [19, Theorem 5.2] a construction for is proposed that achieves as when the value is known a priori. We stress that, in the results mentioned above, the vector can be any element of . If we now consider constant step-sizes, i.e., for every , in general we cannot expect the convergence of the iterates of (0.2) to a minimizer. For instance, given the one-dimensional function , for every choice , if the initial guess , then the sequence produced by (0.2) oscillates and it remains well-separated from $0$. From this example it is clear that, in order to work out a convergent subgradient method with constant step-size, it is crucial to identify the regions where the objective is non-differentiable, and to take proper actions when the sequence crosses them. Moreover, in our analysis a role of primary importance is played by the choice of the element used for the iteration.
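The oscillation described above can be reproduced with a few lines of code. The sketch below is only an illustration under stated assumptions: the objective is taken to be $f(x) = |x|$, and the step-size `1.0`, the initial guess `0.25`, and the diminishing rule `1/(k+1)` are hypothetical choices that do not come from the paper.

```python
def sign(t):
    """Sign of a real number; returns 0 at t = 0 (a valid subgradient of |x|)."""
    return (t > 0) - (t < 0)


def subgradient_iterates(x0, step, n_iter):
    """Run x_{k+1} = x_k - step(k) * v_k on f(x) = |x|, with v_k = sign(x_k)."""
    xs = [x0]
    for k in range(n_iter):
        xs.append(xs[-1] - step(k) * sign(xs[-1]))
    return xs


# Constant step-size: the iterates jump back and forth across the kink at 0.
constant = subgradient_iterates(0.25, lambda k: 1.0, 2000)

# Diminishing step-size: the iterates approach the minimizer 0, but slowly.
diminishing = subgradient_iterates(0.25, lambda k: 1.0 / (k + 1), 2000)

print(sorted(set(constant)))
print(abs(diminishing[-1]))
```

With the constant step the sequence only ever visits two points on opposite sides of the origin, which is the behavior that the treatment of sign changes in the proposed method is meant to rule out.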
Subgradient methods with constant step-size have already been considered in the convex optimization literature, and, typically, it is possible to prove that the iterates reach the sublevel set , where the quantity is related to the step-size . In a similar flavor, if the objective function is strongly convex, the sequence produced by the algorithm manages to reach a ball centered at the minimizer, whose radius depends on . For a presentation of these results, we refer the reader to [3, Section 3.2]. Moreover, under suitable assumptions on the growth of around the minimizer , it is possible to prove that the distance of the iterates to has a linear decay, up to a certain threshold (that, once again, is estimated as a function of ). For further details, see [11, Theorem 1] and [8, Theorem 4.3]. Finally, we mention the recent contribution of [12], where the authors study the stability of a subgradient method with constant step-size around local minimizers, when is non-smooth and non-convex. To the best of our knowledge, the ones presented here are the first convergence results for a subgradient method with constant step-size.
In this paper, we devote our attention to the case where the non-regular term at the right-hand side of (0.1) consists in the $\ell_1$-penalization, i.e., where we have with , and
[TABLE]
This kind of problem is well-studied since the presence of the $\ell_1$-norm induces sparsity in the minimizer, and for this reason such minimization tasks easily arise in real-world applications. For instance, we recall [6] for signal processing applications, [29] for imaging problems, and finally [9, 27] for the $\ell_1$-regularized logistic regression, which is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing.
In our approach, we take advantage of the structure of the points where the objective is non-differentiable. We recall that, in the case of $\ell_1$-penalization, such points coincide with the set . Hence, at each iteration, if the current value has some null component, i.e., for some , we first decide which hyperplanes we move parallel to. This choice is automatically done by selecting for the update (0.2) the direction , where denotes the element of with minimal Euclidean norm. The interesting situation occurs when some components strictly change sign when moving from to . In that case, we have to properly decide whether to allow (some of) these changes of sign, or to set the corresponding components equal to 0. We stress the fact that this phase is fundamental in order to avoid the oscillations that characterized the one-dimensional example reported above. For this method, described in Algorithm 1, we can establish a linear convergence result as soon as the regular function appearing at the right-hand side of (0.1) is strongly convex. To show that, we make use of a non-smooth version of the Polyak-Łojasiewicz inequality (see, e.g., [5, 28]).
Then, in Section 3, we propose a momentum-based acceleration of Algorithm 1, inspired by the restarted-conservative algorithm introduced in [22]. In the smooth convex framework, the idea of introducing momentum to accelerate the convergence of the classical gradient method dates back to the 1960s, with the works of Polyak [17, 18]. These methods, often called heavy-ball, can be interpreted as discretizations of a second order damped mechanical system, where the objective function plays the role of the potential energy. In [26] it was shown that also the celebrated Nesterov accelerated gradient method (see [13]) can be interpreted in this framework. This led to a renewed interest in the interplay between discrete-time optimization algorithms and continuous-time dynamical models. In this context, in the mechanical system, the classical linear and isotropic viscosity friction is often replaced by a more general dissipative term. In this regard, we recall the contributions [1, 2, 23]. From the discrete-time side, in [16] the authors empirically observed that adaptively resetting to $0$ the momentum variable (i.e., the velocity) can further boost the convergence. Motivated by this fact, in [22] a conservative dynamical model was considered (i.e., without any dissipative term in the dynamics), whose convergence completely relies on a proper restart scheme. In Algorithm 2 we propose for composite functions with $\ell_1$-penalization a new version of the restarted-conservative algorithm that has been heuristically outlined in [22], and in Section 3 we show that the per-iteration decay achieved by Algorithm 2 is always greater than or equal to that of Algorithm 1.
Finally, in Section 4 we test our algorithms in strongly and non-strongly convex optimization problems with $\ell_1$-regularization.
1. Preliminary results
In this section we establish some auxiliary results that will be used later. Given a convex function , for every we denote with the subdifferential of at the point . We recall that
[TABLE]
Definition 1.
Let be a convex function. For every , we define the vector as follows
[TABLE]
Remark 1.
We observe that Definition 1 is always well-posed. Indeed, for every convex function , for every the subdifferential is a non-empty, compact and convex subset of . Namely, since we do not allow to assume the value , this fact descends directly from [15, Theorem 3.1.15]. Moreover, we can equivalently rephrase (1.1) as
[TABLE]
i.e., as a positive-definite quadratic programming problem on a convex domain. Hence, we deduce that is well-defined, and that it consists of a single element. Considering this last fact, in this paper we understand as a vector-valued operator, rather than a set-valued mapping.
We report below a non-smooth version of the celebrated Polyak-Łojasiewicz inequality. We refer the reader to [17] and [15, Theorem 2.1.10] for the classical statement in the smooth case, and to [5, Section 2.3] and [28, Section 2.2] for the extension to non-differentiable functions.
Lemma 1.1.
Let be a -strongly convex function, and let be its minimizer. Then, for every and for every element of the subdifferential the following inequality holds:
[TABLE]
and, in particular, we have
[TABLE]
Proof.
Let us introduce the auxiliary function defined as
[TABLE]
The fact that is -strongly convex guarantees that is still a convex function. Moreover, for every we have that
[TABLE]
This follows immediately from the fact that for every , and from the sum rule for subdifferentials (see, e.g., [21, Theorem 23.8]), i.e., . For every and for every we compute
[TABLE]
where we used (1.3) and the subdifferential inequality for the convex function . Recalling that , from (1.4) we directly deduce the thesis. ∎
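As a sanity check of Lemma 1.1, the following sketch numerically tests a Polyak-Łojasiewicz-type bound on a one-dimensional example. Everything specific here is an assumption made for illustration: the test function $f(x) = \frac{\mu}{2}(x-a)^2 + \gamma|x|$, the constants `mu`, `gamma`, `a`, and the form of the inequality, taken to be the standard one, $f(x) - f(x^*) \le \frac{1}{2\mu}|\xi|^2$ for $\xi \in \partial f(x)$. The check uses the subgradient of minimal absolute value, which is the most demanding choice.

```python
import numpy as np

mu, gamma, a = 2.0, 1.0, 1.5  # hypothetical constants for the test function


def f(x):
    """mu-strongly convex test function with an l1 term."""
    return 0.5 * mu * (x - a) ** 2 + gamma * np.abs(x)


def min_norm_subgrad(x):
    """Element of the subdifferential of f with minimal absolute value."""
    smooth = mu * (x - a)
    # At x = 0 the subdifferential is [smooth - gamma, smooth + gamma].
    at_zero = np.sign(smooth) * np.maximum(np.abs(smooth) - gamma, 0.0)
    return np.where(x == 0.0, at_zero, smooth + gamma * np.sign(x))


# Closed-form minimizer (soft-thresholding of a with threshold gamma / mu).
x_star = np.sign(a) * max(abs(a) - gamma / mu, 0.0)

xs = np.linspace(-3.0, 3.0, 601)
gap = f(xs) - f(x_star)
bound = min_norm_subgrad(xs) ** 2 / (2.0 * mu)
pl_holds = bool(np.all(gap <= bound + 1e-9))
print(pl_holds)
```

For this particular function the bound is attained with equality on part of the grid, so the small tolerance only absorbs floating-point rounding.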
We now introduce the class of functions that will be the main object of our investigation. We consider a composite objective (see [14]) of the form
[TABLE]
where is a -regular convex function with Lipschitz-continuous gradient of constant , and where is a positive constant. We recall that for every . We observe that
[TABLE]
for every , where is the -th element of the standard basis of . If we define , we have that
[TABLE]
for every , where denotes the usual partial derivative of the regular term at the right-hand side of (1.5). From (1.7) we read that the -th component of is affected only by . Therefore, in order to compute the operator introduced in Definition 1, we can find separately the element of minimal absolute value of for . We use to access the -th component of . In particular, for every we have that
[TABLE]
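The component-wise computation described above can be sketched as follows. The sketch assumes the objective has the form $g(x) + \gamma\|x\|_1$ with smooth $g$; the function name `min_norm_subgradient` and its arguments are hypothetical and only serve this illustration.

```python
import numpy as np


def min_norm_subgradient(x, grad_g, gamma):
    """Minimal-norm element of the subdifferential of g(x) + gamma * ||x||_1.

    x      : current point
    grad_g : gradient of the smooth term g evaluated at x
    gamma  : positive weight of the l1 term
    """
    x = np.asarray(x, dtype=float)
    grad_g = np.asarray(grad_g, dtype=float)
    # Where x_i != 0 the l1 term is differentiable: the component is unique.
    smooth_part = grad_g + gamma * np.sign(x)
    # Where x_i == 0 the component ranges over [grad_i - gamma, grad_i + gamma];
    # its element of minimal absolute value is a soft-thresholding of grad_i.
    kink_part = np.sign(grad_g) * np.maximum(np.abs(grad_g) - gamma, 0.0)
    return np.where(x == 0.0, kink_part, smooth_part)


x = np.array([1.0, -2.0, 0.0, 0.0, 0.0])
grad = np.array([0.5, 0.5, 0.3, 2.0, -3.0])
print(min_norm_subgradient(x, grad, gamma=1.0))
```

A null component with a small partial derivative (here `0.3`, below `gamma`) receives a zero entry, so a step along this direction leaves that component on its hyperplane.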
Definition 2.
Given , we define the following partition of the components induced by the point :
[TABLE]
From now on, when making use of a partition of the indexes of the components , for every we write , where is the vector obtained by extracting from the components that belong to , i.e., for every . The next technical result is the key-lemma of the convergence proof of Section 2.
Lemma 1.2.
Let be a convex function of the form (1.5). Given , let be the partition of corresponding to the point and prescribed by (1.9). Let us consider a vector such that
[TABLE]
Then the following inequality holds:
[TABLE]
where is the Lipschitz constant of the regular term at the right-hand side of (1.5), and is defined as in Definition 1.
Remark 2.
We recall that, in the case of a regular convex function with -Lipschitz continuous gradient, we have
[TABLE]
for every (see, e.g., [15, Theorem 2.1.5]). The crucial fact for the proof of Lemma 1.2 is that, when satisfies the conditions (1.10), the segment lies in a region where the restriction of the objective is regular, where we set . Lemma 1.2 will be used to prove that, along proper directions, the objective function is decreasing.
Proof.
Before proceeding, we introduce another partition of the set of indexes :
[TABLE]
and we define
[TABLE]
where are set according to (1.9). If we consider the segment for , it turns out that for every , where
[TABLE]
Let us define the auxiliary function as
[TABLE]
where is the smooth term at the right-hand side of (1.5). From the definition of , it follows that
[TABLE]
We observe that the function is as regular as , i.e., it is of class with -Lipschitz continuous gradient. Indeed, the first term at the right-hand side of (1.14) is obtained as the composition , where is the linear (-Lipschitz) orthogonal projection onto the subspace . Moreover, the last terms at the right-hand side of (1.14) are constant. Therefore, using the identity
[TABLE]
if we apply the estimate (1.12) to , we deduce that
[TABLE]
Therefore, the thesis follows if we show that the following equalities hold:
[TABLE]
Using the partition of the components provided by the families of indexes , , , , and , we have the following possibilities:
- If , in virtue of (1.8) and (1.13), we obtain .
- The case is analogous to .
- If , then and , and, in virtue of (1.10), we deduce that . In particular, using again (1.8), this implies that . On the other hand, recalling the expression of in (1.13) and the inclusion , we finally deduce .
- The case is analogous to .
- If , then , and we immediately obtain .
This argument shows that (1.15) is true, and it concludes the proof. ∎
2. Subgradient method and convergence analysis
In this section we propose a subgradient method with constant step-size for the numerical minimization of a convex function with the composite structure reported in (1.5). We stress that the analysis presented here holds only when the non-smooth term at the right-hand side of (1.5) is an $\ell_1$-penalization.
Before introducing formally the algorithm, we provide some insights that have guided us towards its construction. Let be the current guess for the minimizer of . We want to find a suitable direction in the subdifferential such that , where represents a constant step-size. In order to accomplish this, a natural choice consists in setting , where is defined as in (1.1). To see this, we first observe that, in virtue of the particular structure of reported in (1.7), we can choose separately the components of the direction of the movement. If , then consists of a single element, hence the only possible choice is . If and , then the convex application attains the minimum at . Hence, any choice with would give , resulting in an increase of the objective function. For this reason, it is convenient to set , and to move tangentially to the non-differentiability region . On the other hand, if and, e.g., , then , and for every choice of , we have that . However, observing that , it looks natural to set once again .
Besides the selection of the direction , the second crucial aspect is whether some sign changes occur in the coordinates when moving from to . If not, the situation is analogous to a step of the classical gradient descent in the smooth framework. On the other hand, if there is, e.g., a positive component that becomes negative, then we should carefully decide if the barrier should be crossed, or not. This is a key-point, in order to avoid the oscillations that characterized the simple example in the Introduction. In this case, we first set to $0$ the components involved in a sign change, and for these components we re-evaluate . Finally, using this additional information, we complete the step, as depicted in Figure 1. The implementation of the method is described in Algorithm 1.
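The two-stage step just described can be sketched in code. This is a simplified illustration and not a transcription of Algorithm 1: the objective is assumed to be $g(x) + \gamma\|x\|_1$, all function and variable names are hypothetical, and the final rule that keeps the candidate with the lower objective value is an assumption introduced here. The test problem (a separable quadratic with known minimizer) is also hypothetical.

```python
import numpy as np


def min_norm_subgradient(x, grad, gamma):
    """Minimal-norm subgradient of g + gamma * ||.||_1, given grad = grad g(x)."""
    shrunk = np.sign(grad) * np.maximum(np.abs(grad) - gamma, 0.0)
    return np.where(x == 0.0, shrunk, grad + gamma * np.sign(x))


def constant_step_subgradient(grad_g, objective, x0, gamma, h, n_iter):
    """Simplified constant step-size subgradient iteration with sign-change handling."""
    x = np.asarray(x0, dtype=float).copy()
    values = [objective(x)]
    for _ in range(n_iter):
        d = min_norm_subgradient(x, grad_g(x), gamma)
        trial = x - h * d
        crossed = x * trial < 0.0  # components that strictly change sign
        if crossed.any():
            # Stop the crossing components on their hyperplanes.
            stopped = np.where(crossed, 0.0, trial)
            # Re-evaluate the direction there and move only those components.
            d_new = min_norm_subgradient(stopped, grad_g(stopped), gamma)
            moved = stopped - h * np.where(crossed, d_new, 0.0)
            # Assumption of this sketch: keep the candidate with lower objective.
            trial = moved if objective(moved) <= objective(stopped) else stopped
        x = trial
        values.append(objective(x))
    return x, np.array(values)


# Hypothetical separable test problem: g(x) = 0.5 * sum(a * (x - b)**2).
a = np.array([1.0, 4.0, 10.0])
b = np.array([3.0, 0.1, -2.0])
gamma = 1.0
grad_g = lambda x: a * (x - b)
objective = lambda x: 0.5 * np.sum(a * (x - b) ** 2) + gamma * np.sum(np.abs(x))

x_final, values = constant_step_subgradient(
    grad_g, objective, x0=[-1.0, 1.0, 1.0], gamma=gamma, h=1.0 / a.max(), n_iter=500
)
print(x_final)
```

On this separable example the minimizer is available in closed form by soft-thresholding each `b[i]` with threshold `gamma / a[i]`, which makes it easy to compare the output against the exact solution.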
We now establish the linear convergence result for Algorithm 1 in the case of strongly convex objective.
Theorem 2.1.
Let be a function such that for every , where and is -regular. We further assume that there exist constants such that is -strongly convex and is -Lipschitz continuous. Let be the sequence generated by Algorithm 1. Then, there exists such that
[TABLE]
where denotes the unique minimizer of , and where we set the step-size .
Proof.
We follow the procedure described in Algorithm 1. We prove that each iteration leads to a linear decrease of the value of the objective function. The first stage of each step is based on the following update:
[TABLE]
where represents the step-size of the sub-gradient method. We distinguish two possible scenarios, corresponding to the if-else statement at the lines 5 and 7 of Algorithm 1.
Case 1. We have that
[TABLE]
i.e., none of the components of and of changes sign, in the sense that from strictly positive it becomes strictly negative, or vice-versa. If we set , we observe that the hypotheses of Lemma 1.2 are met for the point and the vector . Indeed, using the partition introduced in (1.9) and induced by the point , from (2.3) it follows that implies . A similar argument holds for . Finally, if , then satisfies (1.10) by construction. Therefore, from (1.11) we deduce that
[TABLE]
Moreover, if , in virtue of Lemma 1.1, we obtain that
[TABLE]
In this case, we assign and, choosing in order to minimize the right-hand side of the previous inequality, we get
[TABLE]
Case 2. Recalling the definition of in (2.2), we are in the second scenario when
[TABLE]
i.e., there is at least one component that strictly changes sign. Before proceeding, we introduce the following partition of the components:
[TABLE]
and we define the following intermediate points:
[TABLE]
and
[TABLE]
where
[TABLE]
We observe that (2.7) corresponds to the assignments of lines 9-10 in Algorithm 1, while (2.9) incorporates lines 11-12. Finally, is defined in (2.8) according to line 13. We stress that in the update (2.8) the vector is computed by re-evaluating at the point . This is because may exhibit sudden changes when considering the points and . In this regard, our construction guarantees that we employ the most trustworthy values for the choice of the decrease direction . We point out that, if , then . Moreover, we remark that if and , then we have necessarily that . Indeed, in this case, from (2.2) and it follows that , while gives , resulting in .
Phase (1). From (2.7), we immediately observe that
[TABLE]
with
[TABLE]
and where, for every , we set
[TABLE]
We first notice that . Indeed, assuming that (otherwise there is nothing to prove), since , recalling (2.6) and (2.2), we have
[TABLE]
which in turn gives and, as a matter of fact, . On the other hand, in order to show that , we assume without loss of generality that . Then, using again (2.11), it follows that
[TABLE]
that yields . Therefore, we conclude that
[TABLE]
Finally, from (2.5) we deduce that there exists at least one index such that .
Using the partition of induced by the point and prescribed by (1.9), we obtain that the following conditions are satisfied:
- If , then either or . In the first case, , then . In the second, . Hence, in any case, .
- If , then an analogous reasoning as before yields .
- If , then . Hence, , and . Therefore, .
The previous argument proves that the vector introduced in (2.10) satisfies the assumptions of Lemma 1.2 at the point . Thus, we deduce that
[TABLE]
If we set , we observe that (2.13) implies that whenever . We stress the fact that the condition (2.5) that characterizes the present scenario guarantees that .
Phase (2). We now investigate the update described in (2.8)-(2.9). Let and be the partition of the components induced by the point and prescribed by (1.9). Recalling (2.6) and the definition of in (2.7), we observe that , and . Hence, since for every , from (1.8) it follows that
[TABLE]
which, in virtue of (2.9), yields
[TABLE]
Moreover, using (2.9), (2.2) and (2.6), we deduce that
[TABLE]
On the other hand, from (2.9) and recalling that , we have that
[TABLE]
By combining (2.15) and (2.16), we obtain that the hypotheses of Lemma 1.2 are met when considering the point and the direction . Hence, it follows that
[TABLE]
On the other hand, recalling (2.9) and (2.14), we have that
[TABLE]
where we used the Lipschitz-continuity of , (2.10) and the fact that . If we set in (2.17), owing to (2.18) we deduce that
[TABLE]
Moreover, by combining the last inequality with (2.13) (using again ), we obtain that
[TABLE]
where we used (2.12) in the last passage. In virtue of Lemma 1.1, from (2.19) we deduce that
[TABLE]
We now distinguish two possibilities, corresponding to the if-else statement at lines 14 and 16 of Algorithm 1.
- If , then we set .
- If , then we set .
In any case, from (2.20) we obtain
[TABLE]
Finally, in virtue of (2.4) and (2.21), if we set
[TABLE]
we deduce the thesis. ∎
Remark 3.
The hypothesis of the strong convexity of the smooth function in Theorem 2.1 can be slightly relaxed by requiring that is convex, that the objective admits a minimizer and that there exists a constant such that satisfies the inequality (1.2) for every . Indeed, in the proof of Theorem 2.1 we only employ (1.2), and we do not use the strong convexity assumption. On the other hand, the assumption of convexity for is needed for the notion of subgradient considered in this paper.
3. Accelerated subgradient method
In this section we propose a momentum-based acceleration of Algorithm 1 for an objective function with the $\ell_1$-composite structure introduced in (1.5). As observed in the Introduction, in the smooth-objective framework it is possible to design minimization schemes with momentum by discretizing second order ODEs of the form:
[TABLE]
where represents the objective function, and is a positive semi-definite matrix that tunes the generalized viscosity friction. In [16] it was noticed that adaptive restart strategies can further accelerate the convergence to the minimizer, since they are capable of eliminating the oscillations typical of under-damped mechanical systems. The term adaptive restart denotes a procedure that resets to $0$ the momentum/velocity variable (i.e., in (3.1)), as soon as a suitable condition is satisfied. In [22] a conservative dynamics was considered by dropping the viscosity term, i.e., choosing in (3.1). Then, using the symplectic Euler scheme (see, e.g., [10]) to discretize the system, the following conservative algorithm was proposed:
[TABLE]
where represents the discretization step-size. In the case of a regular and convex objective , the conservative scheme (3.2) achieves at each iteration a decrease of the function greater than or equal to that of the classical gradient descent. This fact relies on the following restart strategy: “reset whenever ”. In [22] a heuristic extension of (3.2) to the case of a non-smooth objective with $\ell_1$-composite structure was also investigated, where was used in (3.2) in place of , i.e.,
[TABLE]
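For a smooth objective, a conservative momentum scheme with restarts can be sketched as below. This is an illustration under stated assumptions, not the algorithm of [22]: the update is written as a symplectic-Euler-type step (velocity first, then position), and the restart test used here, which resets the velocity whenever the momentum step is worse than a plain gradient step of size `h**2`, is an assumption chosen so that each iteration is at least as good as gradient descent. All names and the test problem are hypothetical.

```python
import numpy as np


def restart_conservative(grad, objective, x0, h, n_iter):
    """Conservative momentum iteration with a function-value restart test."""
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    values = [objective(x)]
    restarts = 0
    for _ in range(n_iter):
        g = grad(x)
        v_new = v - h * g            # velocity update
        x_momentum = x + h * v_new   # equals x - h**2 * g + h * v
        x_gradient = x - h * h * g   # what the step would be with v = 0
        if objective(x_momentum) > objective(x_gradient):
            # Restart: discard the old velocity before performing the step.
            v_new = -h * g
            x_momentum = x_gradient
            restarts += 1
        x, v = x_momentum, v_new
        values.append(objective(x))
    return x, np.array(values), restarts


# Hypothetical ill-conditioned quadratic; h**2 = 1 / L with L = 100.
A = np.diag([1.0, 100.0])
objective = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x_rc, values_rc, n_restarts = restart_conservative(
    grad, objective, x0=[1.0, 1.0], h=0.1, n_iter=300
)
print(values_rc[-1], n_restarts)
```

Because every accepted step is no worse than the gradient step with step-size `h**2`, choosing `h**2` equal to the inverse of the Lipschitz constant of the gradient makes the objective values non-increasing in this sketch.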
In this section, taking advantage of the observations done in Section 2 for the non-accelerated subgradient method, we propose a variant of the algorithm described in [22, Algorithm 4]. The main differences concern the way we manage the changes of sign in the components, and the condition for the reset of the momentum variable. Indeed, from (3.3) we deduce that
[TABLE]
where we set . Therefore, it is natural to divide every step of the accelerated algorithm into two phases:
- (subgradient phase). If sign changes in the components occur, we adopt the same procedures as in Algorithm 1.
- (momentum phase). Also in this phase, we take particular care of sign changes of the components.
Moreover, we use the general principle that “in the momentum phase we do not modify null components”. This is motivated by the fact that the momentum variable carries information about the previous values of the . However, since typically undergoes sudden modification when the -th component of the state variable vanishes or changes sign, the information contained in could be of little use, if not misleading. For this reason, in Algorithm 2 we set if the -th component of the state variable is null, or if it has been involved in a sign change. See, respectively, line 10 and line 17 of the accelerated subgradient method reported in Algorithm 2. Finally, in virtue of (3.4) and the remarks done above, we observe that a natural choice for the stepsize is , where is the Lipschitz constant of the gradient of the regular term .
Remark 4.
In line 31 of Algorithm 2 we have introduced the quantity . We recall that , where is convex and -regular, and . Using the same notations as in Algorithm 2, is defined as follows:
[TABLE]
for every . We observe that is well-defined for every component since, by construction, for every .
Remark 5.
We observe that the computation of the quantity at the line 35 requires an evaluation of the subdifferential of at the point . From a computational viewpoint, the demanding part is the evaluation of the gradient of the regular term, i.e., . However, if , then (line 42), and can be stored and re-used for the construction of at the subsequent iteration.
We can prove the following result on the decrease of the objective function , guaranteeing that, in any circumstance, Algorithm 2 is at least as good as Algorithm 1.
Proposition 3.1.
Let be a function such that for every , where and is a convex function such that is -Lipschitz continuous, with . Let us consider as the initial point, and let be the output produced by an iteration of Algorithm 2 and let be the output of an iteration of Algorithm 1 (see line 29 of Algorithm 2). Then, we have that .
Remark 6.
Under the same assumptions as Theorem 2.1, i.e., when is -strongly convex, from Proposition 3.1 it follows that Algorithm 2 achieves a linear convergence rate. Indeed, if we denote by the sequence generated by Algorithm 2 setting the step-size equal to the inverse of the Lipschitz constant of , then, if we apply Proposition 3.1 with , for every we have:
[TABLE]
where is the constant appearing in Theorem 2.1, and is the output of a single iteration of Algorithm 1 with starting point .
Proof.
Using the same notations as in Algorithm 2, we have that is obtained from with an iteration of Algorithm 1 (see line 19 and line 22 of Algorithm 2). If , then there is nothing to prove. On the other hand, owing to the if statement at lines 26-30, we have that for every . We further observe that holds in every case (see line 25 and line 29). Let us define
[TABLE]
and the set
[TABLE]
Then, we have that , and that the restrictions , where is a -regular and convex function that satisfies:
[TABLE]
Moreover, from (3.5) we read that . Since is convex, we have that
[TABLE]
and, recalling that and , it follows that the condition implies .
On the other hand, if , then we reset (see line 35), and . ∎
4. Numerical experiments
In this section we present some numerical experiments involving composite objective functions with $\ell_1$-regularization. We tested Algorithm 1 and its accelerated version Algorithm 2 on objective functions of the form , where is convex and regular. We considered both the strongly convex and the non-strongly convex case. For each class of problems, we compared the performance of our methods with ISTA, i.e., the standard forward-backward thresholding algorithm for $\ell_1$-regularized problems (see, e.g., [7]). In [4] an accelerated version of ISTA (called Fast ISTA, or FISTA) was proposed, and in [16] it was observed that the convergence rate of FISTA can be further improved by means of adaptive restarts. We use the restarted FISTA described in [16] as the benchmark for the experiments of this part. We also report the performance of the conservative-restart algorithm introduced in [22]. The results are illustrated in Figure 2.
Quadratic function with $\ell_1$-regularization
We considered a function of the form
[TABLE]
where is a symmetric positive definite matrix with eigenvalues sampled uniformly in the interval , and was generated with a Gaussian distribution . We set , and we sampled the starting point using . We fixed the dimension . We observe that the objective function is strongly convex, hence, in principle, it could be possible to consider optimization schemes designed for strongly convex problems. However, their efficiency relies on how sharp the available estimate of the strong convexity constant is. On the other hand, neither restarted-FISTA nor Algorithm 2 requires this information. This is one of the features of restarted-FISTA highlighted in [16].
Quadratic regression with $\ell_1$-regularization
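The ISTA baseline on a random strongly convex quadratic can be sketched as follows. The setup is only indicative: the quadratic is assumed to have the form $\frac{1}{2}x^\top A x - b^\top x + \gamma\|x\|_1$, and the dimension `n`, the eigenvalue interval `[1, 10]`, the weight `gamma`, and the random seed are hypothetical values, not the ones used in the experiments.

```python
import numpy as np


def soft_threshold(z, tau):
    """Proximal map of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


def ista(grad_g, x0, gamma, h, n_iter):
    """Forward-backward iteration: gradient step on g, then soft-thresholding."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        x = soft_threshold(x - h * grad_g(x), h * gamma)
    return x


rng = np.random.default_rng(0)
n = 20                                              # hypothetical dimension
eigenvalues = rng.uniform(1.0, 10.0, size=n)        # hypothetical spectrum
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))    # random orthogonal basis
A = Q @ np.diag(eigenvalues) @ Q.T                  # symmetric positive definite
b = rng.standard_normal(n)
gamma = 0.5                                         # hypothetical l1 weight
L = eigenvalues.max()                               # Lipschitz constant of the gradient

x_ista = ista(lambda x: A @ x - b, rng.standard_normal(n), gamma, 1.0 / L, 2000)
print(np.count_nonzero(x_ista), "nonzero components out of", n)
```

At a fixed point of this iteration the gradient of the smooth term equals `-gamma * sign(x_i)` on the nonzero components and has absolute value at most `gamma` on the null ones, which gives a simple way to check optimality.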
We considered a sparse quadratic regression problem. We generated a sparse random vector whose components were non-zero with probability . These values were sampled using a uniform distribution over . We took a matrix whose singular values were uniformly sampled in , and we set , where represented a Gaussian noise distributed as . Finally, the objective function had the form
[TABLE]
with . We used and , and we sampled the component of the initial guess with . This problem is non-strongly convex, since the matrix does not have full rank.
Logistic regression with $\ell_1$-regularization
We considered a sparse logistic regression problem. We constructed with the following procedure: each component was zero with probability , and, if nonzero, its value was sampled using a standard normal . Then, we independently sampled the entries of using the distribution: for every , where are the rows of a matrix with independent components generated with . Assuming that the matrix and the measurements are known, the sparse log-likelihood maximization can be formulated as the problem of minimizing
[TABLE]
where we set . We used and , and we sampled the components of the initial guess with . This problem is convex but not strongly convex.
LogSumExp with $\ell_1$-regularization
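A numerically stable evaluation of a logistic loss with $\ell_1$ penalty might look like the sketch below. The label convention (values in {-1, +1}), the absence of a 1/m normalization, and the function names are assumptions; the exact objective used in the experiment may differ in these details.

```python
import math


def log1pexp(z):
    """Stable evaluation of log(1 + exp(z)) for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    """Stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_l1(a, y, gamma, x):
    """Value of sum_i log(1 + exp(-y_i <a_i, x>)) + gamma * ||x||_1,
    together with the gradient of the smooth part."""
    value = gamma * sum(abs(xi) for xi in x)
    grad = [0.0] * len(x)
    for row, yi in zip(a, y):
        margin = yi * sum(rj * xj for rj, xj in zip(row, x))
        value += log1pexp(-margin)
        weight = -yi * sigmoid(-margin)  # derivative of the loss w.r.t. <a_i, x>
        for j, rj in enumerate(row):
            grad[j] += weight * rj
    return value, grad
```

The split into `log1pexp` and `sigmoid` avoids overflow in `math.exp` when the margins are large in absolute value, which can happen far from the minimizer.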
We considered the function defined as follows:
[TABLE]
where are the rows of the matrix , and . The entries of and were independently sampled using a Gaussian , as well as the components of the starting point. We set , and we used and . This is another example of a non-strongly convex problem.
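A LogSumExp term is usually evaluated by subtracting the largest exponent first. The sketch below assumes a smooth part of the form log(sum_i exp(<a_i, x> - b_i)); the sign convention of the offset vector and the function name are assumptions made for illustration.

```python
import math


def logsumexp_l1(a, b, gamma, x):
    """Value of log(sum_i exp(<a_i, x> - b_i)) + gamma * ||x||_1,
    together with the gradient of the smooth part."""
    z = [sum(rj * xj for rj, xj in zip(row, x)) - bi for row, bi in zip(a, b)]
    z_max = max(z)
    # Shifting by z_max keeps every exponent non-positive, so exp cannot overflow.
    w = [math.exp(zi - z_max) for zi in z]
    s = sum(w)
    value = z_max + math.log(s) + gamma * sum(abs(xi) for xi in x)
    # The gradient is the softmax-weighted average of the rows of the matrix.
    grad = [sum(w[i] * a[i][j] for i in range(len(a))) / s for j in range(len(x))]
    return value, grad
```

Since the gradient is a convex combination of the rows, it stays bounded regardless of the point at which it is evaluated.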
We briefly comment on the results of the experiments described above. The non-accelerated algorithms, i.e., Algorithm 1 and ISTA, always show very similar performance. Restarted FISTA performs best in the strongly convex case, while Algorithm 2 seems to be the most efficient on non-strongly convex objectives. Compared to the conservative-restart algorithm of [22], Algorithm 2 is much faster in the early phases of the minimization process. Finally, the classical subgradient method with diminishing step-size is the worst-performing scheme.
The fact that the decays achieved by Algorithm 1 and ISTA are almost identical motivated us to construct an example where the difference in performance could be more apparent. We considered a two-dimensional function such that , and such that , for some . More precisely, we defined as
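For comparison, the classical subgradient baseline can be sketched as below. The step-size schedule `step0 / sqrt(k + 1)` and the choice of the zero element of the subdifferential at the kinks are assumptions; the exact schedule used in the experiments is not specified here.

```python
import math


def sign(v):
    """Sign of v, with sign(0) = 0 (one admissible subgradient of |.| at the origin)."""
    return (v > 0) - (v < 0)


def subgradient_method(grad_f, x0, gamma, step0, n_iter):
    """Classical subgradient iteration for f(x) + gamma * ||x||_1
    with diminishing step-size step0 / sqrt(k + 1)."""
    x = list(x0)
    for k in range(n_iter):
        g = grad_f(x)
        step = step0 / math.sqrt(k + 1)
        x = [xi - step * (gi + gamma * sign(xi)) for xi, gi in zip(x, g)]
    return x
```

Unlike a thresholding scheme, this iteration typically does not produce exactly sparse iterates: a component whose optimal value is zero keeps oscillating around the origin with an amplitude proportional to the current step-size.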
[TABLE]
and we set and . In this case, correctly identifying that the second component of the minimizer is zero can be challenging. This is due to the identity , or, in other words, to the fact that the vector [math] does not lie in the relative interior of . In this scenario, when a crossing of the set occurs, we expect that Algorithm 1 might better decide whether the component should be set equal to [math]. We used as initial guess the point . We also considered a family of problems obtained by perturbing , and with Gaussian noise of standard deviations, respectively, equal to , and . The results are reported in Figure 3. We observe that Algorithm 1 performs better than ISTA on the designed problem, and this advantage seems to be robust with respect to the perturbations introduced. Finally, despite using step-sizes that decay faster than in the previous experiments, the classical subgradient method exhibits evident oscillations, both in the original and in the noisy problems.
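The degenerate situation discussed above can be made concrete with the standard optimality conditions for f(x) + gamma * ||x||_1: on a non-zero component the partial derivative of f equals -gamma * sign(x_i), while on a zero component its absolute value is at most gamma. The sketch below classifies the zero components of a minimizer as strict (absolute value below gamma) or degenerate (absolute value equal to gamma). The function names and the small two-dimensional example are hypothetical and are not the function used in the experiment.

```python
def is_stationary(x_star, grad_star, gamma, tol=1e-10):
    """Check the first-order optimality conditions for f + gamma * ||.||_1."""
    for xi, gi in zip(x_star, grad_star):
        if xi != 0.0:
            s = 1.0 if xi > 0.0 else -1.0
            if abs(gi + gamma * s) > tol:
                return False
        elif abs(gi) > gamma + tol:
            return False
    return True


def classify_components(x_star, grad_star, gamma, tol=1e-10):
    """Label each component as 'nonzero', 'strict' or 'degenerate'."""
    labels = []
    for xi, gi in zip(x_star, grad_star):
        if xi != 0.0:
            labels.append("nonzero")
        elif abs(gi) < gamma - tol:
            labels.append("strict")
        else:
            labels.append("degenerate")  # -grad lies on the boundary of the subdifferential
    return labels


# Hypothetical example: f(x) = 0.5 * (x1 - 3)^2 + 0.5 * x2^2 - x2, gamma = 1.
# Its minimizer is (2, 0), where the gradient of f equals (-1, -1).
labels = classify_components([2.0, 0.0], [-1.0, -1.0], 1.0)
```

In the hypothetical example the second component is labeled degenerate: the optimality condition holds with equality, so an arbitrarily small perturbation of the data can move the minimizer away from the coordinate axis.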
Conclusions
In this paper, we considered composite convex optimization problems with $\ell_1$-penalization, and we formulated a subgradient algorithm with constant step-size. In the case of strongly convex objectives, we established a linear convergence result for the method. Using dynamical system considerations, we proposed an accelerated version of the subgradient algorithm that, at each iteration, achieves a decrease of the objective that is always greater than or equal to the decrease produced by a step of the non-accelerated subgradient method. We observed in numerical experiments that the inertial algorithm can effectively compete with one of the best-performing schemes for this kind of problem, i.e., FISTA combined with an adaptive restart strategy.
For future work, it could be interesting to design subgradient algorithms for composite optimization involving a non-smooth term of the form . In this case, a challenging point consists in finding strategies for computing (or a suitable approximation) that could be practical for high-dimensional settings.
Acknowledgments
This paper is dedicated to the beloved memory of Prof. Piero Colli Franzone. A.S. acknowledges partial support from INdAM-GNAMPA. A.S. wishes to thank two anonymous Referees for their helpful comments, which helped improve the quality of the paper.
References
- [1] H. Attouch, J. Peypouquet, P. Redont. Fast convex optimization via inertial dynamics with Hessian driven damping. Journal of Differential Equations, 261:5734–5783, 2016. doi: 10.1016/j.jde.2016.08.020
- [2] H. Attouch, Z. Chbani, J. Peypouquet, P. Redont. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Math. Program., 168:123–175, 2018. doi: 10.1007/s10107-016-0992-8
- [3] D. Bertsekas. Convex Optimization Algorithms. Athena Scientific, Nashua, 2015.
- [4] A. Beck, M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2:183–202, 2009. doi: 10.1137/080716542
- [5] J. Bolte, T.P. Nguyen, J. Peypouquet, B.W. Suter. From error bounds to the complexity of first-order descent methods for convex functions. Math. Program., 165:471–507, 2017. doi: 10.1007/s10107-016-1091-6
- [6] E. Candès, J.K. Romberg, T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59:1207–1223, 2008. doi: 10.1002/cpa.20124
- [7] P.L. Combettes, V.R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Model. Sim., 4(4):1168–1200, 2005. doi: 10.1137/050626090
- [8] D. Davis, D. Drusvyatskiy, K.J. MacPhee, C. Paquette. Subgradient Methods for Sharp Weakly Convex Functions. J. Optim. Theory Appl., 179:962–982, 2018. doi: 10.1007/s10957-018-1372-8
