A subgradient method with constant step-size for $\ell_1$-composite optimization

Alessandro Scagliotti; Piero Colli Franzone

arXiv:2302.12105·math.OC·November 20, 2023

TL;DR

This paper introduces a subgradient method with a constant step-size for $\ell_1$-regularized convex optimization, achieving linear convergence in strongly convex cases and demonstrating effectiveness through numerical tests.

Contribution

It proposes a novel subgradient method with constant step-size for $\ell_1$-regularized problems and an accelerated version with proven linear convergence.

Findings

01

Linear convergence for strongly convex smooth terms

02

Effective performance on both strongly and non-strongly convex examples

03

Accelerated algorithm with adaptive restart strategy

Equations

$$f(x) = g(x) + h(x),$$

$$x^{k+1} = x^{k} - h_{k}\mathfrak{v}^{k}, \qquad k \geq 0,$$

$$x^{k+1} = x^{k} - \nu_{k}\frac{\mathfrak{v}^{k}}{|\mathfrak{v}^{k}|_{2}},$$

$$f(x) = g(x) + \gamma|x|_{1}.$$

$$\partial f(x) := \{y\in\mathbb{R}^{n} \mid f(z) \geq f(x) + \langle y, z-x\rangle,\ \forall z\in\mathbb{R}^{n}\}.$$

$$\partial^{-}f(x) := \arg\min\{|y|_{2} \mid y\in\partial f(x)\}.$$

$$\partial^{-}f(x) := \arg\min\{|y|_{2}^{2} \mid y\in\partial f(x)\},$$

$$f(x) - f(x^{*}) \leq \frac{1}{2\mu}|y|_{2}^{2},$$

$$f(x) - f(x^{*}) \leq \frac{1}{2\mu}|\partial^{-}f(x)|_{2}^{2}.$$

$$\psi(x) := f(x) - \frac{\mu}{2}|x-x^{*}|_{2}^{2}.$$

$$\partial\psi(x) = \partial f(x) - \mu(x-x^{*}).$$

$$\psi(x^{*}) \geq \psi(x) + \langle y - \mu(x-x^{*}), x^{*}-x\rangle = f(x) + \frac{\mu}{2}|x-x^{*}|_{2}^{2} + \langle y, x^{*}-x\rangle \geq f(x) - \frac{1}{2\mu}|y|_{2}^{2},$$

$$f(x) = g(x) + \gamma|x|_{1},$$

$$\partial f(x) = \left\{\nabla g(x) + \gamma\sum_{i=1}^{n}\nu_{i}e_{i} \;\middle|\; \nu_{i} = \mathrm{sign}(x_{i}) \text{ if } x_{i}\neq 0,\ \nu_{i}\in[-1,1] \text{ if } x_{i}=0\right\}$$

$$\partial_{i}f(x) = \begin{cases}\{\partial_{i}g(x) + \gamma\nu_{i} \mid \nu_{i} = \mathrm{sign}(x_{i})\} & x_{i}\neq 0,\\ \{\partial_{i}g(x) + \gamma\nu_{i} \mid \nu_{i}\in[-1,1]\} & x_{i}=0,\end{cases}$$

$$\partial_{i}^{-}f(x) = \begin{cases}\partial_{i}g(x) + \gamma\,\mathrm{sign}(x_{i}) & x_{i}\neq 0,\\ \mathrm{sign}(\partial_{i}g(x))\max\{|\partial_{i}g(x)| - \gamma, 0\} & x_{i}=0.\end{cases}$$

$$\begin{aligned}\alpha_{x}^{+} &:= \{i\in\{1,\dots,n\} \mid x_{i}>0\},\\ \alpha_{x}^{-} &:= \{i\in\{1,\dots,n\} \mid x_{i}<0\},\\ \beta_{x} &:= \{i\in\{1,\dots,n\} \mid x_{i}=0\}.\end{aligned}$$

$$\begin{cases} x_{i}+v_{i}\geq 0 & \forall i\in\alpha_{x}^{+},\\ x_{i}+v_{i}\leq 0 & \forall i\in\alpha_{x}^{-},\\ v_{i}=0 & \forall i\in\beta_{x} \text{ s.t. } \partial_{i}^{-}f(x)=0,\\ v_{i}\geq 0 & \forall i\in\beta_{x} \text{ s.t. } \partial_{i}^{-}f(x)<0,\\ v_{i}\leq 0 & \forall i\in\beta_{x} \text{ s.t. } \partial_{i}^{-}f(x)>0.\end{cases}$$

$$f(x+v) \leq f(x) + \langle\partial^{-}f(x), v\rangle + \frac{1}{2}L|v|_{2}^{2},$$

$$\phi(x+v) \leq \phi(x) + \langle\nabla\phi(x), v\rangle + \frac{1}{2}L|v|_{2}^{2}$$

$$\beta_{x}^{+}, \qquad \beta_{x}^{-}, \qquad \beta_{x}^{0}$$

$$\zeta^{+} := \alpha_{x}^{+}\cup\beta_{x}^{+}, \qquad \zeta^{-} := \alpha_{x}^{-}\cup\beta_{x}^{-}, \qquad \zeta^{0} := \beta_{x}^{0}$$

$$C_{\zeta^{\pm,0}} := \{z\in\mathbb{R}^{n} \mid z_{\zeta^{+}}\geq 0,\ z_{\zeta^{-}}\leq 0,\ z_{\zeta^{0}}=0\}.$$

$$f^{\mathrm{aux}}: z=(z_{\zeta^{+}},z_{\zeta^{-}},z_{\zeta^{0}}) \mapsto g(z_{\zeta^{+}},z_{\zeta^{-}},0_{\zeta^{0}}) + \gamma\sum_{i\in\zeta^{+}}z_{i} - \gamma\sum_{i\in\zeta^{-}}z_{i},$$

$$\nabla f^{\mathrm{aux}}: z=(z_{\zeta^{+}},z_{\zeta^{-}},z_{\zeta^{0}}) \mapsto \nabla g(z_{\zeta^{+}},z_{\zeta^{-}},0_{\zeta^{0}}) + \gamma\sum_{i\in\zeta^{+}}e_{i} - \gamma\sum_{i\in\zeta^{-}}e_{i}.$$

$$f|_{C_{\zeta^{\pm,0}}} \equiv f^{\mathrm{aux}}|_{C_{\zeta^{\pm,0}}},$$

$$f(x+v) \leq f(x) + \langle\nabla f^{\mathrm{aux}}(x), v\rangle + \frac{1}{2}L|v|_{2}^{2}.$$

$$\frac{\partial}{\partial x_{i}}f^{\mathrm{aux}}(x)\,v_{i} = \partial_{i}^{-}f(x)\,v_{i}, \qquad i=1,\dots,n.$$

$$f(x^{k}) - f(x^{*}) \leq \kappa^{k}\big(f(x^{0}) - f(x^{*})\big),$$

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Advanced Optimization Algorithms Research

Full text

A subgradient method with constant step-size for $\ell_{1}$-composite optimization

Alessandro Scagliotti and Piero Colli Franzone

Technical University of Munich (TUM) & Munich Center for Machine Learning (MCML), Germany

Dipartimento di Matematica, Università di Pavia, Italy

Abstract.

Subgradient methods are the natural extension to the non-smooth case of the classical gradient descent for regular convex optimization problems. However, in general, they are characterized by slow convergence rates, and they require decreasing step-sizes to converge. In this paper we propose a subgradient method with constant step-size for composite convex objectives with $\ell_{1}$-regularization. If the smooth term is strongly convex, we can establish a linear convergence result for the function values. This fact relies on an accurate choice of the element of the subdifferential used for the update, and on proper actions adopted when non-differentiability regions are crossed. Then, we propose an accelerated version of the algorithm, based on conservative inertial dynamics and on an adaptive restart strategy, that is guaranteed to achieve a linear convergence rate in the strongly convex case. Finally, we test the performance of our algorithms on some strongly and non-strongly convex examples.

Keywords: convex optimization, $\ell_{1}$ -regularization, subgradient method, inertial acceleration, restart strategies.

Introduction

In this paper we deal with convex composite optimization, i.e., we consider objective functions $f:\mathbb{R}^{n}\to\mathbb{R}$ of the form

$$f(x) = g(x) + h(x), \tag{0.1}$$

where $g:\mathbb{R}^{n}\to\mathbb{R}$ is $C^{1}$-regular with Lipschitz-continuous gradient, and $h:\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ is a non-smooth convex function. We recall that the concept of composite function was introduced by Nesterov in [14], and it usually denotes the splitting (0.1) in the case where the non-regular term $h$ is simple. In this framework, examples of simple functions include the indicator of a closed convex set and the supremum of a finite family of linear functions. The problem of minimizing such composite functions can be effectively addressed by means of forward-backward methods (see, e.g., [7]) and their accelerated versions [4]. In this regard, we mention the recent contribution [20], which considers an accelerated method that achieves linear convergence when $g,h$ in (0.1) are strongly convex.

The aim of this paper is to develop a convergent subgradient method with constant step-size for the minimization of particular instances of (0.1). The subgradient method was first introduced in [24] and, given an initial guess $x^{0}\in\mathbb{R}^{n}$ , the algorithm produces a sequence $(x^{k})_{k\geq 0}$ with update rule

$$x^{k+1} = x^{k} - h_{k}\mathfrak{v}^{k}, \qquad k \geq 0, \tag{0.2}$$

where $\mathfrak{v}^{k}\in\partial f(x^{k})$ , i.e., it is an element taken from the subdifferential of the objective at the point $x^{k}$ , and $h_{k}>0$ denotes the step-size. If we set $\nu_{k}=h_{k}|\mathfrak{v}^{k}|_{2}$ , we can equivalently rephrase (0.2) as

$$x^{k+1} = x^{k} - \nu_{k}\frac{\mathfrak{v}^{k}}{|\mathfrak{v}^{k}|_{2}},$$

where $\nu_{k}$ represents the step-length at the $k$-th iteration. It is possible to deduce the convergence $\lim_{k\to\infty}f(x^{k})=f(x^{*})$ as soon as $(\nu_{k})_{k\geq 0}$ satisfies $\lim_{k\to\infty}\nu_{k}=0$ and $\sum_{k=1}^{\infty}\nu_{k}=\infty$ (see [25, Chapter 2]). In [19, Theorem 5.2] a construction for $(\nu_{k})_{k\geq 1}$ is proposed that achieves $f(x^{k})-f(x^{*})=o(1/\sqrt{k})$ as $k\to\infty$ when the value $f(x^{*})$ is known a priori. We stress that, in the results mentioned above, the vector $\mathfrak{v}^{k}$ can be any element of $\partial f(x^{k})$. If we now consider constant step-sizes, i.e., $h_{k}=h>0$ for every $k\geq 0$, in general we cannot expect the iterates of (0.2) to converge to a minimizer. For instance, given the one-dimensional function $f:x\mapsto|x|$, for every choice $h>0$, if the initial guess $x^{0}\not\in\{rh:r\in\mathbb{Z}\}$, then the sequence produced by (0.2) oscillates and remains well-separated from $0$. From this example it is clear that, in order to work out a convergent subgradient method with constant step-size, it is crucial to identify the regions where the objective $f$ is non-differentiable, and to take proper actions when the sequence $(x^{k})_{k\geq 0}$ crosses them. Moreover, in our analysis a role of primary importance is played by the choice of the element $\mathfrak{v}^{k}\in\partial f(x^{k})$ used for the iteration.
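The oscillation in the one-dimensional example can be observed numerically. The following is a minimal sketch, assuming the subgradient $0$ is selected at $x=0$; the function names and the values $x^{0}=1$, $h=0.3$ are illustrative choices, not taken from the paper.

```python
def subgradient_abs(x):
    """An element of the subdifferential of |x|: sign(x), with 0 chosen at x = 0."""
    return (x > 0) - (x < 0)


def constant_step_iterates(x0, h, steps):
    """Iterates of x_{k+1} = x_k - h * v_k for f(x) = |x| with constant step h."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - h * subgradient_abs(xs[-1]))
    return xs


# x0 = 1.0 is not an integer multiple of h = 0.3, so the iterates never hit 0.
iterates = constant_step_iterates(1.0, 0.3, 20)
print(iterates[-4:])  # the tail alternates between two values on opposite sides of 0
```

With these values the sequence settles into a two-point cycle around the minimizer instead of converging to it.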

Subgradient methods with constant step-size have already been considered in the convex optimization literature, and, typically, it is possible to prove that the iterates reach the sublevel set $\{x\in\mathbb{R}^{n}:f(x)\leq\inf f+c\}$, where the quantity $c>0$ is related to the step-size $h>0$. In a similar vein, if the objective function is strongly convex, the sequence produced by the algorithm reaches a ball centered at the minimizer, whose radius depends on $h$. For a presentation of these results, we refer the reader to [3, Section 3.2]. Moreover, under suitable assumptions on the growth of $f$ around the minimizer $x^{*}$, it is possible to prove that the distance of the iterates to $x^{*}$ decays linearly, up to a certain threshold (that, once again, is estimated as a function of $h$). For further details, see [11, Theorem 1] and [8, Theorem 4.3]. Finally, we mention the recent contribution [12], where the authors study the stability of a subgradient method with constant step-size around local minimizers, when $f$ is non-smooth and non-convex. To the best of our knowledge, the results presented here are the first convergence results for a subgradient method with constant step-size.

In this paper, we devote our attention to the case where the non-regular term at the right-hand side of (0.1) consists of the $\ell_{1}$-penalization, i.e., where we have $h(x)=\gamma|x|_{1}=\gamma\sum_{i=1}^{n}|x_{i}|$ with $\gamma>0$, and

$$f(x) = g(x) + \gamma|x|_{1}.$$

This kind of problem is well-studied since the presence of the $\ell_{1}$ -norm induces sparsity in the minimizer, and for this reason such minimization tasks easily arise in real-world applications. For instance, we recall [6] for signal processing applications, [29] for imaging problems, and finally [9, 27] for the $\ell_{1}$ -regularized logistic regression, which is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing.

In our approach, we take advantage of the structure of the points where the objective $f$ is non-differentiable. We recall that, in the case of $\ell_{1}$-penalization, such points coincide with the set $\bigcup_{i=1}^{n}\{x\in\mathbb{R}^{n}:x_{i}=0\}$. Hence, at each iteration, if the current value $x^{k}$ has some null component, i.e., $x^{k}\in\bigcap_{i\in\beta_{x^{k}}}\{x\in\mathbb{R}^{n}:x_{i}=0\}$ for some $\beta_{x^{k}}\subset\{1,\ldots,n\}$, we first decide which hyperplanes $\{x\in\mathbb{R}^{n}:x_{i}=0\}$, $i\in\beta_{x^{k}}$, we move parallel to. This choice is made automatically by selecting for the update (0.2) the direction $\mathfrak{v}^{k}=\partial^{-}f(x^{k})\in\partial f(x^{k})$, where $\partial^{-}f(x^{k})$ denotes the element of $\partial f(x^{k})$ with minimal Euclidean norm. The interesting situation occurs when some components strictly change sign when moving from $x^{k}$ to $x^{k}-h\partial^{-}f(x^{k})$. In that case, we have to properly decide whether to allow (some of) these changes of sign, or to set the corresponding components equal to $0$. We stress that this phase is fundamental in order to avoid the oscillations that characterized the one-dimensional example reported above. For this method, described in Algorithm 1, we can establish a linear convergence result as soon as the regular function $g:\mathbb{R}^{n}\to\mathbb{R}$ appearing at the right-hand side of (0.1) is strongly convex. To show that, we make use of a non-smooth version of the Polyak-Łojasiewicz inequality (see, e.g., [5, 28]).

Then, in Section 3, we propose a momentum-based acceleration of Algorithm 1, inspired by the restarted-conservative algorithm introduced in [22]. In the smooth convex framework, the idea of introducing momentum to accelerate the convergence of the classical gradient method dates back to the 1960s, with the works of Polyak [17, 18]. These methods, often called heavy-ball, can be interpreted as discretizations of a second-order damped mechanical system, where the objective function plays the role of the potential energy. In [26] it was shown that the celebrated Nesterov accelerated gradient method (see [13]) can also be interpreted in this framework. This led to a renewed interest in the interplay between discrete-time optimization algorithms and continuous-time dynamical models. In this context, the classical linear and isotropic viscous friction in the mechanical system is often replaced by a more general dissipative term. In this regard, we recall the contributions [1, 2, 23]. From the discrete-time side, in [16] the authors empirically observed that adaptively resetting the momentum variable (i.e., the velocity) to $0$ can further boost the convergence. Motivated by this fact, a conservative dynamical model (i.e., without any dissipative term in the dynamics) was considered in [22], whose convergence completely relies on a proper restart scheme. In Algorithm 2 we propose for composite functions with $\ell_{1}$-penalization a new version of the restarted-conservative algorithm that has been heuristically outlined in [22], and in Section 3 we show that the per-iteration decay achieved by Algorithm 2 is always greater than or equal to that of Algorithm 1.

Finally, in Section 4 we test our algorithms in strongly and non-strongly convex optimization problems with $\ell_{1}$ -regularization.

1. Preliminary results

In this section we establish some auxiliary results that will be used later. Given a convex function $f:\mathbb{R}^{n}\to\mathbb{R}$ , for every $x\in\mathbb{R}^{n}$ we denote with $\partial f(x)\subset\mathbb{R}^{n}$ the subdifferential of $f$ at the point $x$ . We recall that

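$$\partial f(x) := \{y\in\mathbb{R}^{n} \mid f(z) \geq f(x) + \langle y, z-x\rangle,\ \forall z\in\mathbb{R}^{n}\}.$$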

Definition 1.

Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a convex function. For every $x\in\mathbb{R}^{n}$ , we define the vector $\partial^{-}f(x)\in\mathbb{R}^{n}$ as follows

$$\partial^{-}f(x) := \arg\min\{|y|_{2} \mid y\in\partial f(x)\}. \tag{1.1}$$

Remark 1.

We observe that Definition 1 is always well-posed. Indeed, for every convex function $f:\mathbb{R}^{n}\to\mathbb{R}$ , for every $x\in\mathbb{R}^{n}$ the subdifferential $\partial f(x)$ is a non-empty, compact and convex subset of $\mathbb{R}^{n}$ . Namely, since we do not allow $f$ to assume the value $+\infty$ , this fact descends directly from [15, Theorem 3.1.15]. Moreover, we can equivalently rephrase (1.1) as

$$\partial^{-}f(x) := \arg\min\{|y|_{2}^{2} \mid y\in\partial f(x)\},$$

i.e., as a positive-definite quadratic programming problem on a convex domain. Hence, we deduce that $\partial^{-}f(x)$ is well-defined, and that it consists of a single element. Considering this last fact, in this paper we understand $\partial^{-}f:\mathbb{R}^{n}\to\mathbb{R}^{n}$ as a vector-valued operator, rather than a set-valued mapping.

We report below a non-smooth version of the celebrated Polyak-Łojasiewicz inequality. We refer the reader to [17] and [15, Theorem 2.1.10] for the classical statement in the smooth case, and to [5, Section 2.3] and [28, Section 2.2] for the extension to non-differentiable functions.

Lemma 1.1.

Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a $\mu$ -strongly convex function, and let $x^{*}$ be its minimizer. Then, for every $x\in\mathbb{R}^{n}$ and for every element of the subdifferential $y\in\partial f(x)$ the following inequality holds:

$$f(x) - f(x^{*}) \leq \frac{1}{2\mu}|y|_{2}^{2},$$

and, in particular, we have

$$f(x) - f(x^{*}) \leq \frac{1}{2\mu}|\partial^{-}f(x)|_{2}^{2}.$$

Proof.

Let us introduce the auxiliary function $\psi:\mathbb{R}^{n}\to\mathbb{R}$ defined as

$$\psi(x) := f(x) - \frac{\mu}{2}|x-x^{*}|_{2}^{2}.$$

The fact that $f$ is $\mu$ -strongly convex guarantees that $\psi$ is still a convex function. Moreover, for every $x\in\mathbb{R}^{n}$ we have that

$$\partial\psi(x) = \partial f(x) - \mu(x-x^{*}). \tag{1.3}$$

This follows immediately from the fact that $f(x)=\psi(x)+\frac{\mu}{2}|x-x^{*}|_{2}^{2}$ for every $x\in\mathbb{R}^{n}$ , and from the sum rule for subdifferentials (see, e.g., [21, Theorem 23.8]), i.e., $\partial f(x)=\partial\psi(x)+\mu(x-x^{*})$ . For every $x\in\mathbb{R}^{n}$ and for every $y\in\partial f(x)$ we compute

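$$\psi(x^{*}) \geq \psi(x) + \langle y - \mu(x-x^{*}), x^{*}-x\rangle = f(x) + \frac{\mu}{2}|x-x^{*}|_{2}^{2} + \langle y, x^{*}-x\rangle \geq f(x) - \frac{1}{2\mu}|y|_{2}^{2}, \tag{1.4}$$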

where we used (1.3) and the subdifferential inequality for the convex function $\psi$ . Recalling that $\psi(x^{*})=f(x^{*})$ , from (1.4) we directly deduce the thesis. ∎
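The inequality of Lemma 1.1 can be checked numerically on a simple instance. The sketch below assumes the hypothetical one-dimensional objective $f(x)=\frac{1}{2}(x-a)^{2}+\gamma|x|$, which is $1$-strongly convex and whose minimizer is the soft-thresholding of $a$; the values of `a`, `gamma` and the grid are illustrative. At $x=0$ the subdifferential is the interval $[-a-\gamma,-a+\gamma]$, which is sampled on a few points.

```python
def sign(t):
    """Sign of a real number: -1, 0, or 1."""
    return (t > 0) - (t < 0)


a, gamma, mu = 3.0, 1.0, 1.0  # hypothetical data; mu is the strong convexity constant


def f(x):
    return 0.5 * (x - a) ** 2 + gamma * abs(x)


def subdifferential_samples(x, m=11):
    """All of the subdifferential if x != 0, otherwise m samples of the interval at x = 0."""
    if x != 0:
        return [x - a + gamma * sign(x)]
    return [-a + gamma * (-1 + 2 * j / (m - 1)) for j in range(m)]


x_star = sign(a) * max(abs(a) - gamma, 0.0)  # minimizer via soft-thresholding

# Slack of the inequality: |y|^2 / (2 mu) - (f(x) - f(x*)), which should be >= 0.
grid = [k / 10.0 for k in range(-50, 51)]
slacks = [
    y ** 2 / (2 * mu) - (f(x) - f(x_star))
    for x in grid
    for y in subdifferential_samples(x)
]
print(min(slacks))
```

On this instance the slack vanishes (up to rounding) for $x>0$ and for the minimal-norm subgradient at $x=0$, so the constant $\frac{1}{2\mu}$ cannot be improved here.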

We now introduce the class of functions that will be the main object of our investigation. We consider a composite objective (see [14]) $f:\mathbb{R}^{n}\to\mathbb{R}$ of the form

$$f(x) = g(x) + \gamma|x|_{1}, \tag{1.5}$$

where $g:\mathbb{R}^{n}\to\mathbb{R}$ is a $C^{1}$ -regular convex function with Lipschitz-continuous gradient of constant $L>0$ , and where $\gamma>0$ is a positive constant. We recall that $|x|_{1}:=\sum_{i=1}^{n}|x_{i}|$ for every $x\in\mathbb{R}^{n}$ . We observe that

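$$\partial f(x) = \left\{\nabla g(x) + \gamma\sum_{i=1}^{n}\nu_{i}e_{i} \;\middle|\; \nu_{i} = \mathrm{sign}(x_{i}) \text{ if } x_{i}\neq 0,\ \nu_{i}\in[-1,1] \text{ if } x_{i}=0\right\} \tag{1.6}$$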

for every $x\in\mathbb{R}^{n}$ , where $e_{i}$ is the $i$ -th element of the standard basis of $\mathbb{R}^{n}$ . If we define $\partial_{i}f(x):=\langle e_{i},\partial f(x)\rangle$ , we have that

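$$\partial_{i}f(x) = \begin{cases}\{\partial_{i}g(x) + \gamma\nu_{i} \mid \nu_{i} = \mathrm{sign}(x_{i})\} & x_{i}\neq 0,\\ \{\partial_{i}g(x) + \gamma\nu_{i} \mid \nu_{i}\in[-1,1]\} & x_{i}=0,\end{cases} \tag{1.7}$$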

for every $i=1,\ldots,n$ , where $\partial_{i}g(x):=\frac{\partial}{\partial x_{i}}g(x)$ denotes the usual partial derivative of the regular term $g:\mathbb{R}^{n}\to\mathbb{R}$ at the right-hand side of (1.5). From (1.7) we read that the $i$ -th component of $\partial f(x)$ is affected only by $\nu_{i}$ . Therefore, in order to compute the operator $\partial^{-}f:\mathbb{R}^{n}\to\mathbb{R}^{n}$ introduced in Definition 1, we can find separately the element of minimal absolute value of $\partial_{i}f(x)$ for $i=1,\ldots,n$ . We use $\partial_{i}^{-}f(x)$ to access the $i$ -th component of $\partial^{-}f(x)$ . In particular, for every $x\in\mathbb{R}^{n}$ we have that

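$$\partial_{i}^{-}f(x) = \begin{cases}\partial_{i}g(x) + \gamma\,\mathrm{sign}(x_{i}) & x_{i}\neq 0,\\ \mathrm{sign}(\partial_{i}g(x))\max\{|\partial_{i}g(x)| - \gamma, 0\} & x_{i}=0.\end{cases} \tag{1.8}$$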
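The componentwise rule for $\partial^{-}f$ translates directly into code: where $x_{i}\neq 0$ the component is $\partial_{i}g(x)+\gamma\,\mathrm{sign}(x_{i})$, and where $x_{i}=0$ it is the soft-thresholding of $\partial_{i}g(x)$ at level $\gamma$. The following is a minimal sketch; the function name and the sample numbers are illustrative, and the gradient of $g$ is assumed to be supplied as a plain list.

```python
def sign(t):
    """Sign of a real number: -1, 0, or 1."""
    return (t > 0) - (t < 0)


def min_norm_subgradient(x, grad_g, gamma):
    """Minimal-norm element of the subdifferential of g + gamma * |.|_1 at x.

    x      -- current point, as a list of floats
    grad_g -- gradient of the smooth term g evaluated at x, same length as x
    gamma  -- positive l1-regularization weight
    """
    out = []
    for x_i, g_i in zip(x, grad_g):
        if x_i != 0:
            # f is differentiable in the i-th coordinate: a single choice.
            out.append(g_i + gamma * sign(x_i))
        else:
            # Pick nu_i in [-1, 1] minimizing |g_i + gamma * nu_i|.
            out.append(sign(g_i) * max(abs(g_i) - gamma, 0.0))
    return out


example = min_norm_subgradient([2.0, 0.0, 0.0, -1.0], [0.5, 0.3, -2.0, 0.5], 1.0)
print(example)
```

In the sample call, the second component vanishes because $|\partial_{2}g(x)|\leq\gamma$, which is exactly the situation in which the method moves tangentially to the hyperplane $\{x_{2}=0\}$.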

Definition 2.

Given $x=(x_{1},\ldots,x_{n})\in\mathbb{R}^{n}$ , we define the following partition of the components $\{1,\ldots,n\}$ induced by the point $x$ :

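$$\begin{aligned}\alpha_{x}^{+} &:= \{i\in\{1,\dots,n\} \mid x_{i}>0\},\\ \alpha_{x}^{-} &:= \{i\in\{1,\dots,n\} \mid x_{i}<0\},\\ \beta_{x} &:= \{i\in\{1,\dots,n\} \mid x_{i}=0\}.\end{aligned} \tag{1.9}$$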

From now on, when making use of a partition $\alpha^{1},\ldots,\alpha^{k}$ of the indexes of the components $\{1,\ldots,n\}$, for every $z=(z_{1},\ldots,z_{n})\in\mathbb{R}^{n}$ we write $z=(z_{\alpha^{1}},\ldots,z_{\alpha^{k}})$, where $z_{\alpha^{j}}\in\mathbb{R}^{|\alpha^{j}|}$ is the vector obtained by extracting from $z$ the components that belong to $\alpha^{j}$, i.e., $z_{\alpha^{j}}=(z_{i})_{i\in\alpha^{j}}$ for every $j=1,\ldots,k$. The next technical result is the key lemma of the convergence proof of Section 2.

Lemma 1.2.

Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a convex function of the form (1.5). Given $x\in\mathbb{R}^{n}$ , let $\alpha^{+}_{x},\alpha^{-}_{x},\beta_{x}$ be the partition of $\{1,\ldots,n\}$ corresponding to the point $x$ and prescribed by (1.9). Let us consider a vector $v=(v_{1},\ldots,v_{n})\in\mathbb{R}^{n}$ such that

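$$\begin{cases} x_{i}+v_{i}\geq 0 & \forall i\in\alpha_{x}^{+},\\ x_{i}+v_{i}\leq 0 & \forall i\in\alpha_{x}^{-},\\ v_{i}=0 & \forall i\in\beta_{x} \text{ s.t. } \partial_{i}^{-}f(x)=0,\\ v_{i}\geq 0 & \forall i\in\beta_{x} \text{ s.t. } \partial_{i}^{-}f(x)<0,\\ v_{i}\leq 0 & \forall i\in\beta_{x} \text{ s.t. } \partial_{i}^{-}f(x)>0.\end{cases} \tag{1.10}$$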

Then the following inequality holds:

$$f(x+v) \leq f(x) + \langle\partial^{-}f(x), v\rangle + \frac{1}{2}L|v|_{2}^{2}, \tag{1.11}$$

where $L>0$ is the Lipschitz constant of the regular term at the right-hand side of (1.5), and $\partial^{-}f(x)$ is defined as in Definition 1.

Remark 2.

We recall that, in the case of a regular convex function $\phi:\mathbb{R}^{n}\to\mathbb{R}$ with $L$ -Lipschitz continuous gradient, we have

$$\phi(x+v) \leq \phi(x) + \langle\nabla\phi(x), v\rangle + \frac{1}{2}L|v|_{2}^{2} \tag{1.12}$$

for every $x,v\in\mathbb{R}^{n}$ (see, e.g., [15, Theorem 2.1.5]). The crucial fact for the proof of Lemma 1.2 is that, when $v$ satisfies the conditions (1.10), the segment $\overrightarrow{xx^{\prime}}$ lies in a region where the restriction of the objective $f$ is regular, where we set $x^{\prime}:=x+v$ . Lemma 1.2 will be used to prove that, along proper directions, the objective function $f$ is decreasing.

Proof.

Before proceeding, we introduce another partition of the set of indexes $\beta_{x}$ :

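$$\beta_{x}^{+} := \{i\in\beta_{x} \mid v_{i}>0\}, \qquad \beta_{x}^{-} := \{i\in\beta_{x} \mid v_{i}<0\}, \qquad \beta_{x}^{0} := \{i\in\beta_{x} \mid v_{i}=0\},$$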

and we define

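$$\zeta^{+} := \alpha_{x}^{+}\cup\beta_{x}^{+}, \qquad \zeta^{-} := \alpha_{x}^{-}\cup\beta_{x}^{-}, \qquad \zeta^{0} := \beta_{x}^{0},$$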

where $\alpha^{+}_{x},\alpha^{-}_{x},\beta_{x}$ are defined according to (1.9). If we consider the segment $t\mapsto\eta(t)=x+tv$ for $t\in[0,1]$, it turns out that $\eta(t)\in C_{\zeta^{\pm,0}}$ for every $t\in[0,1]$, where

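$$C_{\zeta^{\pm,0}} := \{z\in\mathbb{R}^{n} \mid z_{\zeta^{+}}\geq 0,\ z_{\zeta^{-}}\leq 0,\ z_{\zeta^{0}}=0\}.$$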

Let us define the auxiliary function $f^{\mathrm{aux}}:\mathbb{R}^{n}\to\mathbb{R}$ as

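$$f^{\mathrm{aux}}: z=(z_{\zeta^{+}},z_{\zeta^{-}},z_{\zeta^{0}}) \mapsto g(z_{\zeta^{+}},z_{\zeta^{-}},0_{\zeta^{0}}) + \gamma\sum_{i\in\zeta^{+}}z_{i} - \gamma\sum_{i\in\zeta^{-}}z_{i}, \tag{1.13}$$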

where $g:\mathbb{R}^{n}\to\mathbb{R}$ is the smooth term at the right-hand side of (1.5). From the definition of $f^{\mathrm{aux}}$ , it follows that

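$$\nabla f^{\mathrm{aux}}: z=(z_{\zeta^{+}},z_{\zeta^{-}},z_{\zeta^{0}}) \mapsto \nabla g(z_{\zeta^{+}},z_{\zeta^{-}},0_{\zeta^{0}}) + \gamma\sum_{i\in\zeta^{+}}e_{i} - \gamma\sum_{i\in\zeta^{-}}e_{i}. \tag{1.14}$$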

We observe that the function $f^{\mathrm{aux}}:\mathbb{R}^{n}\to\mathbb{R}$ is as regular as $g$, i.e., it is of class $C^{1}$ with $L$-Lipschitz continuous gradient. Indeed, the first term at the right-hand side of (1.14) is obtained as the composition $\nabla g\circ\Pi_{\zeta^{0}}$, where $\Pi_{\zeta^{0}}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is the linear ($1$-Lipschitz) orthogonal projection onto the subspace $\{z\in\mathbb{R}^{n}\mid z_{\zeta^{0}}=0\}\subset\mathbb{R}^{n}$. Moreover, the last terms at the right-hand side of (1.14) are constant. Therefore, using the identity

$$f|_{C_{\zeta^{\pm,0}}} \equiv f^{\mathrm{aux}}|_{C_{\zeta^{\pm,0}}},$$

if we apply the estimate (1.12) to $f^{\mathrm{aux}}$ , we deduce that

$$f(x+v) \leq f(x) + \langle\nabla f^{\mathrm{aux}}(x), v\rangle + \frac{1}{2}L|v|_{2}^{2}.$$

Therefore, the thesis follows if we show that the following equalities hold:

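$$\frac{\partial}{\partial x_{i}}f^{\mathrm{aux}}(x)\,v_{i} = \partial_{i}^{-}f(x)\,v_{i}, \qquad i=1,\dots,n. \tag{1.15}$$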

Using the partition of the components $\{1,\ldots,n\}$ provided by the families of indexes $\alpha^{+}_{x}$ , $\alpha^{-}_{x}$ , $\beta^{+}_{x}$ , $\beta^{-}_{x}$ , and $\beta^{0}_{x}$ , we have the following possibilities:

• If $i\in\alpha^{+}_{x}$, in virtue of (1.8) and (1.13), we obtain $\partial_{i}^{-}f(x)=\frac{\partial}{\partial x_{i}}g(x)+\gamma=\frac{\partial}{\partial x_{i}}f^{\mathrm{aux}}(x)$.

• The case $i\in\alpha^{-}_{x}$ is analogous to $i\in\alpha^{+}_{x}$.

• If $i\in\beta^{+}_{x}$, then $x_{i}=0$ and $v_{i}>0$, and, in virtue of (1.10), we deduce that $\partial_{i}^{-}f(x)<0$. In particular, using again (1.8), this implies that $\partial_{i}^{-}f(x)=\frac{\partial}{\partial x_{i}}g(x)+\gamma$. On the other hand, recalling the expression of $f^{\mathrm{aux}}$ in (1.13) and the inclusion $\beta^{+}_{x}\subset\zeta^{+}$, we finally deduce $\frac{\partial}{\partial x_{i}}f^{\mathrm{aux}}(x)=\partial_{i}g(x)+\gamma$.

• The case $i\in\beta^{-}_{x}$ is analogous to $i\in\beta^{+}_{x}$.

• If $i\in\beta^{0}_{x}$, then $v_{i}=0$, and we immediately obtain $\partial_{i}^{-}f(x)v_{i}=0=\frac{\partial}{\partial x_{i}}f^{\mathrm{aux}}(x)v_{i}$.

This argument shows that (1.15) is true, and it concludes the proof. ∎

2. Subgradient method and convergence analysis

In this section we propose a subgradient method with constant step-size for the numerical minimization of a convex function $f:\mathbb{R}^{n}\to\mathbb{R}$ with the composite structure reported in (1.5). We stress that the analysis presented here holds only when the non-smooth term at the right-hand side of (1.5) is an $\ell_{1}$-penalization.

Before formally introducing the algorithm, we provide some insights that have guided us towards its construction. Let $\bar{x}\in\mathbb{R}^{n}$ be the current guess for the minimizer of $f$. We want to find a suitable direction in the subdifferential $\mathfrak{v}\in\partial f(\bar{x})$ such that $f(\bar{x}-h\mathfrak{v})\leq f(\bar{x})$, where $h>0$ represents a constant step-size. In order to accomplish this, a natural choice consists in setting $\mathfrak{v}=\partial^{-}f(\bar{x})$, where $\partial^{-}f(\bar{x})$ is defined as in (1.1). To see this, we first observe that, in virtue of the particular structure of $\partial f$ reported in (1.7), we can choose separately the components $\mathfrak{v}_{1},\ldots,\mathfrak{v}_{n}$ of the direction of the movement. If $\bar{x}_{i}\neq 0$, then $\partial_{i}f(\bar{x})$ consists of a single element, hence the only possible choice is $\mathfrak{v}_{i}=\partial^{-}_{i}f(\bar{x})$. If $\bar{x}_{i}=0$ and $\partial^{-}_{i}f(\bar{x})=0$, then the convex map $t\mapsto f(\bar{x}+te_{i})$ attains the minimum at $t=0$. Hence, any choice $\mathfrak{v}_{i}\in\partial_{i}f(\bar{x})$ with $\mathfrak{v}_{i}\neq 0$ would give $f(\bar{x}-h\mathfrak{v}_{i}e_{i})\geq f(\bar{x})$, resulting in an increase of the objective function. For this reason, it is convenient to set $\mathfrak{v}_{i}=\partial^{-}_{i}f(\bar{x})=0$, and to move tangentially to the non-differentiability region $\{x\in\mathbb{R}^{n}\mid x_{i}=0\}$. On the other hand, if $\bar{x}_{i}=0$ and, e.g., $\partial^{-}_{i}f(\bar{x})>0$, then $\partial_{i}f(\bar{x})\subset(0,+\infty)$, and for every choice of $\mathfrak{v}_{i}\in\partial_{i}f(\bar{x})$, we have that $\bar{x}_{i}-h\mathfrak{v}_{i}=-h\mathfrak{v}_{i}<0$. However, observing that the one-sided derivative in the direction of the movement satisfies $\lim_{h\to 0^{-}}(f(\bar{x}+he_{i})-f(\bar{x}))/h=\partial^{-}_{i}f(\bar{x})$, it looks natural to set once again $\mathfrak{v}_{i}=\partial^{-}_{i}f(\bar{x})$.

Besides the selection of the direction $\mathfrak{v}=\partial^{-}f(\bar{x})$, the second crucial aspect is whether some sign changes occur in the coordinates when moving from $\bar{x}$ to $\bar{x}-h\partial^{-}f(\bar{x})$. If not, the situation is analogous to a step of the classical gradient descent in the smooth framework. On the other hand, if there is, e.g., a positive component $\bar{x}_{i}$ that becomes negative, then we should carefully decide whether the barrier $\{x\in\mathbb{R}^{n}\mid x_{i}=0\}$ should be crossed or not. This is a key point in order to avoid the oscillations that characterized the simple example in the Introduction. In this case, we first set to $0$ the components involved in a sign change, and for these components we re-evaluate $\partial^{-}f$. Finally, using this additional information, we complete the step, as depicted in Figure 1. The implementation of the method is described in Algorithm 1.
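The two ingredients described above (the minimal-norm direction, and stopping at $x_{i}=0$ before re-evaluating $\partial^{-}f$) can be illustrated with a short sketch. This is a simplified reading of the prose, not a transcription of Algorithm 1: components that strictly change sign are set to $0$ and restarted from there with the re-evaluated direction, while all other components take the plain step. The helper names and the separable test problem $g(x)=\frac{1}{2}\sum_{i}w_{i}(x_{i}-a_{i})^{2}$ are assumptions chosen because the minimizer is then known in closed form; no convergence guarantee is claimed for this simplified variant on general objectives.

```python
def sign(t):
    """Sign of a real number: -1, 0, or 1."""
    return (t > 0) - (t < 0)


def min_norm_subgradient(x, grad_g, gamma):
    """Componentwise minimal-norm subgradient of g + gamma * |.|_1 at x."""
    g = grad_g(x)
    return [
        g_i + gamma * sign(x_i) if x_i != 0
        else sign(g_i) * max(abs(g_i) - gamma, 0.0)
        for x_i, g_i in zip(x, g)
    ]


def sign_aware_step(x, grad_g, gamma, h):
    """One constant-step update that does not jump across the hyperplanes x_i = 0."""
    d = min_norm_subgradient(x, grad_g, gamma)
    x_temp = [x_i - h * d_i for x_i, d_i in zip(x, d)]
    crossing = [x_i * t_i < 0 for x_i, t_i in zip(x, x_temp)]
    if not any(crossing):
        return x_temp  # no strict sign change: plain subgradient step
    # Stop the crossing components on the hyperplane x_i = 0 ...
    x_prime = [0.0 if c else x_i for x_i, c in zip(x, crossing)]
    # ... and re-evaluate the minimal-norm subgradient there.
    d_prime = min_norm_subgradient(x_prime, grad_g, gamma)
    return [
        -h * dp_i if c else t_i
        for c, dp_i, t_i in zip(crossing, d_prime, x_temp)
    ]


# Hypothetical separable test problem: g(x) = 0.5 * sum_i w_i * (x_i - a_i)^2.
w = [1.0, 2.0, 4.0]
a = [3.0, -0.2, -1.5]
gamma = 1.0
L = max(w)  # Lipschitz constant of grad g for this g


def grad_g(x):
    return [w_i * (x_i - a_i) for w_i, x_i, a_i in zip(w, x, a)]


x = [-1.0, 1.0, 2.0]
for _ in range(200):
    x = sign_aware_step(x, grad_g, gamma, 1.0 / L)

# Closed-form minimizer of the separable problem (soft-thresholding of a).
x_star = [sign(a_i) * max(abs(a_i) - gamma / w_i, 0.0) for w_i, a_i in zip(w, a)]
print(x, x_star)
```

On this test problem the second component is driven to exactly zero after one crossing and stays there, in contrast with the oscillation that a plain constant-step subgradient update would produce.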

We now establish the linear convergence result for Algorithm 1 in the case of strongly convex objective.

Theorem 2.1.

Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a function such that $f(x)=g(x)+\gamma|x|_{1}$ for every $x\in\mathbb{R}^{n}$ , where $\gamma>0$ and $g:\mathbb{R}^{n}\to\mathbb{R}$ is $C^{1}$ -regular. We further assume that there exist constants $L>\mu>0$ such that $g$ is $\mu$ -strongly convex and $\nabla g$ is $L$ -Lipschitz continuous. Let $(x^{k})_{k\geq 0}$ be the sequence generated by Algorithm 1. Then, there exists $\kappa=\kappa(L,\mu)\in(0,1)$ such that

$$f(x^{k}) - f(x^{*}) \leq \kappa^{k}\big(f(x^{0}) - f(x^{*})\big), \tag{2.1}$$

where $x^{*}\in\mathbb{R}^{n}$ denotes the unique minimizer of $f$ , and where we set the step-size $h=\frac{1}{L}$ .

Proof.

We follow the procedure described in Algorithm 1. We prove that each iteration leads to a linear decrease of the value of the objective function. The first stage of each step is based on the following update:

[TABLE]

where $h>0$ represents the step-size of the subgradient method. We distinguish two possible scenarios, corresponding to the if-else statement at lines 5 and 7 of Algorithm 1.

Case 1. We have that

[TABLE]

i.e., none of the components of $x^{k}$ and of $x^{\mathrm{temp}}$ changes sign, in the sense that from strictly positive it becomes strictly negative, or vice-versa. If we set $v:=-h\partial^{-}f(x^{k})$ , we observe that the hypotheses of Lemma 1.2 are met for the point $x^{k}$ and the vector $v$ . Indeed, using the partition introduced in (1.9) and induced by the point $x^{k}$ , from (2.3) it follows that $i\in\alpha^{+}_{x^{k}}$ implies $x^{k}_{i}+v_{i}\geq 0$ . A similar argument holds for $i\in\alpha^{-}_{x^{k}}$ . Finally, if $i\in\beta_{x^{k}}$ , then $v_{i}$ satisfies (1.10) by construction. Therefore, from (1.11) we deduce that

[TABLE]

Moreover, if $h\leq\frac{2}{L}$ , in virtue of Lemma 1.1, we obtain that

[TABLE]

In this case, we assign $x^{k+1}:=x^{\mathrm{temp}}$ and, choosing $h=\frac{1}{L}$ in order to minimize the right-hand side of the previous inequality, we get

[TABLE]
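The chain of estimates in Case 1 can be sketched as follows. This is a reconstruction from the statements of the two lemmas, not a copy of the displayed formulas: it applies the inequality of Lemma 1.2 with $v=-h\partial^{-}f(x^{k})$, and then the inequality of Lemma 1.1 with $y=\partial^{-}f(x^{k})$, using that $f$ is $\mu$-strongly convex because $g$ is. The factor $1-\frac{\mu}{L}$ is what this sketch gives for Case 1 alone; the constant $\kappa$ of Theorem 2.1 also has to account for Case 2.

```latex
\begin{align*}
f(x^{\mathrm{temp}})
  &\leq f(x^{k}) - h\,|\partial^{-}f(x^{k})|_{2}^{2}
        + \frac{L h^{2}}{2}\,|\partial^{-}f(x^{k})|_{2}^{2}
   = f(x^{k}) - h\Big(1-\frac{Lh}{2}\Big)|\partial^{-}f(x^{k})|_{2}^{2}, \\
f(x^{\mathrm{temp}}) - f(x^{*})
  &\leq \Big(1 - 2\mu h\Big(1-\frac{Lh}{2}\Big)\Big)\big(f(x^{k}) - f(x^{*})\big)
   \qquad \text{for } 0 < h \leq \tfrac{2}{L}, \\
f(x^{k+1}) - f(x^{*})
  &\leq \Big(1 - \frac{\mu}{L}\Big)\big(f(x^{k}) - f(x^{*})\big)
   \qquad \text{for } h = \tfrac{1}{L}.
\end{align*}
```

The choice $h=\frac{1}{L}$ maximizes $h\big(1-\frac{Lh}{2}\big)$, which is why it gives the smallest contraction factor in the second line.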

Case 2. Recalling the definition of $x^{\mathrm{temp}}$ in (2.2), we are in the second scenario when

[TABLE]

i.e., there is at least one component that strictly changes sign. Before proceeding, we introduce the following partition of the components:

[TABLE]

and we define the following intermediate points:

[TABLE]

and

[TABLE]

where

[TABLE]

We observe that (2.7) corresponds to the assignments of lines 9-10 in Algorithm 1, while (2.9) incorporates lines 11-12. Finally, $x^{\prime\prime}$ is defined in (2.8) according to line 13. We stress that in the update (2.8) the vector $v^{\prime\prime}$ is computed by re-evaluating $\partial^{-}_{\xi^{0}_{x^{k}}}f$ at the point $x^{\prime}$. This is because $\partial^{-}_{\xi^{0}_{x^{k}}}f$ may exhibit sudden changes when considering the points $x^{k}$ and $x^{\prime}$. In this regard, our construction guarantees that we employ the most trustworthy values for the choice of the decrease direction $v^{\prime\prime}$. We point out that, if $x^{k}_{i}=0$, then $i\in\xi^{0}_{x^{k}}$. Moreover, we remark that if $i\in\xi^{0}_{x^{k}}$ and $\partial_{i}f(x^{k})=0$, then we have necessarily that $x^{k}_{i}=0$. Indeed, in this case, from (2.2) and $\partial_{i}f(x^{k})=0$ it follows that $x^{\mathrm{temp}}_{i}=x_{i}^{k}$, while $i\in\xi^{0}_{x^{k}}$ gives $\mathrm{sign}(x_{i}^{k}\cdot x_{i}^{k})\leq 0$, resulting in $x_{i}^{k}=0$.

Phase (1). From (2.7), we immediately observe that

[TABLE]

with

[TABLE]

and where, for every $i\in\xi^{0}_{x^{k}}$ , we set

[TABLE]

We first notice that $\eta_{i}\in[0,1]$ . Indeed, assuming that $\partial_{i}^{-}f(x^{k})\neq 0$ (otherwise there is nothing to prove), since $i\in\xi^{0}_{x^{k}}$ , recalling (2.6) and (2.2), we have

[TABLE]

which in turn gives $x_{i}^{k}\partial^{-}_{i}f(x^{k})\geq 0$ and, consequently, $\eta_{i}\geq 0$ . On the other hand, in order to show that $\eta_{i}\leq 1$ , we assume without loss of generality that $x^{k}_{i}\neq 0$ . Then, using again (2.11), it follows that

[TABLE]

that yields $\eta_{i}\leq 1$ . Therefore, we conclude that

[TABLE]

Finally, from (2.5) we deduce that there exists at least one index $\hat{i}\in\xi^{0}_{x^{k}}$ such that $\eta_{\hat{i}}>0$ .

Using the partition $\alpha^{+}_{x^{k}},\alpha^{-}_{x^{k}},\beta_{x^{k}}$ of $\{1,\ldots,n\}$ induced by the point $x^{k}$ and prescribed by (1.9), we obtain that the following conditions are satisfied:

• If $i\in\alpha^{+}_{x^{k}}$ , then either $i\in\xi^{+}_{x^{k}}$ or $i\in\xi^{0}_{x^{k}}$ . In the first case, $x^{\prime}_{i}=x^{k}_{i}>0$ , then $v^{\prime}_{i}=0$ . In the second, $x^{\prime}_{i}=0=x^{k}_{i}+v^{\prime}_{i}$ . Hence, in any case, $x^{k}_{i}+v^{\prime}_{i}\geq 0$ .

• If $i\in\alpha^{-}_{x^{k}}$ , then an analogous reasoning as before yields $x^{k}_{i}+v^{\prime}_{i}\leq 0$ .

• If $i\in\beta_{x^{k}}$ , then $x^{k}_{i}=0$ . Hence, $i\in\xi^{0}_{x^{k}}$ , and $x^{\prime}_{i}=0$ . Therefore, $v^{\prime}_{i}=0$ .

The previous argument proves that the vector $v^{\prime}$ introduced in (2.10) satisfies the assumptions of Lemma 1.2 at the point $x^{k}$ . Thus, we deduce that

[TABLE]

If we set $\bar{\eta}:=\max\{\eta_{i}\mid i\in\xi^{0}_{x^{k}}\}$ , we observe that (2.13) implies that $f(x^{\prime})\leq f(x^{k})$ whenever $h\in\left[0,\frac{2}{L\bar{\eta}}\right]$ . We stress the fact that the condition (2.5) that characterizes the present scenario guarantees that $\bar{\eta}>0$ .
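The construction of Phase (1) can be sketched in a few lines. This is an illustrative reading of the argument above, not a transcription of Algorithm 1: it assumes that $\xi^{0}_{x^{k}}$ collects the indices with $x^{k}_{i}\cdot x^{\mathrm{temp}}_{i}\leq 0$, that $x^{\prime}$ is obtained from $x^{k}$ by zeroing those components, and that $\eta_{i}=x^{k}_{i}/(h\,\partial^{-}_{i}f(x^{k}))$ whenever the denominator is non-zero (and $\eta_{i}=0$ otherwise). The function name `phase_one` and the input vector `d`, standing for $\partial^{-}f(x^{k})$, are hypothetical.

```python
import numpy as np

def phase_one(x, d, h):
    """Sketch of Phase (1): stop the sign-changing components at zero.

    x : current point x^k
    d : direction playing the role of ∂^- f(x^k) (supplied by the caller)
    h : constant step-size
    """
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    x_temp = x - h * d                  # tentative full step
    xi0 = x * x_temp <= 0               # components that vanish or change sign
    x_prime = np.where(xi0, 0.0, x)     # the intermediate point x'
    # eta_i is the fraction of the full step needed to reach zero.
    eta = np.zeros_like(x)
    active = xi0 & (d != 0)
    eta[active] = x[active] / (h * d[active])
    v_prime = x_prime - x               # equals -eta * h * d componentwise
    return x_temp, xi0, x_prime, eta, v_prime

x_k = np.array([1.0, -0.5, 0.0, 2.0])
d_k = np.array([3.0, -1.0, 0.7, 0.5])
x_temp, xi0, x_prime, eta, v_prime = phase_one(x_k, d_k, h=0.5)
eta_bar = eta[xi0].max()                # the quantity called \bar{eta} in the text
```

In this example the first three components belong to $\xi^{0}_{x^{k}}$, every $\eta_{i}$ lies in $[0,1]$, and $\bar{\eta}>0$ because the first component strictly changes sign.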

Phase (2). We now investigate the update described in (2.8)-(2.9). Let $\alpha^{+}_{x^{\prime}},\alpha^{-}_{x^{\prime}}$ and $\beta_{x^{\prime}}$ be the partition of the components $\{1,\ldots,n\}$ induced by the point $x^{\prime}$ and prescribed by (1.9). Recalling (2.6) and the definition of $x^{\prime}$ in (2.7), we observe that $\alpha^{+}_{x^{\prime}}=\xi^{+}_{x^{k}}$ , $\alpha^{-}_{x^{\prime}}=\xi^{-}_{x^{k}}$ and $\beta_{x^{\prime}}=\xi^{0}_{x^{k}}$ . Hence, since $x^{\prime}_{i}=x^{k}_{i}\neq 0$ for every $i\in\alpha^{+}_{x^{\prime}}\cup\alpha^{-}_{x^{\prime}}$ , from (1.8) it follows that

[TABLE]

which, in virtue of (2.9), yields

[TABLE]

Moreover, using (2.9), (2.2) and (2.6), we deduce that

[TABLE]

On the other hand, from (2.9) and recalling that $\beta_{x^{\prime}}=\xi^{0}_{x^{k}}$ , we have that

[TABLE]

By combining (2.15) and (2.16), we obtain that the hypotheses of Lemma 1.2 are met when considering the point $x^{\prime}$ and the direction $v^{\prime\prime}$ . Hence, it follows that

[TABLE]

On the other hand, recalling (2.9) and (2.14), we have that

[TABLE]

where we used the Lipschitz-continuity of $\nabla g$ , (2.10) and the fact that $x^{\prime}-x^{k}=v^{\prime}$ . If we set $h=\frac{1}{L}$ in (2.17), owing to (2.18) we deduce that

[TABLE]

Moreover, by combining the last inequality with (2.13) (using again $h=\frac{1}{L}$ ), we obtain that

[TABLE]

where we used (2.12) in the last passage. In virtue of Lemma 1.1, from (2.19) we deduce that

[TABLE]

We now distinguish two possibilities, corresponding to the if-else statement at lines 14 and 16 of Algorithm 1.

• If $f(x^{\prime})<f(x^{\prime\prime})$ , then we set $x^{k+1}:=x^{\prime}$ .

• If $f(x^{\prime\prime})\leq f(x^{\prime})$ , then we set $x^{k+1}:=x^{\prime\prime}$ .

In any case, from (2.20) we obtain

[TABLE]

Finally, in virtue of (2.4) and (2.21), if we set

[TABLE]

we deduce the thesis. ∎

*Remark 3.*

The hypothesis of the strong convexity of the smooth function $g:\mathbb{R}^{n}\to\mathbb{R}$ in Theorem 2.1 can be slightly relaxed by requiring that $g:\mathbb{R}^{n}\to\mathbb{R}$ is convex, that the objective $f:\mathbb{R}^{n}\to\mathbb{R}$ admits a minimizer $x^{*}$ and that there exists a constant $\mu>0$ such that $f$ satisfies the inequality (1.2) for every $x\in\mathbb{R}^{n}$ . Indeed, in the proof of Theorem 2.1 we only employ (1.2), and we do not use the strong convexity assumption. On the other hand, the assumption of convexity for $g$ is needed for the notion of subgradient considered in this paper.

3. Accelerated subgradient method

In this section we propose a momentum-based acceleration of Algorithm 1 for an objective function $f:\mathbb{R}^{n}\to\mathbb{R}$ with the $\ell_{1}$ -composite structure introduced in (1.5). As observed in the Introduction, in the smooth-objective framework it is possible to design minimization schemes with momentum by discretizing second order ODEs of the form:

[TABLE]

where $V:\mathbb{R}^{n}\to\mathbb{R}$ represents the objective function, and $A(x,t)\in\mathbb{R}^{n\times n}$ is a positive semi-definite matrix that tunes the generalized viscosity friction. In [16] it was noticed that adaptive restart strategies can further accelerate the convergence to the minimizer, since they are capable of eliminating the oscillations typical of under-damped mechanical systems. The term adaptive restart denotes a procedure that resets the momentum/velocity variable (i.e., $p$ in (3.1)) to $0$ as soon as a suitable condition is satisfied. In [22] a conservative dynamics was considered, obtained by dropping the viscosity term, i.e., by choosing $A(x,t)\equiv 0$ in (3.1). Then, using the symplectic Euler scheme (see, e.g., [10]) to discretize the system, the following conservative algorithm was proposed:

[TABLE]

where $h_{m}>0$ represents the discretization step-size. In the case of a regular and convex objective $V$ , the conservative scheme (3.2) achieves at each iteration a decrease of the function $V$ greater than or equal to that of the classical gradient descent. This fact relies on the following restart strategy: “reset $p^{k}=0$ whenever $\langle\nabla V(x^{k+1}),p^{k}\rangle>0$ ”. In [22] a heuristic extension of (3.2) was also investigated for the case of a non-smooth objective $f:\mathbb{R}^{n}\to\mathbb{R}$ with $\ell_{1}$ -composite structure, where $\partial^{-}f(x^{k})$ was used in (3.2) in place of $\nabla V(x^{k})$ , i.e.,

[TABLE]
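For a smooth convex objective $V$, the conservative scheme with restart recalled above can be sketched as follows. This is one possible reading, under the assumption that (3.2) is the usual symplectic Euler update $p^{k+1}=p^{k}-h_{m}\nabla V(x^{k})$, $x^{k+1}=x^{k}+h_{m}p^{k+1}$, and that the restart discards $p^{k}$ before the position update; the function name `conservative_restart` and the quadratic test objective are hypothetical.

```python
import numpy as np

def conservative_restart(grad_V, x0, h, n_iter=200):
    """Conservative inertial scheme with restart, for a smooth convex V.

    h is the squared discretization step, i.e. h = h_m**2.
    Returns the list of iterates.
    """
    x = np.asarray(x0, dtype=float).copy()
    p = np.zeros_like(x)
    h_m = np.sqrt(h)
    iterates = [x.copy()]
    for _ in range(n_iter):
        g = grad_V(x)
        q = x - h * g              # plain gradient-descent step
        x_new = q + h_m * p        # add the inertial displacement
        if np.dot(grad_V(x_new), p) > 0:
            # Restart: the momentum points uphill at the new point,
            # so it is discarded and we fall back to the gradient step.
            p = np.zeros_like(x)
            x_new = q
        p = p - h_m * g            # momentum update of symplectic Euler
        x = x_new
        iterates.append(x.copy())
    return iterates

# Toy quadratic V(x) = 0.5 * x^T A x with Lipschitz constant L = 100.
A = np.diag([1.0, 10.0, 100.0])
V = lambda x: 0.5 * x @ A @ x
iterates = conservative_restart(lambda x: A @ x, np.ones(3), h=1.0 / 100.0)
values = [V(x) for x in iterates]
```

With $h=1/L$ the restart test guarantees, by convexity, that each iterate is at least as good as the corresponding gradient-descent point, so the recorded values are non-increasing.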

In this section, taking advantage of the observations made in Section 2 for the non-accelerated subgradient method, we propose a variant of the algorithm described in [22, Algorithm 4]. The main differences concern the way we manage the changes of sign in the components, and the condition for the reset of the momentum variable. Indeed, from (3.3) we deduce that

[TABLE]

where we set $h=h_{m}^{2}$ . Therefore, it is natural to divide every step of the accelerated algorithm into two phases:

• $q\leftarrow x^{k}-h\partial^{-}f(x^{k})$ (subgradient phase). If sign changes in the components occur, we adopt the same procedures as in Algorithm 1.

• $q^{\prime}\leftarrow q+\sqrt{h}p^{k}$ (momentum phase). Also in this phase, we take particular care of sign changes of the components.

Moreover, we use the general principle that “in the momentum phase we do not modify null components”. This is motivated by the fact that the momentum variable carries information about the previous values of $\partial^{-}f$ . However, since $\partial_{i}^{-}f$ typically undergoes sudden modifications when the $i$ -th component of the state variable $x^{k}$ vanishes or changes sign, the information contained in $p^{k}_{i}$ could be of little use, if not misleading. For this reason, in Algorithm 2 we set $p_{i}^{k}=0$ if the $i$ -th component of the state variable is null, or if it has been involved in a sign change. See, respectively, line 10 and line 17 of the accelerated subgradient method reported in Algorithm 2. Finally, in virtue of (3.4) and the remarks made above, we observe that a natural choice for the step-size is $h=1/L$ , where $L$ is the Lipschitz constant of the gradient of the regular term $g:\mathbb{R}^{n}\to\mathbb{R}$ .

*Remark 4.*

In line 31 of Algorithm 2 we have introduced the quantity $\widetilde{\partial}f(q^{\prime})$ . We recall that $f(x)=g(x)+\gamma|x|_{1}$ , where $g$ is convex and $C^{1}$ -regular, and $\gamma>0$ . Using the same notations as in Algorithm 2, $\widetilde{\partial}f(q^{\prime})=(\widetilde{\partial}_{1}f(q^{\prime}),\ldots,\widetilde{\partial}_{n}f(q^{\prime}))$ is defined as follows:

[TABLE]

for every $i=1,\ldots,n$ . We observe that $\widetilde{\partial}f(q^{\prime})$ is well-defined for every component since, by construction, $\mathrm{sign}(q_{i}\cdot q^{\prime}_{i})\geq 0$ for every $i=1,\ldots,n$ .

*Remark 5.*

We observe that the computation of the quantity $r$ at line 35 requires an evaluation of the subdifferential of $f$ at the point $q^{\prime}$ . From a computational viewpoint, the demanding part is the evaluation of the gradient of the regular term, i.e., $\nabla g(q^{\prime})$ . However, if $r\leq 0$ , then $x\leftarrow q^{\prime}$ (line 42), and $\nabla g(q^{\prime})$ can be stored and re-used for the construction of $\partial^{-}f(x)$ at the subsequent iteration.

We can prove the following result on the decrease of the objective function $f$ , guaranteeing that, in any circumstance, Algorithm 2 is at least as good as Algorithm 1.

Proposition 3.1.

Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a function such that $f(x)=g(x)+\gamma|x|_{1}$ for every $x\in\mathbb{R}^{n}$ , where $\gamma>0$ and $g:\mathbb{R}^{n}\to\mathbb{R}$ is a $C^{1}$ convex function such that $\nabla g$ is $L$ -Lipschitz continuous, with $L>0$ . Let us consider $q^{0}\in\mathbb{R}^{n}$ as the initial point, and let $q^{\prime}$ be the output produced by an iteration of Algorithm 2 and let $q$ be the output of an iteration of Algorithm 1 (see line 29 of Algorithm 2). Then, we have that $f(q^{\prime})\leq f(q)$ .

*Remark 6.*

Under the same assumptions as Theorem 2.1, i.e., when $g:\mathbb{R}^{n}\to\mathbb{R}$ is $\mu$ -strongly convex, from Proposition 3.1 it follows that Algorithm 2 achieves a linear convergence rate. Indeed, if we denote by $(x^{k})_{k\geq 0}$ the sequence generated by Algorithm 2 setting the step-size $h$ equal to the inverse of the Lipschitz constant of $\nabla g$ , then, if we apply Proposition 3.1 with $q^{0}=x^{k}$ , for every $k\geq 0$ we have:

[TABLE]

where $\kappa\in(0,1)$ is the constant appearing in Theorem 2.1, and $q\in\mathbb{R}^{n}$ is the output of a single iteration of Algorithm 1 with starting point $x^{k}$ .

Proof.

Using the same notations as in Algorithm 2, we have that $q$ is obtained from $q^{0}$ with an iteration of Algorithm 1 (see line 19 and line 22 of Algorithm 2). If $p=0$ , then there is nothing to prove. On the other hand, owing to the if statement at lines 26-30, we have that $\mathrm{sign}(q^{\prime}_{i}\cdot q_{i})\geq 0$ for every $i=1,\ldots,n$ . We further observe that $q^{\prime}=q+\sqrt{h}p$ holds in every case (see line 25 and line 29). Let us define

[TABLE]

and the set

[TABLE]

Then, we have that $q,q^{\prime}\in Z_{\xi^{\pm},\xi^{0}}$ , and that the restrictions $f|_{Z_{\xi^{\pm},\xi^{0}}}\equiv\tilde{f}|_{Z_{\xi^{\pm},\xi^{0}}}$ , where $\tilde{f}:\mathbb{R}^{n}\to\mathbb{R}$ is a $C^{1}$ -regular and convex function that satisfies:

[TABLE]

Moreover, from (3.5) we read that $\nabla\tilde{f}(q^{\prime})=\widetilde{\partial}f(q^{\prime})$ . Since $\tilde{f}$ is convex, we have that

[TABLE]

and, recalling that $f(q)=\tilde{f}(q)$ and $f(q^{\prime})=\tilde{f}(q^{\prime})$ , it follows that the condition $\langle\nabla\tilde{f}(q^{\prime}),p\rangle\leq 0$ implies $f(q^{\prime})\leq f(q)$ .

On the other hand, if $\langle\nabla\tilde{f}(q^{\prime}),p\rangle>0$ , then we reset $q^{\prime}=q$ (see line 35), and $f(q^{\prime})=f(q)$ . ∎
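The convexity step used in the proof can be written out explicitly. The following is a sketch of the reasoning based only on the relation $q^{\prime}=q+\sqrt{h}p$ and on the identities $f(q)=\tilde{f}(q)$ , $f(q^{\prime})=\tilde{f}(q^{\prime})$ stated above; it is not a reproduction of the displayed formulas of the original proof.

```latex
\begin{aligned}
f(q) = \tilde{f}(q)
  &\geq \tilde{f}(q^{\prime}) + \langle \nabla\tilde{f}(q^{\prime}),\, q - q^{\prime} \rangle \\
  &= f(q^{\prime}) - \sqrt{h}\,\langle \nabla\tilde{f}(q^{\prime}),\, p \rangle .
\end{aligned}
```

Hence, whenever $\langle\nabla\tilde{f}(q^{\prime}),p\rangle\leq 0$ , the subtracted term is non-positive and $f(q)\geq f(q^{\prime})$ .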

4. Numerical experiments

In this section we present some numerical experiments involving composite objective functions with $\ell_{1}$ -regularization. We tested Algorithm 1 and its accelerated version Algorithm 2 on objective functions of the form $f(x)=g(x)+\gamma|x|_{1}$ , where $g:\mathbb{R}^{n}\to\mathbb{R}$ is convex and regular. We considered both the strongly convex and the non-strongly convex case. For each class of problems, we compared the performances of our methods with ISTA, i.e., the standard forward-backward thresholding algorithm for $\ell_{1}$ -regularized problems (see, e.g., [7]). In [4] an accelerated version of ISTA (called Fast ISTA, or FISTA) was proposed, and in [16] it was observed that the convergence rate of FISTA can be further improved by means of adaptive restarts. We use the restarted FISTA described in [16] as the benchmark for the experiments of this part. We also reported the performances of the conservative-restart algorithm introduced in [22]. The results are illustrated in Figure 2.
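For reference, the ISTA baseline mentioned above can be sketched as follows. This is the textbook forward-backward iteration with step-size $1/L$ and soft-thresholding, written under the assumption that a gradient oracle `grad_g` and the Lipschitz constant `L` are available; it is not the code used for the experiments, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal map of tau*|.|_1: shrink each entry towards zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(grad_g, x0, gamma, L, n_iter=100):
    """Forward-backward splitting for g(x) + gamma*|x|_1 with step 1/L."""
    x = np.asarray(x0, dtype=float).copy()
    h = 1.0 / L
    for _ in range(n_iter):
        # Forward (gradient) step on g, backward (proximal) step on the l1 term.
        x = soft_threshold(x - h * grad_g(x), h * gamma)
    return x

# Toy problem: g(x) = 0.5*|x - c|^2, whose l1-regularized minimizer
# is the soft-thresholding of c at level gamma.
c = np.array([1.0, 0.3, -1.0])
x_ista = ista(lambda x: x - c, x0=np.zeros(3), gamma=0.5, L=1.0, n_iter=5)
```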

Quadratic function with $\ell_{1}$ -regularization

We considered a function $f:\mathbb{R}^{n}\to\mathbb{R}$ of the form

[TABLE]

where $M\in\mathbb{R}^{n\times n}$ is a symmetric positive definite matrix with eigenvalues sampled uniformly in the interval $[0.02,100]$ , and $b\in\mathbb{R}^{n}$ was generated with a Gaussian distribution $\mathcal{N}(0,4)$ . We set $\gamma=0.25\,|b|_{\infty}$ , and we sampled the starting point using $\mathcal{N}(0,2)$ . We fixed the dimension $n=1000$ . We observe that the objective function $f$ is strongly convex; hence, in principle, it would be possible to consider optimization schemes designed for strongly convex problems. However, their efficiency relies on how sharp the available estimate of the strong convexity constant is. On the other hand, neither restarted-FISTA nor Algorithm 2 requires this information. This is one of the features of restarted-FISTA highlighted in [16].
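The data for this test problem could be generated along the following lines. This is a hedged sketch, not the authors' script: it assumes that the second argument of $\mathcal{N}(\cdot,\cdot)$ denotes the variance, and it builds $M$ from a random orthogonal basis obtained by a QR factorization, which is one of several ways to prescribe the spectrum. The function name `make_quadratic_problem` is hypothetical.

```python
import numpy as np

def make_quadratic_problem(n=1000, seed=0):
    """Random data for the l1-regularized quadratic test problem."""
    rng = np.random.default_rng(seed)
    # Prescribed spectrum, uniform in [0.02, 100].
    eig = rng.uniform(0.02, 100.0, size=n)
    # Random orthogonal matrix from the QR factorization of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    M = (Q * eig) @ Q.T          # Q diag(eig) Q^T
    M = 0.5 * (M + M.T)          # remove round-off asymmetry
    # Assumption: N(0, 4) and N(0, 2) are read as variances.
    b = rng.normal(0.0, 2.0, size=n)
    x0 = rng.normal(0.0, np.sqrt(2.0), size=n)
    gamma = 0.25 * np.max(np.abs(b))
    L = eig.max()                # Lipschitz constant of the gradient of the quadratic part
    return M, b, gamma, x0, L

M, b, gamma, x0, L = make_quadratic_problem(n=40, seed=1)
```

The largest eigenvalue of $M$ gives the Lipschitz constant of $\nabla g$ regardless of the linear term, so the step-size $h=1/L$ can be read off directly from the generated spectrum.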

Quadratic regression with $\ell_{1}$ -regularization

We considered a sparse quadratic regression problem. We generated a sparse random vector $y\in\mathbb{R}^{n}$ whose components were non-zero with probability $p=0.3$ . These values were sampled using a uniform distribution over $[0,1]$ . We took a matrix $A\in\mathbb{R}^{m\times n}$ whose singular values were uniformly sampled in $[1,10]$ , and we set $b=Ay+w$ , where $w\in\mathbb{R}^{m}$ represented a Gaussian noise distributed as $\mathcal{N}(0,0.1)$ . Finally, the objective function had the form

[TABLE]

with $\gamma=1$ . We used $n=1000$ and $m=500$ , and we sampled the components of the initial guess with $\mathcal{N}(0,2)$ . This problem is non-strongly convex, since the matrix $M=A^{T}A$ does not have full rank.
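The rank deficiency of $A^{T}A$ when $m<n$ can be checked on a small instance. The sketch below assumes that $A$ is assembled from random orthonormal factors and singular values drawn uniformly in $[1,10]$; the construction and the name `make_regression_matrix` are illustrative choices, not taken from the paper.

```python
import numpy as np

def make_regression_matrix(m, n, seed=0):
    """Random m x n matrix (m <= n) with singular values uniform in [1, 10]."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(1.0, 10.0, size=m)
    U, _ = np.linalg.qr(rng.standard_normal((m, m)))   # m x m orthogonal
    W, _ = np.linalg.qr(rng.standard_normal((n, m)))   # n x m, orthonormal columns
    return (U * s) @ W.T                               # U diag(s) W^T

A = make_regression_matrix(m=5, n=10, seed=3)
gram = A.T @ A
smallest_eig = np.linalg.eigvalsh(gram)[0]   # numerically zero when m < n
rank_A = np.linalg.matrix_rank(A)            # equals m, hence rank(A^T A) = m < n
```

Since $A^{T}A$ has rank at most $m$, its smallest eigenvalue vanishes and the quadratic term is convex but not strongly convex.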

Logistic regression with $\ell_{1}$ -regularization

We considered a sparse logistic regression problem. We constructed $x^{\mathrm{real}}\in\mathbb{R}^{n}$ with the following procedure: each component was zero with probability $p=0.8$ , and, if nonzero, its value was sampled using a standard normal $\mathcal{N}(0,1)$ . Then, we independently sampled the entries of $b=(b_{1},\ldots,b_{m})\in\{0,1\}^{m}$ using the distribution: $\mathbb{P}(b_{i}=1)=(1+\exp(\langle M_{i},x^{\mathrm{real}}\rangle))^{-1}$ for every $i=1,\ldots,m$ , where $M_{1},\ldots,M_{m}\in\mathbb{R}^{n}$ are the rows of a matrix $M\in\mathbb{R}^{m\times n}$ with independent components generated with $\mathcal{N}(0,1)$ . Assuming that the matrix $M$ and the measurements $b$ are known, the sparse log-likelihood maximization can be formulated as the problem of minimizing

[TABLE]

where we set $\gamma=0.25|\nabla g(0)|_{\infty}$ . We used $n=100$ and $m=500$ , and we sampled the components of the initial guess with $\mathcal{N}(0,2)$ . This problem is convex but not strongly convex.

LogSumExp with $\ell_{1}$ -regularization

We considered the function $f:\mathbb{R}^{n}\to\mathbb{R}$ defined as follows:

[TABLE]

where $M_{1},\ldots,M_{k}\in\mathbb{R}^{n}$ are the rows of the matrix $M\in\mathbb{R}^{k\times n}$ , and $b\in\mathbb{R}^{k}$ . The entries of $M$ and $b$ were independently sampled using a Gaussian $\mathcal{N}(0,1)$ , as well as the components of the starting point. We set $r=5$ , and we used $n=200$ and $k=500$ . This is another example of a non-strongly convex problem.

We briefly comment on the results of the experiments described above. We observe that the non-accelerated algorithms, i.e., Algorithm 1 and ISTA, always have very similar performances. Restarted FISTA is the best-performing method in the strongly convex case, while Algorithm 2 seems to be the most efficient with non-strongly convex objectives. Compared to the restart-conservative algorithm of [22], Algorithm 2 is much faster in the early phases of the minimization process. Finally, the classical subgradient method with diminishing step-size is the worst-performing scheme.

The fact that the decays achieved by Algorithm 1 and ISTA are almost identical motivated us to construct an example where the difference in performances could be more apparent. We considered a two-dimensional function such that $x^{*}=(1,0)$ , and such that $\partial f(x^{*})=\{0\}\times[0,\epsilon]$ , for some $\epsilon>0$ . More precisely, we defined $f:\mathbb{R}^{2}\to\mathbb{R}$ as

[TABLE]

and we set $c=0.85$ and $\gamma=1$ . In this case, correctly identifying that the second component of the minimizer is null can be challenging. This is due to the identity $\partial_{2}f(x^{*})=[0,\epsilon]$ , or, in other words, to the fact that the vector $0$ does not lie in the relative interior of $\partial f(x^{*})$ . In this scenario, when a crossing of the set $\{x\in\mathbb{R}^{2}:x_{2}=0\}$ occurs, we expect that Algorithm 1 might better decide whether the component $x_{2}$ should be set equal to $0$ . We used as initial guess the point $x^{0}=(0.95,0.5)$ . We also considered a family of problems obtained by perturbing $c$ , $\gamma$ and $x^{0}$ with Gaussian noise of standard deviations, respectively, equal to $0.1$ , $0.1$ and $0.05$ . The results are reported in Figure 3. We observe that Algorithm 1 achieves better performances than ISTA on the designed problem, and this advantage seems to be robust with respect to the perturbations introduced. Finally, despite using step-sizes that decay faster than in the previous experiments, the classical subgradient method exhibits evident oscillations, both in the original and in the noisy problem.

Conclusions

In this paper, we considered composite convex optimization problems with $\ell_{1}$ -penalization, and we formulated a subgradient algorithm with constant step-size. In the case of strongly convex objectives, we established a linear convergence result for the method. Using dynamical system considerations, we proposed an accelerated version of the subgradient algorithm that, at each iteration, achieves a decay of the objective that is always greater than or equal to the decay corresponding to a step of the non-accelerated subgradient method. We observed in numerical experiments that the inertial algorithm can effectively compete with one of the best-performing schemes for this kind of problems, i.e., FISTA combined with an adaptive restart strategy.

For future work, it could be interesting to design subgradient algorithms for composite optimization involving a non-smooth term of the form $x\mapsto|Ax|_{1}$ . In this case, a challenging point consists in finding strategies for computing $\partial^{-}f$ (or a suitable approximation) that could be practical for high-dimensional settings.

Acknowledgments

This paper is dedicated to the beloved memory of Prof. Piero Colli Franzone. A.S. acknowledges partial support from INdAM-GNAMPA. A.S. wants to thank two anonymous Referees for the helpful comments that contributed to improve the quality of the paper.

Bibliography

[1] H. Attouch, J. Peypouquet, P. Redont. Fast convex optimization via inertial dynamics with Hessian driven damping. Journal of Differential Equations, 261:5734–5783, 2016. doi: 10.1016/j.jde.2016.08.020
[2] H. Attouch, Z. Chbani, J. Peypouquet, P. Redont. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Math. Program., 168:123–175, 2018. doi: 10.1007/s10107-016-0992-8
[3] D. Bertsekas. Convex Optimization Algorithms. Athena Scientific, Nashua, 2015.
[4] A. Beck, M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2:183–202, 2009. doi: 10.1137/080716542
[5] J. Bolte, T.P. Nguyen, J. Peypouquet, B.W. Suter. From error bounds to the complexity of first-order descent methods for convex functions. Math. Program., 165:471–507, 2017. doi: 10.1007/s10107-016-1091-6
[6] E. Candès, J.K. Romberg, T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59:1207–1223, 2008. doi: 10.1002/cpa.20124
[7] P.L. Combettes, V.R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Model. Sim., 4(4):1168–1200, 2005. doi: 10.1137/050626090
[8] D. Davis, D. Drusvyatskiy, K.J. MacPhee, C. Paquette. Subgradient Methods for Sharp Weakly Convex Functions. J. Optim. Theory Appl., 179:962–982, 2018. doi: 10.1007/s10957-018-1372-8
