First-order optimization algorithms via inertial systems with Hessian driven damping

Hedy Attouch; Zaki Chbani; Jalal Fadili; Hassan Riahi

arXiv:1907.10536·math.OC·November 9, 2020·Math. Program.


TL;DR

This paper introduces new first-order optimization algorithms inspired by inertial systems with Hessian-driven damping, achieving rapid convergence and extending to non-smooth convex functions with acceleration techniques.

Contribution

The paper develops novel first-order algorithms based on inertial dynamics with Hessian-driven damping, extending them to non-smooth functions and incorporating acceleration via time scale factors.

Findings

01. The algorithms exhibit rapid convergence of the gradients towards zero.

02. The methods extend to non-smooth convex functions via the Moreau envelope.

03. Numerical results support the theoretical convergence claims.

Abstract

In a Hilbert space setting, for convex optimization, we analyze the convergence rate of a class of first-order algorithms involving inertial features. They can be interpreted as discrete time versions of inertial dynamics involving both viscous and Hessian-driven dampings. The geometrical damping driven by the Hessian intervenes in the dynamics in the form $\nabla^{2}f(x(t))\dot{x}(t)$. By treating this term as the time derivative of $\nabla f(x(t))$, this gives, in discretized form, first-order algorithms in time and space. In addition to the convergence properties attached to Nesterov-type accelerated gradient methods, the algorithms thus obtained are new and show a rapid convergence towards zero of the gradients. On the basis of a regularization technique using the Moreau envelope, we extend these methods to non-smooth convex functions with extended real values. The…


Equations

$$(\mathrm{H})\quad \begin{cases} \mathcal{H} \text{ is a real Hilbert space};\\ f:\mathcal{H}\to\mathbb{R} \text{ is a convex function of class } \mathcal{C}^{2},\ S:=\operatorname{argmin}_{\mathcal{H}}f\neq\emptyset;\\ \gamma,\beta,b:[t_{0},+\infty[\to\mathbb{R}^{+} \text{ are non-negative continuous functions},\ t_{0}>0. \end{cases}$$

$$\ddot{x}(t)+\gamma(t)\dot{x}(t)+\beta(t)\nabla^{2}f(x(t))\dot{x}(t)+b(t)\nabla f(x(t))=0,$$

$$\begin{cases} y_{k}=x_{k}+\alpha_{k}(x_{k}-x_{k-1})-\beta_{k}\left(\nabla f(x_{k})-\nabla f(x_{k-1})\right)\\ x_{k+1}=T(y_{k}), \end{cases}$$

$$(\text{DIN-AVD})_{\alpha,\beta,b}\qquad \ddot{x}(t)+\frac{\alpha}{t}\dot{x}(t)+\beta(t)\nabla^{2}f(x(t))\dot{x}(t)+b(t)\nabla f(x(t))=0.$$

$$(\mathrm{DIN})_{2\sqrt{\mu},\beta}\qquad \ddot{x}(t)+2\sqrt{\mu}\,\dot{x}(t)+\beta\nabla^{2}f(x(t))\dot{x}(t)+\nabla f(x(t))=0,$$

$$(\mathrm{HBF})\qquad \ddot{x}(t)+\gamma\dot{x}(t)+\nabla f(x(t))=0,$$

$$(\mathrm{DIN})_{\gamma,\beta}\qquad \ddot{x}(t)+\gamma\dot{x}(t)+\beta\nabla^{2}f(x(t))\dot{x}(t)+\nabla f(x(t))=0,$$

$$\ddot{x}(t)+\Gamma\dot{x}(t)+\nabla f(x(t))=0,$$

$$(\mathrm{AVD})_{\alpha}\qquad \ddot{x}(t)+\frac{\alpha}{t}\dot{x}(t)+\nabla f(x(t))=0,$$

$$(\text{DIN-AVD})_{\alpha,\beta}\qquad \ddot{x}(t)+\frac{\alpha}{t}\dot{x}(t)+\beta\nabla^{2}f(x(t))\dot{x}(t)+\nabla f(x(t))=0,$$

$$\begin{cases}\dot{x}(t)+\beta\nabla f(x(t))-\left(\frac{1}{\beta}-\frac{\alpha}{t}\right)x(t)+\frac{1}{\beta}y(t)=0;\\ \dot{y}(t)-\left(\frac{1}{\beta}-\frac{\alpha}{t}+\frac{\alpha\beta}{t^{2}}\right)x(t)+\frac{1}{\beta}y(t)=0.\end{cases}$$

$$\begin{cases} y_{k}=x_{k}+\left(1-\frac{\alpha}{k}\right)(x_{k}-x_{k-1})-\beta\sqrt{s}\left(\nabla f(x_{k})-\nabla f(x_{k-1})\right)-\frac{\beta\sqrt{s}}{k}\nabla f(x_{k-1})\\ x_{k+1}=y_{k}-s\nabla f(y_{k}). \end{cases}$$

$$(i)\quad f(x_{k})-\min_{\mathcal{H}}f=\mathcal{O}\left(\frac{1}{k^{2}}\right)\ \text{ as } k\to+\infty;$$

$$(ii)\quad \sum_{k}k^{2}\|\nabla f(y_{k})\|^{2}<+\infty\ \text{ and }\ \sum_{k}k^{2}\|\nabla f(x_{k})\|^{2}<+\infty.$$

$$x_{k+1}=x_{k}+\frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}(x_{k}-x_{k-1})-\frac{\beta\sqrt{s}}{1+\sqrt{\mu s}}\left(\nabla f(x_{k})-\nabla f(x_{k-1})\right)-\frac{s}{1+\sqrt{\mu s}}\nabla f(x_{k}).$$

$$f(x_{k})-\min_{\mathcal{H}}f=\mathcal{O}\left(q^{k}\right)\ \text{ and }\ \|x_{k}-x^{\star}\|=\mathcal{O}\left(q^{k/2}\right)\ \text{ as } k\to+\infty,$$

$$(\text{DIN-AVD})_{\alpha,\beta,b}\qquad \ddot{x}(t)+\frac{\alpha}{t}\dot{x}(t)+\beta(t)\nabla^{2}f(x(t))\dot{x}(t)+b(t)\nabla f(x(t))=0.$$

$$w(t):=b(t)-\dot{\beta}(t)-\frac{\beta(t)}{t}\ \text{ and }\ \delta(t):=t^{2}w(t).$$

$$(\mathcal{G}_{2})\quad b(t)>\dot{\beta}(t)+\frac{\beta(t)}{t};$$

$$(\mathcal{G}_{3})\quad t\dot{w}(t)\leq(\alpha-3)w(t).$$

$$(i)\quad f(x(t))-\min_{\mathcal{H}}f=\mathcal{O}\left(\frac{1}{t^{2}w(t)}\right)\ \text{ as } t\to+\infty;$$

$$(ii)\quad \int_{t_{0}}^{+\infty}t^{2}\beta(t)w(t)\|\nabla f(x(t))\|^{2}\,dt<+\infty;$$

$$(iii)\quad \int_{t_{0}}^{+\infty}t\Big((\alpha-3)w(t)-t\dot{w}(t)\Big)\left(f(x(t))-\min_{\mathcal{H}}f\right)dt<+\infty.$$

$$E(t):=\delta(t)\left(f(x(t))-f(x^{\star})\right)+\frac{1}{2}\|v(t)\|^{2},$$

$$\frac{d}{dt}E(t)=\dot{\delta}(t)\left(f(x(t))-f(x^{\star})\right)+\delta(t)\langle\nabla f(x(t)),\,\dot{x}(t)\rangle+\langle v(t),\,\dot{v}(t)\rangle.$$

$$\begin{array}{lll}\dot{v}(t)&=&\alpha\dot{x}(t)+\beta(t)\nabla f(x(t))+t\big[\ddot{x}(t)+\dot{\beta}(t)\nabla f(x(t))+\beta(t)\nabla^{2}f(x(t))\dot{x}(t)\big]\\ &=&\alpha\dot{x}(t)+\beta(t)\nabla f(x(t))+t\big[-\frac{\alpha}{t}\dot{x}(t)+(\dot{\beta}(t)-b(t))\nabla f(x(t))\big]\\ &=&t\big[\dot{\beta}(t)+\dfrac{\beta(t)}{t}-b(t)\big]\nabla f(x(t)).\end{array}$$

$$\begin{array}{lll}\langle v(t),\,\dot{v}(t)\rangle&=&(\alpha-1)t\Big(\dot{\beta}(t)+\dfrac{\beta(t)}{t}-b(t)\Big)\langle\nabla f(x(t)),\,x(t)-x^{\star}\rangle\\ &&+\,t^{2}\Big(\dot{\beta}(t)+\dfrac{\beta(t)}{t}-b(t)\Big)\langle\nabla f(x(t)),\,\dot{x}(t)\rangle\\ &&+\,t^{2}\beta(t)\Big(\dot{\beta}(t)+\dfrac{\beta(t)}{t}-b(t)\Big)\left\|\nabla f(x(t))\right\|^{2}.\end{array}$$

$$\begin{array}{lll}\dfrac{d}{dt}E(t)&=&\dot{\delta}(t)\left(f(x(t))-f(x^{\star})\right)+\dfrac{(\alpha-1)}{t}\delta(t)\langle\nabla f(x(t)),\,x^{\star}-x(t)\rangle\\ &&-\,\beta(t)\delta(t)\left\|\nabla f(x(t))\right\|^{2}.\end{array}$$

$$f(x^{\star})-f(x(t))\geq\langle\nabla f(x(t)),\,x^{\star}-x(t)\rangle,$$

$$\dfrac{d}{dt}E(t)+\beta(t)\delta(t)\left\|\nabla f(x(t))\right\|^{2}+\Big[\frac{(\alpha-1)}{t}\delta(t)-\dot{\delta}(t)\Big]\left(f(x(t))-f(x^{\star})\right)\leq 0.$$

$$\frac{(\alpha-1)}{t}\delta(t)-\dot{\delta}(t)=t\Big((\alpha-3)w(t)-t\dot{w}(t)\Big).$$

$$\frac{(\alpha-1)}{t}\delta(t)-\dot{\delta}(t)\geq 0,$$


Full text

H. Attouch: IMAG, Univ. Montpellier, CNRS, Montpellier, France

Z. Chbani: Cadi Ayyad Univ., Faculty of Sciences Semlalia, Mathematics, 40000 Marrakech, Morocco

J. Fadili: GREYC CNRS UMR 6072, Ecole Nationale Supérieure d’Ingénieurs de Caen, France

H. Riahi: Cadi Ayyad Univ., Faculty of Sciences Semlalia, Mathematics, 40000 Marrakech, Morocco


(July 19, 2019)

Abstract

In a Hilbert space setting, for convex optimization, we analyze the convergence rate of a class of first-order algorithms involving inertial features. They can be interpreted as discrete time versions of inertial dynamics involving both viscous and Hessian-driven dampings. The geometrical damping driven by the Hessian intervenes in the dynamics in the form $\nabla^{2}f(x(t))\dot{x}(t)$ . By treating this term as the time derivative of $\nabla f(x(t))$ , this gives, in discretized form, first-order algorithms in time and space. In addition to the convergence properties attached to Nesterov-type accelerated gradient methods, the algorithms thus obtained are new and show a rapid convergence towards zero of the gradients. On the basis of a regularization technique using the Moreau envelope, we extend these methods to non-smooth convex functions with extended real values. The introduction of time scale factors makes it possible to further accelerate these algorithms. We also report numerical results on structured problems to support our theoretical findings.

Keywords:

Hessian driven damping; inertial optimization algorithms; Nesterov accelerated gradient method; Ravine method; time rescaling.

Journal: Math. Program. Ser. A

AMS subject classification

37N40, 46N10, 49M30, 65B99, 65K05, 65K10, 90B50, 90C25.

1 Introduction

Unless otherwise specified, we make the following assumptions throughout the paper:

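$$(\mathrm{H})\quad \begin{cases} \mathcal{H} \text{ is a real Hilbert space};\\ f:\mathcal{H}\to\mathbb{R} \text{ is a convex function of class } \mathcal{C}^{2},\ S:=\operatorname{argmin}_{\mathcal{H}}f\neq\emptyset;\\ \gamma,\beta,b:[t_{0},+\infty[\to\mathbb{R}^{+} \text{ are non-negative continuous functions},\ t_{0}>0. \end{cases}$$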

As a guide in our study, we will rely on the asymptotic behavior, when $t\to+\infty$ , of the trajectories of the inertial system with Hessian-driven damping

$$\ddot{x}(t)+\gamma(t)\dot{x}(t)+\beta(t)\nabla^{2}f(x(t))\dot{x}(t)+b(t)\nabla f(x(t))=0,$$

$\gamma(t)$ and $\beta(t)$ are damping parameters, and $b(t)$ is a time scale parameter.

The time discretization of this system will provide a rich family of first-order methods for minimizing $f$ . At first glance, the presence of the Hessian may seem to entail numerical difficulties. However, this is not the case as the Hessian intervenes in the above ODE in the form $\nabla^{2}f(x(t))\dot{x}(t)$ , which is nothing but the derivative w.r.t. time of $\nabla f(x(t))$ . This explains why the time discretization of this dynamic provides first-order algorithms. Thus, the Nesterov extrapolation scheme Nest1 ; Nest4 is modified by the introduction of the difference of the gradients at consecutive iterates. This gives algorithms of the form

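$$\begin{cases} y_{k}=x_{k}+\alpha_{k}(x_{k}-x_{k-1})-\beta_{k}\left(\nabla f(x_{k})-\nabla f(x_{k-1})\right)\\ x_{k+1}=T(y_{k}), \end{cases}$$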

where $T$ , to be specified later, is an operator involving the gradient or the proximal operator of $f$ .
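The observation that the Hessian term only requires gradients can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes the quadratic $f(x_{1},x_{2})=\frac{1}{2}(x_{1}^{2}+1000x_{2}^{2})$, for which the gradient difference equals the Hessian-vector product exactly, and the helper names `grad_f` and `hess_vec` are hypothetical.

```python
# Illustrative check (assumed test function, hypothetical helper names):
# for f(x1, x2) = 0.5 * (x1**2 + 1000 * x2**2), the Hessian applied to the
# displacement x_k - x_{k-1} equals the difference of gradients.

LAMBDAS = (1.0, 1000.0)  # diagonal entries of the constant Hessian


def grad_f(x):
    """Gradient of the assumed quadratic."""
    return [lam * xi for lam, xi in zip(LAMBDAS, x)]


def hess_vec(d):
    """Hessian-vector product for the assumed quadratic."""
    return [lam * di for lam, di in zip(LAMBDAS, d)]


x_prev = [1.0, 1.0]
x_curr = [0.9, 0.5]

step = [c - p for c, p in zip(x_curr, x_prev)]
second_order = hess_vec(step)  # uses the Hessian explicitly
first_order = [gc - gp for gc, gp in zip(grad_f(x_curr), grad_f(x_prev))]  # gradients only

max_gap = max(abs(a - b) for a, b in zip(second_order, first_order))
print(max_gap)
```

For a non-quadratic $f$ the two quantities agree only up to higher-order terms in the step, which is the sense in which the discretization replaces $\nabla^{2}f(x(t))\dot{x}(t)$ by a difference of gradients.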

Coming back to the continuous dynamic, we will pay particular attention to the following two cases, specifically adapted to the properties of $f$ :

$\bullet$ For a general convex function $f$, taking $\gamma(t)=\frac{\alpha}{t}$ gives

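$$(\text{DIN-AVD})_{\alpha,\beta,b}\qquad \ddot{x}(t)+\frac{\alpha}{t}\dot{x}(t)+\beta(t)\nabla^{2}f(x(t))\dot{x}(t)+b(t)\nabla f(x(t))=0.$$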

In the case $\beta\equiv 0$, $\alpha=3$, $b(t)\equiv 1$, it can be interpreted as a continuous version of the Nesterov accelerated gradient method SBC. Accordingly, in this case we will obtain ${\mathcal{O}}\left({t^{-2}}\right)$ convergence rates for the objective values.

$\bullet$ For a $\mu$-strongly convex function $f$, we will rely on the autonomous inertial system with Hessian driven damping

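$$(\mathrm{DIN})_{2\sqrt{\mu},\beta}\qquad \ddot{x}(t)+2\sqrt{\mu}\,\dot{x}(t)+\beta\nabla^{2}f(x(t))\dot{x}(t)+\nabla f(x(t))=0,$$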

and show an exponential (linear) convergence rate for both objective values and gradients.

For an appropriate setting of the parameters, the time discretization of these dynamics provides first-order algorithms with fast convergence properties. Notably, we will show a rapid convergence towards zero of the gradients.

1.1 A historical perspective

B. Polyak initiated the use of inertial dynamics to accelerate the gradient method in optimization. In Pol ; Polyak2 , based on the inertial system with a fixed viscous damping coefficient $\gamma>0$

$$(\mathrm{HBF})\qquad \ddot{x}(t)+\gamma\dot{x}(t)+\nabla f(x(t))=0,$$

he introduced the Heavy Ball with Friction method. For a strongly convex function $f$, (HBF) provides convergence at an exponential rate of $f(x(t))$ to $\min_{{\mathcal{H}}}f$. For general convex functions, the asymptotic convergence rate of (HBF) is ${\mathcal{O}}(\frac{1}{t})$ in the worst case, which is no better than steepest descent. A decisive step to improve (HBF) was taken by Alvarez-Attouch-Bolte-Redont AABR by introducing the Hessian-driven damping term $\beta\nabla^{2}f(x(t))\dot{x}(t)$, that is (DIN)$_{0,\beta}$. The next important step was accomplished by Su-Boyd-Candès SBC with the introduction of a vanishing viscous damping coefficient $\gamma(t)=\frac{\alpha}{t}$, that is (AVD)$_{\alpha}$ (see Section 1.1.2). The system (DIN-AVD)$_{\alpha,\beta,1}$ (see Section 2) has emerged as a combination of (DIN)$_{0,\beta}$ and (AVD)$_{\alpha}$. Let us review some basic facts concerning these systems.

1.1.1 The (DIN)γ,β dynamic

The inertial system

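$$(\mathrm{DIN})_{\gamma,\beta}\qquad \ddot{x}(t)+\gamma\dot{x}(t)+\beta\nabla^{2}f(x(t))\dot{x}(t)+\nabla f(x(t))=0,$$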

was introduced in AABR. In line with (HBF), it contains a fixed positive friction coefficient $\gamma$. The introduction of the Hessian-driven damping makes it possible to neutralize the transversal oscillations likely to occur with (HBF), as observed in AABR in the case of the Rosenbrock function. The need to take a geometric damping adapted to $f$ had already been observed by Alvarez Alv0, who considered

$$\ddot{x}(t)+\Gamma\dot{x}(t)+\nabla f(x(t))=0,$$

where $\Gamma:{\mathcal{H}}\to{\mathcal{H}}$ is a linear positive anisotropic operator. But still this damping operator is fixed. For a general convex function, the Hessian-driven damping in (DIN)γ,β performs a similar operation in a closed-loop adaptive way. The terminology (DIN) stands shortly for Dynamical Inertial Newton. It refers to the natural link between this dynamic and the continuous Newton method.

1.1.2 The (AVD)α dynamic

The inertial system

$$(\mathrm{AVD})_{\alpha}\qquad \ddot{x}(t)+\frac{\alpha}{t}\dot{x}(t)+\nabla f(x(t))=0,$$

was introduced in the context of convex optimization in SBC. For general convex functions it provides a continuous version of the accelerated gradient method of Nesterov. For $\alpha\geq 3$, each trajectory $x(\cdot)$ of (AVD)$_{\alpha}$ satisfies the asymptotic rate of convergence of the values $f(x(t))-\inf_{{\mathcal{H}}}f={\mathcal{O}}\left(1/t^{2}\right)$. As a specific feature, the viscous damping coefficient $\frac{\alpha}{t}$ vanishes (tends to zero) as time $t$ goes to infinity, hence the terminology. The convergence properties of the dynamic (AVD)$_{\alpha}$ have been the subject of many recent studies, see AAD ; AC1 ; AC2 ; AC2R-EECT ; ACPR ; ACR-subcrit ; AP ; AD ; AD17 ; May ; SBC. They helped to explain why $\frac{\alpha}{t}$ is a wise choice of the damping coefficient.

In CEG1 , the authors showed that a vanishing damping coefficient $\gamma(\cdot)$ dissipates the energy, and hence makes the dynamic interesting for optimization, as long as $\int_{t_{0}}^{+\infty}\gamma(t)dt=+\infty$ . The damping coefficient can go to zero asymptotically but not too fast. The smallest which is admissible is of order $\frac{1}{t}$ . It enforces the inertial effect with respect to the friction effect.

The tuning of the parameter $\alpha$ in front of $\frac{1}{t}$ comes from the Lyapunov analysis and the optimality of the convergence rates obtained. The case $\alpha=3$, which corresponds to Nesterov’s historical algorithm, is critical. In the case $\alpha=3$, the question of the convergence of the trajectories remains an open problem (except in one dimension where convergence holds ACR-subcrit). As a remarkable property, for $\alpha>3$, it has been shown by Attouch-Chbani-Peypouquet-Redont ACPR that each trajectory converges weakly to a minimizer. The corresponding algorithmic result has been obtained by Chambolle-Dossal CD. For $\alpha>3$, it is shown in AP and May that the asymptotic convergence rate of the values is actually $o(1/t^{2})$. The subcritical case $\alpha\leq 3$ has been examined by Apidopoulos-Aujol-Dossal AAD and Attouch-Chbani-Riahi ACR-subcrit, with the convergence rate of the objective values ${\mathcal{O}}\left({t^{-\frac{2\alpha}{3}}}\right)$. These rates are optimal, that is, they can be reached, or approached arbitrarily closely:

$\bullet$ $\alpha\geq 3$: the optimal rate ${\mathcal{O}}\left({t^{-2}}\right)$ is achieved by taking $f(x)=\|x\|^{r}$ with $r\to+\infty$ ($f$ becomes very flat around its minimum), see ACPR.

$\bullet$ $\alpha<3$ : the optimal rate $\displaystyle{{\mathcal{O}}\left({t^{-\frac{2\alpha}{3}}}\right)}$ is achieved by taking $f(x)=\|x\|$ , see AAD .

The inertial system with a general damping coefficient $\gamma(\cdot)$ was recently studied by Attouch-Cabot in AC1 ; AC2 , and Attouch-Cabot-Chbani-Riahi in AC2R-EECT .

1.1.3 The (DIN-AVD)α,β dynamic

The inertial system

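$$(\text{DIN-AVD})_{\alpha,\beta}\qquad \ddot{x}(t)+\frac{\alpha}{t}\dot{x}(t)+\beta\nabla^{2}f(x(t))\dot{x}(t)+\nabla f(x(t))=0,$$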

was introduced in APR. It combines the two types of damping considered above. Its formulation looks at first glance more complicated than (AVD)$_{\alpha}$. In APR2, Attouch-Peypouquet-Redont showed that (DIN-AVD)$_{\alpha,\beta}$ is equivalent to the first-order system in time and space

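$$\begin{cases}\dot{x}(t)+\beta\nabla f(x(t))-\left(\frac{1}{\beta}-\frac{\alpha}{t}\right)x(t)+\frac{1}{\beta}y(t)=0;\\ \dot{y}(t)-\left(\frac{1}{\beta}-\frac{\alpha}{t}+\frac{\alpha\beta}{t^{2}}\right)x(t)+\frac{1}{\beta}y(t)=0.\end{cases}$$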

This provides a natural extension to $f:{\mathcal{H}}\to{\mathbb{R}}\cup\{+\infty\}$ proper lower semicontinuous and convex, just replacing the gradient by the subdifferential.

To get better insight, let us compare the two dynamics (AVD)α and (DIN-AVD)α,β on a simple quadratic minimization problem, in which case the trajectories can be computed in closed form as explained in Appendix A.3. Take ${\mathcal{H}}={\mathbb{R}}^{2}$ and $f(x_{1},x_{2})=\frac{1}{2}(x_{1}^{2}+1000x_{2}^{2})$ , which is ill-conditioned. We take parameters $\alpha=3.1$ , $\beta=1$ , so as to obey the condition $\alpha>3$ . Starting with initial conditions: $(x_{1}(1),x_{2}(1))=(1,1)$ , $(\dot{x}_{1}(1),\dot{x}_{2}(1))=(0,0)$ , we have the trajectories displayed in Figure 1. This illustrates the typical situation of an ill-conditioned minimization problem, where the wild oscillations of (AVD)α are neutralized by the Hessian damping in (DIN-AVD)α,β (see Appendix A.3 for further details).
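The comparison described above can be reproduced approximately with a short numerical integration. The sketch below is an illustration under stated assumptions: it uses a hand-written classical Runge-Kutta (RK4) scheme instead of the closed-form solution of Appendix A.3, integrates only on $[1,5]$ with a step size chosen by hand, and counts sign changes of the stiff coordinate $x_{2}$ as a crude proxy for oscillations. The function names are hypothetical.

```python
# Sketch: integrate x'' + (alpha/t) x' + beta*H x' + H x = 0 with H = diag(1, 1000)
# using classical RK4 (an assumption; the paper uses closed-form solutions).

def axpy(a, x, y):
    """Return y + a * x, componentwise."""
    return [yi + a * xi for xi, yi in zip(x, y)]


def simulate(alpha, beta, lam=(1.0, 1000.0), t0=1.0, t1=5.0, h=1e-3):
    def acc(t, x, v):
        # Acceleration coordinate by coordinate (the Hessian is diagonal).
        return [-(alpha / t + beta * l) * vi - l * xi
                for l, xi, vi in zip(lam, x, v)]

    x, v = [1.0, 1.0], [0.0, 0.0]  # initial conditions from the text
    traj = [x]
    for i in range(round((t1 - t0) / h)):
        t = t0 + i * h
        k1x, k1v = v, acc(t, x, v)
        xa, va = axpy(h / 2, k1x, x), axpy(h / 2, k1v, v)
        k2x, k2v = va, acc(t + h / 2, xa, va)
        xb, vb = axpy(h / 2, k2x, x), axpy(h / 2, k2v, v)
        k3x, k3v = vb, acc(t + h / 2, xb, vb)
        xc, vc = axpy(h, k3x, x), axpy(h, k3v, v)
        k4x, k4v = vc, acc(t + h, xc, vc)
        x = [xi + h / 6 * (a + 2 * b + 2 * c + d)
             for xi, a, b, c, d in zip(x, k1x, k2x, k3x, k4x)]
        v = [vi + h / 6 * (a + 2 * b + 2 * c + d)
             for vi, a, b, c, d in zip(v, k1v, k2v, k3v, k4v)]
        traj.append(x)
    return traj


def sign_changes(values):
    """Number of strict sign changes between consecutive samples."""
    return sum(1 for a, b in zip(values, values[1:]) if a * b < 0)


avd = simulate(alpha=3.1, beta=0.0)      # (AVD): no Hessian damping
din_avd = simulate(alpha=3.1, beta=1.0)  # (DIN-AVD): Hessian damping
osc_avd = sign_changes([p[1] for p in avd])
osc_din = sign_changes([p[1] for p in din_avd])
print(osc_avd, osc_din)
```

With Hessian damping the stiff coordinate behaves like an overdamped oscillator and does not cross zero on this interval, whereas without it the coordinate oscillates many times.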

1.2 Main algorithmic results

Let us describe our main convergence rates for the gradient type algorithms. Corresponding results for the proximal algorithms are also obtained.

General convex function

Let $f:{\mathcal{H}}\to{\mathbb{R}}$ be a convex function whose gradient is $L$ -Lipschitz continuous. Based on the discretization of (DIN-AVD) ${}_{\alpha,\beta,1+\frac{\beta}{t}}$ , we consider

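$$\begin{cases} y_{k}=x_{k}+\left(1-\frac{\alpha}{k}\right)(x_{k}-x_{k-1})-\beta\sqrt{s}\left(\nabla f(x_{k})-\nabla f(x_{k-1})\right)-\frac{\beta\sqrt{s}}{k}\nabla f(x_{k-1})\\ x_{k+1}=y_{k}-s\nabla f(y_{k}). \end{cases}$$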

Suppose that $\alpha\geq 3$ , $0<\beta<2\sqrt{s}$ , $sL\leq 1$ . In Theorem 3.3, we show that

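$$(i)\quad f(x_{k})-\min_{\mathcal{H}}f=\mathcal{O}\left(\frac{1}{k^{2}}\right)\ \text{ as } k\to+\infty;$$

$$(ii)\quad \sum_{k}k^{2}\|\nabla f(y_{k})\|^{2}<+\infty\ \text{ and }\ \sum_{k}k^{2}\|\nabla f(x_{k})\|^{2}<+\infty.$$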
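A minimal sketch of the inertial gradient scheme for general convex functions follows. It is an illustration, not the authors' code. It assumes the extrapolation coefficient $1-\frac{\alpha}{k}$, a gradient-correction coefficient $\beta\sqrt{s}$ (consistent with the stated condition $0<\beta<2\sqrt{s}$), the start $x_{0}=x_{1}$, and the ill-conditioned quadratic $f(x_{1},x_{2})=\frac{1}{2}(x_{1}^{2}+1000x_{2}^{2})$ as test function. The name `inertial_gradient` is hypothetical.

```python
import math

L = 1000.0            # Lipschitz constant of the gradient of the assumed quadratic
LAMBDAS = (1.0, L)


def f(x):
    return 0.5 * sum(lam * xi * xi for lam, xi in zip(LAMBDAS, x))


def grad_f(x):
    return [lam * xi for lam, xi in zip(LAMBDAS, x)]


def inertial_gradient(x0, alpha=3.0, s=1.0 / L, beta=None, iters=500):
    """Inertial gradient iteration with a gradient-difference correction."""
    if beta is None:
        beta = math.sqrt(s)          # satisfies 0 < beta < 2 * sqrt(s)
    c = beta * math.sqrt(s)          # assumed correction coefficient beta * sqrt(s)
    x_prev, x = list(x0), list(x0)   # assumed start: x_0 = x_1
    values = [f(x)]
    for k in range(1, iters + 1):
        g, g_prev = grad_f(x), grad_f(x_prev)
        # Extrapolation step with the difference of consecutive gradients.
        y = [xi + (1.0 - alpha / k) * (xi - xp) - c * (gi - gp) - (c / k) * gp
             for xi, xp, gi, gp in zip(x, x_prev, g, g_prev)]
        # Gradient step at the extrapolated point.
        gy = grad_f(y)
        x_prev, x = x, [yi - s * gyi for yi, gyi in zip(y, gy)]
        values.append(f(x))
    return x, values


x_final, values = inertial_gradient([1.0, 1.0])
print(values[0], values[-1])
```

The step size $s=1/L$ saturates the condition $sL\leq 1$. Only gradients are evaluated, two per iteration if $\nabla f(x_{k})$ is cached from the previous pass.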

Strongly convex function

When $f:{\mathcal{H}}\to{\mathbb{R}}$ is $\mu$-strongly convex for some $\mu>0$, our analysis relies on the autonomous dynamic ${\rm(DIN)_{\gamma,\beta}}$ with $\gamma=2\sqrt{\mu}$. Based on its time discretization, we obtain linear convergence results for the values (hence the trajectory) and the gradient terms. Explicit discretization gives the inertial gradient algorithm

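$$x_{k+1}=x_{k}+\frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}(x_{k}-x_{k-1})-\frac{\beta\sqrt{s}}{1+\sqrt{\mu s}}\left(\nabla f(x_{k})-\nabla f(x_{k-1})\right)-\frac{s}{1+\sqrt{\mu s}}\nabla f(x_{k}).$$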

Assuming that $\nabla f$ is $L$ -Lipschitz continuous, $L$ sufficiently small and $\beta\leq\displaystyle{\frac{1}{\sqrt{\mu}}}$ , it is shown in Theorem 5.3 that, with $q=\displaystyle{\frac{1}{1+\frac{1}{2}\sqrt{\mu s}}}$ ( $0<q<1$ )

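$$f(x_{k})-\min_{\mathcal{H}}f=\mathcal{O}\left(q^{k}\right)\ \text{ and }\ \|x_{k}-x^{\star}\|=\mathcal{O}\left(q^{k/2}\right)\ \text{ as } k\to+\infty,$$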

Moreover, the gradients converge exponentially fast to zero.
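The strongly convex scheme can be sketched in the same way. The code below is illustrative only. It assumes the coefficients obtained by discretizing ${\rm(DIN)_{\gamma,\beta}}$ with $\gamma=2\sqrt{\mu}$ and time step $\sqrt{s}$, namely $\frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}$, $\frac{\beta\sqrt{s}}{1+\sqrt{\mu s}}$ and $\frac{s}{1+\sqrt{\mu s}}$. The test function ($\mu=1$, $L=4$) and the values $s=0.01$, $\beta=0.05$ are hand-picked for the illustration and are not derived from the conditions of Theorem 5.3.

```python
import math

MU, L = 1.0, 4.0      # assumed strong-convexity and Lipschitz constants
LAMBDAS = (MU, L)     # f(x) = 0.5 * (MU * x1**2 + L * x2**2)


def f(x):
    return 0.5 * sum(lam * xi * xi for lam, xi in zip(LAMBDAS, x))


def grad_f(x):
    return [lam * xi for lam, xi in zip(LAMBDAS, x)]


def inertial_gradient_sc(x0, s=0.01, beta=0.05, iters=300):
    """Inertial gradient iteration for a strongly convex objective (sketch)."""
    r = math.sqrt(MU * s)
    a = (1.0 - r) / (1.0 + r)             # momentum coefficient
    c = beta * math.sqrt(s) / (1.0 + r)   # gradient-difference coefficient
    d = s / (1.0 + r)                     # gradient step coefficient
    x_prev, x = list(x0), list(x0)        # assumed start: x_0 = x_1
    values = [f(x)]
    for _ in range(iters):
        g, g_prev = grad_f(x), grad_f(x_prev)
        x_next = [xi + a * (xi - xp) - c * (gi - gp) - d * gi
                  for xi, xp, gi, gp in zip(x, x_prev, g, g_prev)]
        x_prev, x = x, x_next
        values.append(f(x))
    return x, values


x_final, values = inertial_gradient_sc([1.0, 1.0])
q = 1.0 / (1.0 + 0.5 * math.sqrt(MU * 0.01))  # rate factor q from the text, for s = 0.01
print(values[-1], q ** 300)
```

Printing $q^{k}$ next to the final objective value gives a rough sense of how the observed decay compares with the stated $\mathcal{O}(q^{k})$ rate on this particular example.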

1.3 Contents

The paper is organized as follows. Sections 2 and 3 deal with the case of general convex functions, in the continuous and algorithmic settings respectively. We improve the Nesterov convergence rates by showing in addition fast convergence of the gradients. Sections 4 and 5 deal with the same questions in the case of strongly convex functions, in which case linear convergence results are obtained. Section 6 is devoted to numerical illustrations. We conclude with some perspectives.

2 Inertial dynamics for general convex functions

Our analysis deals with the inertial system with Hessian-driven damping

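$$(\text{DIN-AVD})_{\alpha,\beta,b}\qquad \ddot{x}(t)+\frac{\alpha}{t}\dot{x}(t)+\beta(t)\nabla^{2}f(x(t))\dot{x}(t)+b(t)\nabla f(x(t))=0.$$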

2.1 Convergence rates

We start by stating a fairly general theorem on the convergence rates and integrability properties of (DIN-AVD)α,β,b under appropriate conditions on the parameter functions $\beta(t)$ and $b(t)$ . As we will discuss shortly, it turns out that for some specific choices of the parameters, one can recover most of the related results existing in the literature. The following quantities play a central role in our analysis:

$$w(t):=b(t)-\dot{\beta}(t)-\frac{\beta(t)}{t}\ \text{ and }\ \delta(t):=t^{2}w(t). \tag{1}$$

Theorem 2.1

Consider (DIN-AVD)α,β,b , where ( $\mathrm{H}$ ) holds. Take $\alpha\geq 1$ . Let $x:[t_{0},+\infty[\rightarrow{\mathcal{H}}$ be a solution trajectory of (DIN-AVD)α,β,b . Suppose that the following growth conditions are satisfied:

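$$(\mathcal{G}_{2})\quad b(t)>\dot{\beta}(t)+\frac{\beta(t)}{t};$$

$$(\mathcal{G}_{3})\quad t\dot{w}(t)\leq(\alpha-3)w(t).$$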

Then, $w(t)$ is positive and

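$$(i)\quad f(x(t))-\min_{\mathcal{H}}f=\mathcal{O}\left(\frac{1}{t^{2}w(t)}\right)\ \text{ as } t\to+\infty;$$

$$(ii)\quad \int_{t_{0}}^{+\infty}t^{2}\beta(t)w(t)\|\nabla f(x(t))\|^{2}\,dt<+\infty;$$

$$(iii)\quad \int_{t_{0}}^{+\infty}t\Big((\alpha-3)w(t)-t\dot{w}(t)\Big)\left(f(x(t))-\min_{\mathcal{H}}f\right)dt<+\infty.$$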

Proof

Given $x^{\star}\in\operatorname{argmin}_{{\mathcal{H}}}f$ , define for $t\geq t_{0}$

$$E(t):=\delta(t)\left(f(x(t))-f(x^{\star})\right)+\frac{1}{2}\|v(t)\|^{2}, \tag{2}$$

where $v(t):=(\alpha-1)(x(t)-x^{\star})+t\left(\dot{x}(t)+\beta(t)\nabla f(x(t))\right).$

The function $E(\cdot)$ will serve as a Lyapunov function. Differentiating $E$ gives

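$$\frac{d}{dt}E(t)=\dot{\delta}(t)\left(f(x(t))-f(x^{\star})\right)+\delta(t)\langle\nabla f(x(t)),\,\dot{x}(t)\rangle+\langle v(t),\,\dot{v}(t)\rangle. \tag{3}$$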

Using equation (DIN-AVD)α,β,b , we have

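$$\begin{array}{lll}\dot{v}(t)&=&\alpha\dot{x}(t)+\beta(t)\nabla f(x(t))+t\big[\ddot{x}(t)+\dot{\beta}(t)\nabla f(x(t))+\beta(t)\nabla^{2}f(x(t))\dot{x}(t)\big]\\ &=&\alpha\dot{x}(t)+\beta(t)\nabla f(x(t))+t\big[-\frac{\alpha}{t}\dot{x}(t)+(\dot{\beta}(t)-b(t))\nabla f(x(t))\big]\\ &=&t\big[\dot{\beta}(t)+\dfrac{\beta(t)}{t}-b(t)\big]\nabla f(x(t)).\end{array}$$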

Hence,

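$$\begin{array}{lll}\langle v(t),\,\dot{v}(t)\rangle&=&(\alpha-1)t\Big(\dot{\beta}(t)+\dfrac{\beta(t)}{t}-b(t)\Big)\langle\nabla f(x(t)),\,x(t)-x^{\star}\rangle\\ &&+\,t^{2}\Big(\dot{\beta}(t)+\dfrac{\beta(t)}{t}-b(t)\Big)\langle\nabla f(x(t)),\,\dot{x}(t)\rangle\\ &&+\,t^{2}\beta(t)\Big(\dot{\beta}(t)+\dfrac{\beta(t)}{t}-b(t)\Big)\left\|\nabla f(x(t))\right\|^{2}.\end{array}$$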

Let us go back to (3). According to the choice of $\delta(t)$ , the terms $\langle\nabla f(x(t)),\,\dot{x}(t)\rangle$ cancel, which gives

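$$\begin{array}{lll}\dfrac{d}{dt}E(t)&=&\dot{\delta}(t)\left(f(x(t))-f(x^{\star})\right)+\dfrac{(\alpha-1)}{t}\delta(t)\langle\nabla f(x(t)),\,x^{\star}-x(t)\rangle\\ &&-\,\beta(t)\delta(t)\left\|\nabla f(x(t))\right\|^{2}.\end{array}$$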

Condition $(\mathcal{G}_{2})$ gives $\delta(t)>0$ . Combining this equation with convexity of $f$ ,

$$f(x^{\star})-f(x(t))\geq\langle\nabla f(x(t)),\,x^{\star}-x(t)\rangle,$$

we obtain the inequality

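$$\dfrac{d}{dt}E(t)+\beta(t)\delta(t)\left\|\nabla f(x(t))\right\|^{2}+\Big[\frac{(\alpha-1)}{t}\delta(t)-\dot{\delta}(t)\Big]\left(f(x(t))-f(x^{\star})\right)\leq 0. \tag{4}$$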

Then note that

$$\frac{(\alpha-1)}{t}\delta(t)-\dot{\delta}(t)=t\Big((\alpha-3)w(t)-t\dot{w}(t)\Big).$$

Hence, condition $(\mathcal{G}_{3})$ writes equivalently

$$\frac{(\alpha-1)}{t}\delta(t)-\dot{\delta}(t)\geq 0,$$

which, by (4), gives $\dfrac{d}{dt}E(t)\leq 0$ . Therefore, $E(\cdot)$ is non-increasing, and hence $E(t)\leq E(t_{0})$ . Since all the terms that enter $E(\cdot)$ are nonnegative, we obtain $(i)$ . Then, by integrating (4) we get

[TABLE]

and

[TABLE]

which gives $(ii)$ and $(iii)$ , and completes the proof. ∎
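The monotonicity of the Lyapunov function can be observed numerically on a toy instance. The sketch below is an illustration under assumptions not made in the paper: the one-dimensional objective $f(x)=x^{2}/2$ (so $x^{\star}=0$), constant $\beta$, $b(t)=1+\frac{\beta}{t}$ (hence $w\equiv 1$ and $\delta(t)=t^{2}$), $\alpha=3$, and a hand-written RK4 integrator. It evaluates $E$ along the computed trajectory and checks that it does not increase beyond a small numerical tolerance.

```python
ALPHA, BETA = 3.0, 1.0  # assumed parameters; alpha >= 3 and w(t) = 1 here


def accel(t, x, v):
    """x'' from the ODE with b(t) = 1 + BETA / t and f(x) = x**2 / 2."""
    return -(ALPHA / t + BETA) * v - (1.0 + BETA / t) * x


def energy(t, x, v):
    """E(t) = t^2 * (f(x) - f(x*)) + 0.5 * aux^2, with x* = 0."""
    aux = (ALPHA - 1.0) * x + t * (v + BETA * x)  # plays the role of v(t) in the proof
    return t * t * 0.5 * x * x + 0.5 * aux * aux


def rk4_energies(t0=1.0, t1=10.0, h=1e-3, x=1.0, v=0.0):
    """Integrate with classical RK4 and record E after every step."""
    energies = [energy(t0, x, v)]
    for i in range(round((t1 - t0) / h)):
        t = t0 + i * h
        k1x, k1v = v, accel(t, x, v)
        k2x = v + 0.5 * h * k1v
        k2v = accel(t + 0.5 * h, x + 0.5 * h * k1x, v + 0.5 * h * k1v)
        k3x = v + 0.5 * h * k2v
        k3v = accel(t + 0.5 * h, x + 0.5 * h * k2x, v + 0.5 * h * k2v)
        k4x = v + h * k3v
        k4v = accel(t + h, x + h * k3x, v + h * k3v)
        x += h / 6.0 * (k1x + 2.0 * k2x + 2.0 * k3x + k4x)
        v += h / 6.0 * (k1v + 2.0 * k2v + 2.0 * k3v + k4v)
        energies.append(energy(t + h, x, v))
    return energies


energies = rk4_energies()
max_increase = max(b - a for a, b in zip(energies, energies[1:]))
print(energies[0], energies[-1], max_increase)
```

On this instance the derivative of $E$ reduces to $-(\alpha-2)t\,x(t)^{2}-\beta t^{2}x(t)^{2}$, which is non-positive, so any recorded increase can only come from discretization or rounding error.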

2.2 Particular cases

As anticipated above, by specializing the functions $\beta(t)$ and $b(t)$ , we recover most known results in the literature; see hereafter for each specific case and related literature. For all these cases, we will argue also on the interest of our generalization.

Case 1

The (DIN-AVD)α,β system corresponds to $\beta(t)\equiv\beta$ and $b(t)\equiv 1$ . In this case, $w(t)=1-\frac{\beta}{t}$ . Conditions $(\mathcal{G}_{2})$ and $(\mathcal{G}_{3})$ are satisfied by taking $\alpha>3$ and $t>\frac{\alpha-2}{\alpha-3}\beta$ . Hence, as a consequence of Theorem 2.1, we obtain the following result of Attouch-Peypouquet-Redont APR2 :

Theorem 2.2 (APR2 )

Let $x:[t_{0},+\infty[\rightarrow{\mathcal{H}}$ be a trajectory of the dynamical system (DIN-AVD)α,β . Suppose $\alpha>3$ . Then

[TABLE]

Case 2

The system (DIN-AVD)${}_{\alpha,\beta,1+\frac{\beta}{t}}$, which corresponds to $\beta(t)\equiv\beta$ and $b(t)=1+\frac{\beta}{t}$, was considered in SDJS. Compared to (DIN-AVD)$_{\alpha,\beta}$, it has the additional coefficient $\frac{\beta}{t}$ in front of the gradient term. This vanishing coefficient will facilitate the computational aspects while keeping the structure of the dynamic. Observe that in this case, $w(t)\equiv 1$. Conditions $(\mathcal{G}_{2})$ and $(\mathcal{G}_{3})$ boil down to $\alpha\geq 3$. Hence, as a consequence of Theorem 2.1, we obtain

Theorem 2.3

Let $x:[t_{0},+\infty[\rightarrow{\mathcal{H}}$ be a solution trajectory of the dynamical system (DIN-AVD) ${}_{\alpha,\beta,1+\frac{\beta}{t}}$ . Suppose $\alpha\geq 3$ . Then

[TABLE]

Case 3

The dynamical system (DIN-AVD)α,0,b , which corresponds to $\beta(t)\equiv 0$ , was considered by Attouch-Chbani-Riahi in ACR-rescale . It comes also naturally from the time scaling of (AVD)α . In this case, we have $w(t)=b(t)$ . Condition $(\mathcal{G}_{2})$ is equivalent to $b(t)>0$ . $(\mathcal{G}_{3})$ becomes

$$t\dot{b}(t)\leq(\alpha-3)b(t),$$

which is precisely the condition introduced in (ACR-rescale, , Theorem 8.1). Under this condition, we have the convergence rate

$$f(x(t))-\min_{\mathcal{H}}f=\mathcal{O}\left(\frac{1}{t^{2}b(t)}\right).$$

This makes clear the acceleration effect due to the time scaling. For $b(t)=t^{r}$ , we have $f(x(t))-\min_{{\mathcal{H}}}f={\mathcal{O}}\left({\dfrac{1}{t^{2+r}}}\right)$ , under the assumption $\alpha\geq 3+r$ .
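As a short worked check of the power-law example, using only $w(t)=b(t)$ and the form of $(\mathcal{G}_{3})$ from Theorem 2.1:

```latex
\begin{aligned}
b(t)=t^{r}\ \Longrightarrow\ t\,\dot{b}(t) &= r\,t^{r},\\
t\,\dot{w}(t)\le(\alpha-3)\,w(t) &\iff r\,t^{r}\le(\alpha-3)\,t^{r} \iff \alpha\ge 3+r,\\
\frac{1}{t^{2}\,w(t)} &= \frac{1}{t^{2+r}} .
\end{aligned}
```

So a larger exponent $r$ gives a faster rate, at the price of a larger admissible damping parameter $\alpha$.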

Case 4

Let us illustrate our results in the case $b(t)=ct^{b}$ , $\beta(t)=t^{\beta}$ . We have $w(t)=ct^{b}-(\beta+1)t^{\beta-1},w^{\prime}(t)=cbt^{b-1}-(\beta^{2}-1)t^{\beta-2}.$ The conditions $(\mathcal{G}_{2}),(\mathcal{G}_{3})$ can be written respectively as:

[TABLE]

When $b=\beta-1$ , the conditions (7) are equivalent to $\beta<c-1\;\text{ and }\;\beta\leq\alpha-2,$ which gives the convergence rate $f(x(t))-\min_{{\mathcal{H}}}f={\mathcal{O}}\left(\dfrac{1}{t^{\beta+1}}\right)$ .
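The equivalence stated for $b=\beta-1$ can be verified directly from the expression of $w(t)$ above and the conditions $(\mathcal{G}_{2})$, $(\mathcal{G}_{3})$ of Theorem 2.1. The following is a worked check, not an additional result:

```latex
\begin{aligned}
w(t) &= c\,t^{\beta-1}-(\beta+1)\,t^{\beta-1} = (c-\beta-1)\,t^{\beta-1},\\
(\mathcal{G}_{2}):\quad w(t)>0 &\iff \beta<c-1,\\
t\,\dot{w}(t) &= (\beta-1)\,w(t),\\
(\mathcal{G}_{3}):\quad (\beta-1)\,w(t)\le(\alpha-3)\,w(t) &\iff \beta\le\alpha-2 \quad (\text{since } w(t)>0),\\
\frac{1}{t^{2}\,w(t)} &= \frac{1}{(c-\beta-1)\,t^{\beta+1}} .
\end{aligned}
```

For instance, with $\alpha=5$, $\beta=3$ and $c=5$, both conditions hold ($3<4$ and $3\leq 3$) and the rate is ${\mathcal{O}}\left(1/t^{4}\right)$.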

Let us apply these choices to the quadratic function $f:(x_{1},x_{2})\in{\mathbb{R}}^{2}\mapsto\left(x_{1}+x_{2}\right)^{2}/2$ . $f$ is convex but not strongly so, and $\operatorname{argmin}_{{\mathbb{R}}^{2}}f=\{(x_{1},x_{2})\in{\mathbb{R}}^{2}:x_{2}=-x_{1}\}$ . The closed-form solution of the ODE with this choice of $\beta(t)$ and $b(t)$ is given in Appendix A.3. We choose the values $\alpha=5,\beta=3,b=\beta-1=2$ and $c=5$ in order to satisfy condition (7). The left panel of Figure 2 depicts the convergence profile of the function value, and its right panel the trajectories associated with the system (DIN-AVD)α,β,b for different scenarios of the parameters. Once again, the damping of oscillations due to the presence of the Hessian is observed.

Discussion

Let us first apply the above choices of $(\alpha,\beta(t),b(t))$ for each case to the quadratic function $f:(x_{1},x_{2})\in{\mathbb{R}}^{2}\mapsto\left(x_{1}+x_{2}\right)^{2}/2$ . $f$ is convex but not strongly so, and $\operatorname{argmin}_{{\mathbb{R}}^{2}}f=\{(x_{1},x_{2})\in{\mathbb{R}}^{2}:x_{2}=-x_{1}\}$ . The closed-form solution of (DIN-AVD)α,β,b with each choice of $\beta(t)$ and $b(t)$ is given in Appendix A.3. For all cases, we set $\alpha=5$ . For case 1, we set $\beta=b=1$ . For case 2, we take $\beta=1$ . As for case 3, we set $r=2$ . For case 4, we choose $\beta=3,b=\beta-1=2$ and $c=5$ in order to satisfy condition (7). The left panel of Figure 2 depicts the convergence profile of the function value as well as the predicted convergence rates ${\mathcal{O}}\left({1/t^{2}}\right)$ and ${\mathcal{O}}\left({1/t^{4}}\right)$ (the latter is for cases with time (re)scaling). The right panel of Figure 2 displays the associated trajectories for the different scenarios of the parameters.

The rates one can achieve in our Theorem 2.1 look similar to those in Theorem 2.2 and Theorem 2.3. Thus one may wonder whether our framework allowing for more general variable parameters is necessary. The answer is affirmative for several reasons. First, our framework can be seen as a one-stop shop allowing for a unified analysis with an unprecedented level of generality. It also handles time (re)scaling straightforwardly by appropriately setting the functions $\beta(t)$ and $b(t)$ (see Cases 3 and 4 above). In addition, though these convergence rates appear similar, one has to keep in mind that they are upper bounds. Our detailed example in the quadratic case introduced above (Figure 2) shows that not only are the oscillations reduced by the Hessian damping, but the trajectory and the objective can also be made much less oscillatory thanks to the flexible choice of the parameters allowed by our framework. This is further evidence of the interest of our setting.

3 Inertial algorithms for general convex functions

3.1 Proximal algorithms

3.1.1 Smooth case

Writing the term $\nabla^{2}f(x(t))\dot{x}(t)$ in (DIN-AVD)α,β,b as the time derivative of $\nabla f(x(t))$ , and taking the implicit time discretization of this system, with step size $h>0$ , gives

[TABLE]

Equivalently

[TABLE]

Observe that this requires $f$ to be only of class $\mathcal{C}^{1}$ . Set now $s=h^{2}$ . We obtain the following algorithm with $\beta_{k}$ and $b_{k}$ varying with $k$ :

[TABLE]

Theorem 3.1

Assume that $f:{\mathcal{H}}\rightarrow{\mathbb{R}}$ is a convex $\mathcal{C}^{1}$ function. Suppose that $\alpha\geq 1$ . Set

[TABLE]

and suppose that the following growth conditions are satisfied:

[TABLE]

Then, $\delta_{k}$ is positive and, for any sequence $\left({x_{k}}\right)_{k\in{\mathbb{N}}}$ generated by $\mathrm{(IPAHD)}$

[TABLE]

Before delving into the proof, the following remarks on the choice/growth of the parameters are in order.

Remark 1

We first observe that condition $(\mathcal{G}^{\mathrm{dis}}_{2})$ is nothing but a forward (explicit) discretization of its continuous analogue $(\mathcal{G}_{2})$ . In addition, in view of (1), $(\mathcal{G}_{3})$ equivalently reads

[TABLE]

In turn, (10) and $(\mathcal{G}^{\mathrm{dis}}_{3})$ are explicit discretizations of (1) and $(\mathcal{G}_{3})$ respectively.

Remark 2

The convergence rate on the objective values in Theorem 3.1(i) is ${\mathcal{O}}\left(1/((k+1)k)\right)$ with the proviso that

[TABLE]

which in turn implies $(\mathcal{G}^{\mathrm{dis}}_{2})$. If, in addition to (11), we also have $\inf_{k}\beta_{k}>0$, then the summability property in Theorem 3.1(ii) reads $\sum_{k}k(k+1)\|\nabla f(x_{k+1})\|^{2}<+\infty$. For instance, if $\beta_{k}$ is non-increasing and $b_{k}\geq c+\frac{\beta_{k+1}}{kh}$, $c>0$, then (11) is in force with $c$ as a lower bound on the infimum. In summary, we get ${\mathcal{O}}\left(1/((k+1)k)\right)$ under fairly general assumptions on the growth of the sequences $\left({\beta_{k}}\right)_{k\in{\mathbb{N}}}$ and $\left({b_{k}}\right)_{k\in{\mathbb{N}}}$.

Let us now exemplify choices of $\beta_{k}$ and $b_{k}$ that have the appropriate growth as above and comply with (11) (hence $(\mathcal{G}^{\mathrm{dis}}_{2})$ ) as well as $(\mathcal{G}^{\mathrm{dis}}_{3})$ .

$\bullet$

Let us take $\beta_{k}=\beta>0$ and $b_{k}=1$ , which is the discrete analogue of the continuous case 1 considered in Section 2.2 (recall that the continuous version was analyzed in APR2 ). Note however that APR2 did not study the discrete (algorithmic) case and thus our result is new even for this system. In such a case, $\delta_{k}=h^{2}(k+1)(k-\beta/h)$ and $\beta_{k}$ is obviously non-increasing. Thus, if $\alpha>3$ , then one easily checks that (11) (hence $(\mathcal{G}^{\mathrm{dis}}_{2})$ ) and $(\mathcal{G}^{\mathrm{dis}}_{3})$ are in force for all $k\geq\frac{\alpha-2}{\alpha-3}\frac{\beta}{h}+\frac{2}{\alpha-3}$ .

$\bullet$

Consider now the discrete counterpart of case 2 in Section 2.2. Take $\beta_{k}=\beta>0$ and $b_{k}=1+\beta/(hk)$ (one can even consider the more general case $b(t)=1+b/(hk)$ , $b>0$ , for which our discussion remains true under minor modifications; we do not pursue this for the sake of simplicity). Thus $\delta_{k}=h^{2}(k+1)k$ . This case was studied in SDJS both in the continuous setting and for the gradient algorithm, but not for the proximal algorithm. This choice is a special case of the one discussed above since $\beta_{k}$ is the constant sequence and $c=1$ . Thus (11) (hence $(\mathcal{G}^{\mathrm{dis}}_{2})$ ) holds. $(\mathcal{G}^{\mathrm{dis}}_{3})$ is also verified for all $k\geq\frac{2}{\alpha-3}$ as soon as $\alpha>3$ .
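The two parameter choices above can be checked with exact rational arithmetic. The sketch below is only illustrative: it assumes that $\delta_{k}$ is given by the expression $h\big(b_{k}hk-\beta_{k+1}-k(\beta_{k+1}-\beta_{k})\big)(k+1)$ chosen in the proof of Theorem 3.1, and that $(\mathcal{G}^{\mathrm{dis}}_{3})$ reads $\delta_{k+1}-\delta_{k}-(\alpha-1)\frac{\delta_{k}}{k+1}\leq 0$ , which is the form in which it is used in that proof. The function names and the numerical values of $h$ , $\beta$ , $\alpha$ are hypothetical.

```python
from fractions import Fraction

def delta(k, h, beta, b):
    """delta_k = h * (b_k*h*k - beta_{k+1} - k*(beta_{k+1} - beta_k)) * (k + 1)."""
    return h * (b(k) * h * k - beta(k + 1) - k * (beta(k + 1) - beta(k))) * (k + 1)

def g3_holds(k, h, alpha, beta, b):
    """Growth condition in the form used in the proof (an assumption of this sketch)."""
    d_k = delta(k, h, beta, b)
    d_next = delta(k + 1, h, beta, b)
    return d_next - d_k - (alpha - 1) * d_k / (k + 1) <= 0

# Illustrative values (hypothetical): h = 1/2, beta = 3, alpha = 4.
h, beta0, alpha = Fraction(1, 2), Fraction(3), Fraction(4)
const_beta = lambda k: beta0                 # beta_k = beta for all k
b_case1 = lambda k: Fraction(1)              # first choice:  b_k = 1
b_case2 = lambda k: 1 + beta0 / (h * k)      # second choice: b_k = 1 + beta/(h k)

# Thresholds on k stated in the text for the two choices.
k_min_case1 = (alpha - 2) / (alpha - 3) * beta0 / h + 2 / (alpha - 3)
k_min_case2 = 2 / (alpha - 3)
```

With these values the condition holds exactly from $k=14$ onwards in the first case and from $k=2$ onwards in the second, in agreement with the two thresholds given in the text.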

Proof

Given $x^{\star}\in\operatorname{argmin}_{{\mathcal{H}}}f$ , set

[TABLE]

where

[TABLE]

and $\left({\delta_{k}}\right)_{k\in{\mathbb{N}}}$ is a positive sequence that will be adjusted. Observe that $E_{k}$ is nothing but the discrete analogue of the Lyapunov function (2). Set $\Delta E_{k}:=E_{k+1}-E_{k}$ , i.e.,

[TABLE]

Let us evaluate the last term of the above expression with the help of the three-point identity $\frac{1}{2}\left\|{v_{k+1}}\right\|^{2}-\frac{1}{2}\left\|{v_{k}}\right\|^{2}=\langle v_{k+1}-v_{k},\,v_{k+1}\rangle-\frac{1}{2}\left\|{v_{k+1}-v_{k}}\right\|^{2}.$

Using successively the definition of $v_{k}$ and (8), we get

[TABLE]

Set shortly $C_{k}=\beta_{k+1}+k(\beta_{k+1}-\beta_{k})-b_{k}hk$ . We have obtained

[TABLE]

By virtue of $(\mathcal{G}^{\mathrm{dis}}_{2})$ , we have

[TABLE]

Then, in the above expression, the coefficient of $\|\nabla f(x_{k+1})\|^{2}$ is less than or equal to zero, which gives

[TABLE]

According to the (convex) subdifferential inequality and $C_{k}<0$ (by $(\mathcal{G}^{\mathrm{dis}}_{2})$ ), we infer

[TABLE]

Take $\delta_{k}:=-hC_{k}(k+1)=h\Big{(}b_{k}hk-\beta_{k+1}-k(\beta_{k+1}-\beta_{k})\Big{)}(k+1)$ so that the terms $f(x_{k})-f(x_{k+1})$ cancel in $E_{k+1}-E_{k}$ . We obtain

[TABLE]

Equivalently

[TABLE]

By assumption $(\mathcal{G}^{\mathrm{dis}}_{3})$ , we have $\delta_{k+1}-\delta_{k}-(\alpha-1)\frac{\delta_{k}}{k+1}\leq 0$ . Therefore, the sequence $\left({E_{k}}\right)_{k\in{\mathbb{N}}}$ is non-increasing, which, by definition of $E_{k}$ , gives, for $k\geq 0$

[TABLE]

By summing the inequalities

[TABLE]

we finally obtain $\sum_{k}\delta_{k}\beta_{k+1}\|\nabla f(x_{k+1})\|^{2}<+\infty.\quad$ ∎

3.1.2 Non-smooth case

Let $f:{\mathcal{H}}\to{\mathbb{R}}\cup\left\{+\infty\right\}$ be a proper lower semicontinuous and convex function. We rely on the basic properties of the Moreau-Yosida regularization. Let $f_{\lambda}$ be the Moreau envelope of $f$ of index $\lambda>0$ , which is defined by:

[TABLE]

We recall that $f_{\lambda}$ is a convex function, whose gradient is $\lambda^{-1}$ -Lipschitz continuous, such that $\operatorname{argmin}_{{\mathcal{H}}}f_{\lambda}=\operatorname{argmin}_{{\mathcal{H}}}f$ . The interested reader may refer to BC ; Bre1 for a comprehensive treatment of the Moreau envelope in a Hilbert setting. Since the set of minimizers is preserved by taking the Moreau envelope, the idea is to replace $f$ by $f_{\lambda}$ in the previous algorithm, and take advantage of the fact that $f_{\lambda}$ is continuously differentiable. The Hessian dynamic attached to $f_{\lambda}$ becomes

[TABLE]

However, we do not really need to work on this system (which requires $f_{\lambda}$ to be ${\mathcal{C}}^{2}$ ), but with the discretized form which only requires the function to be continuously differentiable, as is the case of $f_{\lambda}$ . Then, algorithm $\mathrm{(IPAHD)}$ applied to $f_{\lambda}$ now reads

[TABLE]

By applying Theorem 3.1 we obtain that under the assumption $(\mathcal{G}^{\mathrm{dis}}_{2})$ and $(\mathcal{G}^{\mathrm{dis}}_{3})$ ,

$f_{\lambda}(x_{k})-\min_{{\mathcal{H}}}f={\mathcal{O}}\left(\frac{1}{\delta_{k}}\right),\quad\sum_{k}\delta_{k}\beta_{k+1}\|\nabla f_{\lambda}(x_{k+1})\|^{2}<+\infty.$

Thus, we just need to formulate these results in terms of $f$ and its proximal mapping. This is straightforward thanks to the following formulae from proximal calculus BC :

[TABLE]
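As a concrete illustration of these formulae, take the scalar function $f=|\cdot|$ . The sketch below assumes the two standard identities $f_{\lambda}(x)=f(p)+\frac{1}{2\lambda}\|x-p\|^{2}$ and $\nabla f_{\lambda}(x)=\frac{1}{\lambda}(x-p)$ with $p=\operatorname{prox}_{\lambda f}(x)$ ; for the absolute value, $p$ is given by soft-thresholding and $f_{\lambda}$ is the Huber function. The function names are hypothetical.

```python
import math

def prox_abs(x, lam):
    """Proximal mapping of lam*|.| at x (soft-thresholding)."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def moreau_env_abs(x, lam):
    """f_lam(x) = f(p) + (x - p)^2 / (2*lam), with p = prox_{lam f}(x) and f = |.|."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2.0 * lam)

def grad_moreau_env_abs(x, lam):
    """grad f_lam(x) = (x - prox_{lam f}(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam
```

For $|x|\leq\lambda$ this gives $f_{\lambda}(x)=x^{2}/(2\lambda)$ and $\nabla f_{\lambda}(x)=x/\lambda$ , while for $|x|>\lambda$ it gives $f_{\lambda}(x)=|x|-\lambda/2$ and a gradient of modulus one, so the gradient is indeed $\lambda^{-1}$ -Lipschitz continuous.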

We obtain the following relaxed inertial proximal algorithm (NS stands for Non-Smooth):

[TABLE]

Theorem 3.2

Let $f:{\mathcal{H}}\to{\mathbb{R}}\cup\left\{+\infty\right\}$ be a convex, lower semicontinuous, proper function. Let the sequence $\left({\delta_{k}}\right)_{k\in{\mathbb{N}}}$ be as defined in (10), and suppose that the growth conditions $(\mathcal{G}^{\mathrm{dis}}_{2})$ and $(\mathcal{G}^{\mathrm{dis}}_{3})$ in Theorem 3.1 are satisfied. Then, for any sequence $\left({x_{k}}\right)_{k\in{\mathbb{N}}}$ generated by (IPAHD-NS) , the following holds

[TABLE]

3.2 Gradient algorithms

Take $f$ a convex function whose gradient is $L$ -Lipschitz continuous. Our analysis is based on the dynamic (DIN-AVD) ${}_{\alpha,\beta,1+\frac{\beta}{t}}$ considered in Theorem 2.3 with damping parameters $\alpha\geq 3$ , $\beta\geq 0$ . Consider the time discretization of (DIN-AVD) ${}_{\alpha,\beta,1+\frac{\beta}{t}}$

[TABLE]

with $y_{k}$ inspired by Nesterov’s accelerated scheme. We obtain the following scheme:

[TABLE]

Following AC2 , set $t_{k+1}=\frac{k}{\alpha-1}$ , whence $t_{k}=1+t_{k+1}\alpha_{k}$ .
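The relations satisfied by $t_{k}$ can be verified exactly. The following sketch assumes that the extrapolation coefficient is $\alpha_{k}=1-\frac{\alpha}{k}$ (an assumption here, since the scheme itself is not reproduced in this text); it checks the identity $t_{k}=1+t_{k+1}\alpha_{k}$ and the inequality $t_{k+1}^{2}-t_{k+1}-t_{k}^{2}\leq 0$ , which is used in the proof of Theorem 3.3 and is where the requirement $\alpha\geq 3$ enters. The function names are hypothetical.

```python
from fractions import Fraction

def t(k, alpha):
    """t_k = (k - 1)/(alpha - 1), equivalently t_{k+1} = k/(alpha - 1)."""
    return Fraction(k - 1) / (alpha - 1)

def alpha_k(k, alpha):
    """Assumed extrapolation coefficient alpha_k = 1 - alpha/k."""
    return 1 - Fraction(alpha) / k

def t_gap(k, alpha):
    """t_{k+1}^2 - t_{k+1} - t_k^2, which equals ((3 - alpha)*k - 1)/(alpha - 1)^2."""
    return t(k + 1, alpha) ** 2 - t(k + 1, alpha) - t(k, alpha) ** 2
```

Since the gap equals $\frac{(3-\alpha)k-1}{(\alpha-1)^{2}}$ , it is non-positive for every $k\geq 1$ when $\alpha\geq 3$ , whereas for instance $\alpha=2$ gives the positive value $k-1$ for $k\geq 2$ .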

Given $x^{\star}\in\operatorname{argmin}_{{\mathcal{H}}}f$ , our Lyapunov analysis is based on the sequence $\left({E_{k}}\right)_{k\in{\mathbb{N}}}$

[TABLE]

Theorem 3.3

Let $f:{\mathcal{H}}\to{\mathbb{R}}$ be a convex function whose gradient is $L$ -Lipschitz continuous. Let $\left({x_{k}}\right)_{k\in{\mathbb{N}}}$ be a sequence generated by algorithm (IGAHD) , where $\alpha\geq 3$ , $0\leq\beta<2\sqrt{s}$ and $s\leq 1/L$ . Then the sequence $\left({E_{k}}\right)_{k\in{\mathbb{N}}}$ defined by (17)-(18) is non-increasing, and the following convergence rates are satisfied:

[TABLE]

Proof

We rely on the following reinforced version of the gradient descent lemma (Lemma 1 in Appendix A.1). Since $s\leq\frac{1}{L}$ , and $\nabla f$ is $L$ -Lipschitz continuous,

[TABLE]

for all $x,y\in{\mathcal{H}}$ . Let us write it successively at $y=y_{k}$ and $x=x_{k}$ , then at $y=y_{k}$ , $x=x^{\star}$ . According to $x_{k+1}=y_{k}-s\nabla f(y_{k})$ and $\nabla f(x^{\star})=0$ , we get

[TABLE]

Multiplying (19) by $t_{k+1}-1\geq 0$ , then adding (20), we derive that

[TABLE]

Let us multiply (3.2) by $t_{k+1}$ to make $E_{k}$ appear. We obtain

[TABLE]

Since $\alpha\geq 3$ we have $t_{k+1}^{2}-t_{k+1}-t_{k}^{2}\leq 0$ , which gives

[TABLE]

According to the definition of $E_{k}$ , we infer

[TABLE]

Let us compute this last expression with the help of the elementary identity

[TABLE]

By definition of $v_{k}$ , according to (IGAHD) and $t_{k}-1=t_{k+1}\alpha_{k}$ , we have

[TABLE]

Hence

[TABLE]

Collecting the above results, we obtain

[TABLE]

Equivalently

[TABLE]

with

[TABLE]

Consequently

[TABLE]

where

[TABLE]

When $\beta=0$ we have $B_{k}\geq 0$ . Let us analyze the sign of $B_{k}$ in the case $\beta>0$ . Set $Y=\nabla f(y_{k})$ , $X=\nabla f(x_{k})$ . We have

[TABLE]

Elementary algebra gives that the above quadratic form is non-negative when

[TABLE]

Recall that $t_{k}$ is of order $k$ . Hence, this inequality is satisfied for $k$ large enough if $(\beta\sqrt{s}-s)^{2}<s^{2}$ , which is equivalent to $\beta<2\sqrt{s}.$ Under this condition $E_{k+1}-E_{k}\leq 0$ , which gives conclusion $(i)$ . A similar argument gives that for $0<\epsilon<2\sqrt{s}\beta-\beta^{2}$ (such an $\epsilon$ exists according to the assumption $0<\beta<2\sqrt{s}$ )

[TABLE]

After summation of these inequalities, we obtain conclusion $(ii)$ . ∎

Remark 3

In (WRJ, Theorem 8), the same convergence rate as in Theorem 3.3 on the objective values is obtained for a very different discretization of the system (DIN-AVD) ${}_{\alpha,b\sqrt{s},1+\frac{\alpha\sqrt{s}}{2t}}$ . Their scheme is thus related to, but quite different from, (IGAHD) . Their claims also require intricate conditions relating $(\alpha,b,s,L)$ to hold true.

In Theorem 3.3, the condition $\beta<2\sqrt{s}$ essentially reveals that in order to preserve the acceleration offered by the viscous damping, the geometric damping should not be too large. It is an open question whether this constraint is a technical artifact or is fundamental to acceleration. We leave it to future work.

Remark 4

From $\sum_{k}k^{2}\|\nabla f(x_{k})\|^{2}<+\infty$ we immediately infer that for $k\geq 1$

[TABLE]

A similar argument holds for $y_{k}$ . Hence

[TABLE]

Remark 5

In Theorem 3.3, the convergence property of the values is expressed in terms of the sequence $\left({x_{k}}\right)_{k\in{\mathbb{N}}}$ . It is natural to ask whether a similar result is true for the sequence $\left({y_{k}}\right)_{k\in{\mathbb{N}}}$ . This is an open question in the case of Nesterov’s accelerated gradient method and the corresponding FISTA algorithm for structured minimization Nest4 ; BT . In the case of the Hessian-driven damping algorithms, we give a partial answer to this question. By the classical descent lemma, and the monotonicity of $\nabla f$ we have

[TABLE]

According to $x_{k+1}=y_{k}-s\nabla f(y_{k})$ we obtain

[TABLE]

From Theorem 3.3 we deduce that

[TABLE]

Remark 6

When $f$ is a proper lower semicontinuous convex function, but not necessarily smooth, we follow the same reasoning as in Section 3.1.2. We consider minimizing the Moreau envelope $f_{\lambda}$ of $f$ , whose gradient is $1/\lambda$ -Lipschitz continuous, and then apply (IGAHD) to $f_{\lambda}$ . We omit the details for the sake of brevity. This observation will be very useful for solving even structured composite problems, as we will describe in Section 6.

4 Inertial dynamics for strongly convex functions

4.1 Smooth case

Recall the classical definition of strong convexity:

Definition 1

A function $f:{\mathcal{H}}\to\mathbb{R}$ is said to be $\mu$ -strongly convex for some $\mu>0$ if $f-\frac{\mu}{2}\|\cdot\|^{2}$ is convex.

For strongly convex functions, a suitable choice of $\gamma$ and $\beta$ in ${\rm(DIN)}_{\gamma,\beta}$ provides exponential decay of the value function (hence of the trajectory), and of the gradients.

Theorem 4.1

Suppose that ( $\mathrm{H}$ ) holds where $f:{\mathcal{H}}\to\mathbb{R}$ is in addition $\mu$ -strongly convex for some $\mu>0$ . Let $x(\cdot):[t_{0},+\infty[\to{\mathcal{H}}$ be a solution trajectory of

[TABLE]

Suppose that $0\leq\beta\leq\frac{1}{2\sqrt{\mu}}$ . Then, the following hold:

(i)

for all $t\geq t_{0}$

[TABLE]

where $C:=f(x(t_{0}))-\min_{{\mathcal{H}}}f+\mu\|x(t_{0})-x^{\star}\|^{2}+\|\dot{x}(t_{0})+\beta\nabla f(x(t_{0}))\|^{2}.$

(ii)

There exists some constant $C_{1}>0$ such that, for all $t\geq t_{0}$

[TABLE]

Moreover, $\int_{t_{0}}^{\infty}e^{\frac{\sqrt{\mu}}{2}t}\|\dot{x}(t)\|^{2}dt<+\infty.$

When $\beta=0$ , we have $f(x(t))-\min_{{\mathcal{H}}}f={\mathcal{O}}\left(e^{-\sqrt{\mu}t}\right)$ as $t\to+\infty.$

Remark 7

When $\beta=0$ , Theorem 4.1 recovers (Siegel, Theorem 2.2). In the case $\beta>0$ , a result on a related but different dynamical system can be found in (WRJ, Theorem 1) (their rate is also slightly worse than ours). Our gradient estimate is distinctly new in the literature.

Proof

(i)

Let $x^{\star}$ be the unique minimizer of $f$ . Define $\mathcal{E}:[t_{0},+\infty[\to{\mathbb{R}}^{+}$ by

[TABLE]

Set $v(t)=\sqrt{\mu}(x(t)-x^{\star})+\dot{x}(t)+\beta\nabla f(x(t))$ . Differentiating $\mathcal{E}(\cdot)$ gives

[TABLE]

Using (22), we get

[TABLE]

After developing and simplification, we obtain

[TABLE]

By strong convexity of $f$ we have

[TABLE]

Thus, combining the last two relations we obtain

[TABLE]

where (the variable $t$ is omitted to lighten the notation)

[TABLE]

Let us formulate $A$ with $\mathcal{E}(t)$ .

[TABLE]

After developing and simplifying, we obtain

[TABLE]

Since $0\leq\beta\leq\frac{1}{\sqrt{\mu}}$ , we immediately get $\frac{\beta}{\sqrt{\mu}}-\frac{\beta^{2}}{2}\geq\frac{\beta}{2\sqrt{\mu}}.$ Hence

[TABLE]

Let us use again the strong convexity of $f$ to write

[TABLE]

By combining the two inequalities above, we obtain

[TABLE]

where $B=\frac{\mu}{4}\|x(t)-x^{\star}\|^{2}+\frac{\beta}{2\sqrt{\mu}}\|\nabla f(x)\|^{2}-\beta\sqrt{\mu}\|x-x^{\star}\|\|\nabla f(x)\|$ .

Set $X=\|x-x^{\star}\|$ , $Y=\|\nabla f(x)\|$ . Elementary algebraic computation gives that, under the condition $0\leq\beta\leq\frac{1}{2\sqrt{\mu}}$

[TABLE]

Hence for $0\leq\beta\leq\frac{1}{2\sqrt{\mu}}$

[TABLE]

By integrating the differential inequality above we obtain

[TABLE]

By definition of $\mathcal{E}(t)$ , we infer

[TABLE]

and

[TABLE]

(ii)

Set $C=2\mathcal{E}(t_{0})e^{\frac{\sqrt{\mu}}{2}t_{0}}$ . Developing the above expression, we obtain

[TABLE]

By convexity of $f$ we have $\left\langle x(t)-x^{\star},\nabla f(x(t))\right\rangle\geq f(x(t))-f(x^{\star})$ . Moreover,

[TABLE]

Combining the above results, we obtain

[TABLE]

Set $Z(t):=2\beta(f(x(t))-f(x^{\star}))+\sqrt{\mu}\|x(t)-x^{\star}\|^{2}$ . We have

[TABLE]

By integrating this differential inequality, elementary computation gives

[TABLE]

Noticing that the integral of $e^{\sqrt{\mu}s}$ over $[t_{0},t]$ is of order $e^{\sqrt{\mu}t}$ , the above estimate reflects the fact that, as $t\to+\infty$ , the gradient terms $\|\nabla f(x(t))\|^{2}$ tend to zero at an exponential rate (on average, not pointwise). ∎
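The exponential decay of the Lyapunov function can be observed numerically on a one-dimensional quadratic. The sketch below rests on several assumptions that are not displayed in this text: the dynamic is taken to be $\ddot{x}+2\sqrt{\mu}\dot{x}+\beta\nabla^{2}f(x)\dot{x}+\nabla f(x)=0$ , the Lyapunov function to be $\mathcal{E}(t)=f(x(t))-f(x^{\star})+\frac{1}{2}\|v(t)\|^{2}$ with $v$ as defined in the proof, and the decay rate to be $\frac{\sqrt{\mu}}{2}$ , as suggested by the exponents appearing in the theorem. The function name, the test function $f(x)=\frac{\mu}{2}x^{2}$ and the Runge-Kutta integrator are choices made for illustration only.

```python
import math

def simulate_energy(mu=1.0, beta=0.5, x0=1.0, xdot0=0.0, t_end=10.0, dt=0.01):
    """Integrate x'' + (2*sqrt(mu) + beta*mu)*x' + mu*x = 0 with classical RK4.

    This is the assumed dynamic for f(x) = mu*x**2/2, whose gradient is mu*x
    and whose Hessian is the constant mu. Returns a list of (t, E(t)) pairs.
    """
    damping = 2.0 * math.sqrt(mu) + beta * mu

    def rhs(x, xdot):
        return xdot, -damping * xdot - mu * x

    def energy(x, xdot):
        # v = sqrt(mu)*(x - x*) + x' + beta*grad f(x), with x* = 0
        v = math.sqrt(mu) * x + xdot + beta * mu * x
        return 0.5 * mu * x * x + 0.5 * v * v

    x, xdot = x0, xdot0
    samples = [(0.0, energy(x, xdot))]
    for i in range(1, int(round(t_end / dt)) + 1):
        k1 = rhs(x, xdot)
        k2 = rhs(x + 0.5 * dt * k1[0], xdot + 0.5 * dt * k1[1])
        k3 = rhs(x + 0.5 * dt * k2[0], xdot + 0.5 * dt * k2[1])
        k4 = rhs(x + dt * k3[0], xdot + dt * k3[1])
        x += dt * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
        xdot += dt * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
        samples.append((i * dt, energy(x, xdot)))
    return samples

samples = simulate_energy()
E0 = samples[0][1]
# Largest ratio E(t) / (E(0) * exp(-sqrt(mu)*t/2)) along the trajectory (mu = 1).
worst_ratio = max(E / (E0 * math.exp(-0.5 * t)) for t, E in samples)
```

With $\mu=1$ and $\beta=\frac{1}{2\sqrt{\mu}}$ , the computed energy stays below $\mathcal{E}(0)e^{-\frac{\sqrt{\mu}}{2}t}$ on the whole time grid; on this example the actual decay is faster than the guaranteed rate.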

Remark 8

Let us justify the choice of $\gamma=2\sqrt{\mu}$ in Theorem 4.1. Indeed, considering

[TABLE]

a similar proof to that described above can be performed on the basis of the Lyapunov function

[TABLE]

Under the conditions $\gamma\leq\sqrt{\mu}\,\mbox{ and }\,\beta\leq\frac{\mu}{2\gamma^{3}}$ we obtain the exponential convergence rate

[TABLE]

Taking $\gamma=\sqrt{\mu}$ gives the best convergence rate, and the result of Theorem 4.1.

4.2 Non-smooth case

Following AABR , ${\rm(DIN)}_{\gamma,\beta}$ is equivalent to the first-order system

[TABLE]

This allows us to extend ${\rm(DIN)}_{\gamma,\beta}$ to the case of a proper lower semicontinuous convex function $f:{\mathcal{H}}\to{\mathbb{R}}\cup\{+\infty\}$ . Replacing the gradient of $f$ by its subdifferential, we obtain its Non-Smooth version:

[TABLE]

Most properties of ${\rm(DIN)}_{\gamma,\beta}$ are still valid for this generalized version. To illustrate it, let us consider the following extension of Theorem 4.1.

Theorem 4.2

Suppose that $f:{\mathcal{H}}\to{\mathbb{R}}\cup\{+\infty\}$ is lower semicontinuous and $\mu$ -strongly convex for some $\mu>0$ . Let $x(\cdot)$ be a trajectory of (DIN-NS) ${}_{2\sqrt{\mu},\beta}$ . Suppose that $0\leq\beta\leq\frac{1}{2\sqrt{\mu}}$ . Then

[TABLE]

Proof

Let us introduce $\mathcal{E}:[t_{0},+\infty[\to{\mathbb{R}}^{+}$ defined by

[TABLE]

that will serve as a Lyapunov function. Then, the proof follows the same lines as that of Theorem 4.1, with the use of the derivation rule of Brezis (Bre1, Lemme 3.3, p. 73).

5 Inertial algorithms for strongly convex functions

We will show in this section that the exponential convergence of Theorem 4.1 for the inertial system (22) translates into linear convergence in the algorithmic case under proper discretization.

5.1 Proximal algorithms

5.1.1 Smooth case

Consider the inertial dynamic (22). Its implicit discretization similar to that performed before gives

[TABLE]

where $h$ is the positive step size. Set $s=h^{2}$ . We obtain the following inertial proximal algorithm with Hessian damping (SC refers to Strongly Convex):

[TABLE]

Theorem 5.1

Assume that $f:{\mathcal{H}}\rightarrow{\mathbb{R}}$ is a convex $\mathcal{C}^{1}$ function and $\mu$ -strongly convex, $\mu>0$ , and suppose that

[TABLE]

Set $q={\frac{1}{1+\frac{1}{2}\sqrt{\mu s}}}$ , which satisfies $0<q<1$ . Then, the sequence $\left({x_{k}}\right)_{k\in{\mathbb{N}}}$ generated by the algorithm (IPAHD-SC) obeys, for any $k\geq 1$

[TABLE]

where $E_{1}=f(x_{1})-f(x^{\star})+\frac{1}{2}\|\sqrt{\mu}(x_{1}-x^{\star})+\frac{1}{\sqrt{s}}(x_{1}-x_{0})+\beta\nabla f(x_{1})\|^{2}$ . Moreover, the gradients converge exponentially fast to zero: setting $\theta=\frac{1}{1+\sqrt{\mu s}}$ which belongs to $]0,1[$ , we have

[TABLE]

Remark 9

We are not aware of any result of this kind for such a proximal algorithm.

Proof

Let $x^{\star}$ be the unique minimizer of $f$ , and consider the sequence $\left({E_{k}}\right)_{k\in{\mathbb{N}}}$

[TABLE]

where $v_{k}=\sqrt{\mu}(x_{k}-x^{\star})+\frac{1}{\sqrt{s}}(x_{k}-x_{k-1})+\beta\nabla f(x_{k}).$

We will use the following equivalent formulation of the algorithm (IPAHD-SC)

[TABLE]

We have

[TABLE]

Using successively the definition of $v_{k}$ and (24), we get

[TABLE]

Write shortly $B_{k}=\sqrt{\mu}(x_{k+1}-x_{k})+\sqrt{s}\nabla f(x_{k+1})$ . We have

[TABLE]

By virtue of strong convexity of $f$

[TABLE]

Combining the above results, and after dividing by $\sqrt{s}$ , we get

[TABLE]

which gives, after developing and simplification

[TABLE]

According to $0\leq\beta\leq\frac{1}{2\sqrt{\mu}}$ , we have $\beta-\frac{\beta^{2}\sqrt{\mu}}{2}\geq\frac{3\beta}{4}$ , which, with Cauchy-Schwarz inequality, gives

[TABLE]

Let us use again the strong convexity of $f$ to write

[TABLE]

Combining the two inequalities above, we get

[TABLE]

Let us rearrange the terms as follows

[TABLE]

Let us examine the sign of the last two terms in the rhs of inequality above.

Term 1

Set $X=\|x_{k+1}-x^{\star}\|$ , $Y=\|\nabla f(x_{k+1})\|$ . Elementary algebra gives that

[TABLE]

holds true under the condition $0\leq\beta\leq\frac{1}{2\sqrt{\mu}}$ . Hence, under this condition

[TABLE]

Term 2

Set $X=\left\|{x_{k+1}-x_{k}}\right\|$ , $Y=\left\|{\nabla f(x_{k+1})}\right\|$ . Elementary algebra gives that

[TABLE]

holds true under the condition $\frac{\sqrt{\mu}}{2s}+\frac{\mu}{\sqrt{s}}\geq\frac{\mu}{\beta}$ . Hence, under this condition

[TABLE]

In turn, the condition $\frac{\sqrt{\mu}}{2s}+\frac{\mu}{\sqrt{s}}\geq\frac{\mu}{\beta}$ is equivalent to $\sqrt{s}\leq\frac{\beta}{2}\left(1+\sqrt{1+\frac{2}{\beta\sqrt{\mu}}}\right).$

Clearly, this condition is satisfied if $\sqrt{s}\leq\beta$ .
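For completeness, the equivalence between the two forms of the step-size condition can be written out as follows (for $\beta>0$ ; this is only a worked version of the elementary algebra, viewing the condition as a quadratic inequality in $\sqrt{s}$ ).

```latex
% Equivalence of the two step-size conditions, for beta > 0, mu > 0, s > 0.
\begin{align*}
\frac{\sqrt{\mu}}{2s}+\frac{\mu}{\sqrt{s}}\geq\frac{\mu}{\beta}
&\iff \frac{\beta}{\sqrt{\mu}}+2\beta\sqrt{s}\geq 2s
   && \text{(multiply by } 2s\beta/\mu>0\text{)}\\
&\iff 2(\sqrt{s})^{2}-2\beta\sqrt{s}-\frac{\beta}{\sqrt{\mu}}\leq 0\\
&\iff \sqrt{s}\leq\frac{2\beta+\sqrt{4\beta^{2}+8\beta/\sqrt{\mu}}}{4}
   =\frac{\beta}{2}\left(1+\sqrt{1+\frac{2}{\beta\sqrt{\mu}}}\right).
\end{align*}
% The other root of the quadratic is negative, so only the upper bound matters.
% Since 1+\sqrt{1+2/(\beta\sqrt{\mu})}\geq 2, the bound is at least \beta,
% hence \sqrt{s}\leq\beta is sufficient.
```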

Let us put the above results together. We have obtained that, under the conditions $0\leq\beta\leq\frac{1}{2\sqrt{\mu}}$ and $\sqrt{s}\leq\beta$ ,

[TABLE]

Set $q=\frac{1}{1+\frac{1}{2}\sqrt{\mu s}}$ , which satisfies $0<q<1$ . From this, we infer $E_{k}\leq qE_{k-1}$ which gives

[TABLE]

Since $E_{k}\geq f(x_{k})-f(x^{\star})$ , we finally obtain

[TABLE]

Let us now estimate the convergence rate of the gradients to zero. According to the exponential decay of $\left({E_{k}}\right)_{k\in{\mathbb{N}}}$ , as given in (25), and by definition of $E_{k}$ , we have, for all $k\geq 1$

[TABLE]

After developing, we get

[TABLE]

By convexity of $f$ , we have

[TABLE]

Moreover, $\left\langle x_{k}-x_{k-1},x_{k}-x^{\star}\right\rangle\geq\frac{1}{2}\|x_{k}-x^{\star}\|^{2}-\frac{1}{2}\|x_{k-1}-x^{\star}\|^{2}$ .

Combining the above results, we obtain

[TABLE]

Set $Z_{k}:=2\beta(f(x_{k})-f(x^{\star}))+\sqrt{\mu}\|x_{k}-x^{\star}\|^{2}$ . We have, for all $k\geq 1$

[TABLE]

Set $\theta=\frac{1}{1+\sqrt{\mu s}}$ which belongs to $]0,1[$ . Equivalently

[TABLE]

Iterating this linear recursive inequality gives

[TABLE]

Then notice that $\frac{\theta}{q}=\frac{1+\frac{1}{2}\sqrt{\mu s}}{1+\sqrt{\mu s}}<1$ , which gives

[TABLE]

Collecting the above results, we obtain

[TABLE]

Using again the inequality $\theta<q$ , and after reindexing, we finally obtain

[TABLE]

∎
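The contraction $E_{k}\leq qE_{k-1}$ obtained in the proof can be observed on a one-dimensional quadratic. The sketch below is not the algorithm as stated in the paper, whose display is not reproduced here: it assumes that the implicit discretization reads $\frac{1}{s}(x_{k+1}-2x_{k}+x_{k-1})+\frac{2\sqrt{\mu}}{\sqrt{s}}(x_{k+1}-x_{k})+\frac{\beta}{\sqrt{s}}(\nabla f(x_{k+1})-\nabla f(x_{k}))+\nabla f(x_{k+1})=0$ , a reading that is consistent with the definitions of $v_{k}$ and $B_{k}$ used in the proof. The function name, the test function $f(x)=\frac{\mu}{2}x^{2}$ and the parameter values are hypothetical.

```python
import math

def ipahd_sc_energies(mu=1.0, beta=0.5, s=0.04, x0=1.0, n_iter=50):
    """Run the assumed inertial proximal scheme on f(x) = mu*x**2/2 (x* = 0).

    Returns the energies E_k = f(x_k) - f(x*) + 0.5*v_k**2 along the iterates.
    """
    root_s = math.sqrt(s)
    denom = 1.0 + 2.0 * math.sqrt(mu * s)
    theta = (beta * root_s + s) / denom        # proximal parameter (assumed form)

    def grad(x):
        return mu * x

    def prox(y):
        # prox_{theta f}(y) for the quadratic f
        return y / (1.0 + theta * mu)

    def energy(x, x_prev):
        v = math.sqrt(mu) * x + (x - x_prev) / root_s + beta * grad(x)
        return 0.5 * mu * x * x + 0.5 * v * v

    x_prev, x = x0, x0                          # start at rest
    energies = [energy(x, x_prev)]
    for _ in range(n_iter):
        y = x + (x - x_prev) / denom + (beta * root_s / denom) * grad(x)
        x_prev, x = x, prox(y)
        energies.append(energy(x, x_prev))
    return energies

mu, beta, s = 1.0, 0.5, 0.04                    # beta <= 1/(2*sqrt(mu)), sqrt(s) <= beta
q = 1.0 / (1.0 + 0.5 * math.sqrt(mu * s))
energies = ipahd_sc_energies(mu, beta, s)
ratios = [e_next / e for e, e_next in zip(energies, energies[1:])]
```

With these values, which satisfy $0\leq\beta\leq\frac{1}{2\sqrt{\mu}}$ and $\sqrt{s}\leq\beta$ , every ratio $E_{k+1}/E_{k}$ stays below $q=\frac{1}{1+\frac{1}{2}\sqrt{\mu s}}$ .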

5.1.2 Non-smooth case

Let $f:{\mathcal{H}}\to{\mathbb{R}}\cup\left\{+\infty\right\}$ be a proper, lower semicontinuous and convex function. We argue as in Section 3.1.2 by replacing $f$ with its Moreau envelope $f_{\lambda}$ . The key observation is that the Moreau-Yosida regularization also preserves strong convexity, though with a different modulus as shown by the following result.

Proposition 1

Suppose that $f:{\mathcal{H}}\to{\mathbb{R}}\cup\{+\infty\}$ is a proper, lower semicontinuous convex function. Then, for any $\lambda>0$ and $\mu>0$

[TABLE]

Proof

If $f$ is strongly convex with constant $\mu>0$ , we have $f=g+\frac{\mu}{2}\|\cdot\|^{2}$ for some convex function $g$ . Elementary calculus (see e.g., (BC, Exercise 12.6)) gives, with $\theta=\frac{\lambda}{1+\lambda\mu}$ ,

[TABLE]

Since $x\mapsto g_{\theta}\left(\frac{1}{1+\lambda\mu}\,x\right)$ is convex, the above formula shows that $f_{\lambda}$ is strongly convex with constant $\frac{\mu}{1+\lambda\mu}$ . ∎
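A quick sanity check of the modulus $\frac{\mu}{1+\lambda\mu}$ is provided by the scalar quadratic $f(y)=\frac{\mu}{2}y^{2}$ , for which $\operatorname{prox}_{\lambda f}(x)=\frac{x}{1+\lambda\mu}$ and therefore $f_{\lambda}(x)=\frac{1}{2}\frac{\mu}{1+\lambda\mu}x^{2}$ . The sketch below evaluates the envelope through its proximal point; the function name is hypothetical.

```python
def moreau_envelope_quadratic(x, mu, lam):
    """Moreau envelope of f(y) = mu*y**2/2 at x, evaluated at its proximal point."""
    p = x / (1.0 + lam * mu)                   # prox_{lam f}(x)
    return 0.5 * mu * p * p + (x - p) ** 2 / (2.0 * lam)

# Example: mu = 2, lam = 0.5 gives the modulus mu/(1 + lam*mu) = 1.
value = moreau_envelope_quadratic(3.0, 2.0, 0.5)
```

In this example the envelope at $x=3$ equals $\frac{1}{2}\cdot 1\cdot 3^{2}$ , so the quadratic keeps its shape but its curvature drops from $\mu=2$ to $\frac{\mu}{1+\lambda\mu}=1$ .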

According to the expressions (13) and (14), (IPAHD-SC) becomes with $\theta={\frac{\beta\sqrt{s}+s}{1+2\sqrt{\frac{\mu}{1+\lambda\mu}s}}}$ and $a={\frac{2\sqrt{\frac{\mu}{1+\lambda\mu}s}}{1+2\sqrt{\frac{\mu}{1+\lambda\mu}s}}}$ :

[TABLE]

It is a relaxed inertial proximal algorithm whose coefficients are constant. As a result, its computational burden is equivalent to (actually twice) that of the classical proximal algorithm. A direct application of the conclusions of Theorem 5.1 to $f_{\lambda}$ gives the following statement.

Theorem 5.2

Suppose that $f:{\mathcal{H}}\to{\mathbb{R}}\cup\{+\infty\}$ is a proper, lower semicontinuous and convex function which is $\mu$ -strongly convex for some $\mu>0$ . Take $\lambda>0$ . Suppose that

[TABLE]

Set $q=\displaystyle{\frac{1}{1+\frac{1}{2}\sqrt{\frac{\mu}{1+\lambda\mu}s}}}$ , which satisfies $0<q<1$ . Then, for any sequence $\left({x_{k}}\right)_{k\in{\mathbb{N}}}$ generated by algorithm (IPAHD-NS-SC)

[TABLE]

and

[TABLE]

5.2 Inertial gradient algorithms

Let us start from the continuous dynamic (22) whose linear convergence rate was established in Theorem 4.1. Its explicit time discretization with centered finite differences for speed and acceleration gives

[TABLE]

Equivalently,

[TABLE]

which gives the inertial gradient algorithm with Hessian damping (SC stands for Strongly Convex):

[TABLE]

Let us analyze the linear convergence rate of (IGAHD-SC) .

Theorem 5.3

Let $f:{\mathcal{H}}\to{\mathbb{R}}$ be a $\mathcal{C}^{1}$ and $\mu$ -strongly convex function for some $\mu>0$ , and whose gradient $\nabla f$ is $L$ -Lipschitz continuous. Suppose that

[TABLE]

Set $q=\displaystyle{\frac{1}{1+\frac{1}{2}\sqrt{\mu s}}}$ , which satisfies $0<q<1$ . Then, for any sequence $\left({x_{k}}\right)_{k\in{\mathbb{N}}}$ generated by algorithm (IGAHD-SC) , we have

[TABLE]

Moreover, the gradients converge exponentially fast to zero: setting $\theta=\frac{1}{1+\sqrt{\mu s}}$ which belongs to $]0,1[$ , we have

[TABLE]

Remark 10

(IGAHD-SC) can be seen as an extension of the Nesterov accelerated method for strongly convex functions that corresponds to the particular case $\beta=0$ . Actually, in this very specific case, (IGAHD-SC) is nothing but the (HBF) method with stepsize parameter $a=\frac{s}{1+\sqrt{\mu s}}$ and momentum parameter $b=\frac{1-\sqrt{\mu s}}{1+\sqrt{\mu s}}$ ; see (Polyak2, (2) in Section 3.2). Thus, if $f$ is also of class $\mathcal{C}^{2}$ at $x^{\star}$ , one can obtain linear convergence of the iterates $\left({x_{k}}\right)_{k\in{\mathbb{N}}}$ (but not the objective values) from (Polyak2, Theorem 1) under the assumption that $s<4/L$ (which can be shown to be weaker than (32) since the latter is equivalent for $\beta=0$ to $sL\leq(\sqrt{1-c+c^{2}}-(1-c))^{2}/c\leq 1$ , where $c=\mu/L$ ).

In fact, even for $\beta>0$ , by lifting the problem to the vector $z_{k}=\begin{pmatrix}x_{k}-x^{\star}\\ x_{k-1}-x^{\star}\end{pmatrix}$ as is standard in the (HBF) method, one can write (IGAHD-SC) as

[TABLE]

where $d=\frac{\beta\sqrt{s}}{1+\sqrt{\mu s}}$ . Linear convergence of the iterates $\left({x_{k}}\right)_{k\in{\mathbb{N}}}$ can then be obtained by studying the spectral properties of the above matrix.

For $\beta=0$ , Theorem 5.3 recovers (Siegel, Theorem 3.2), though the author uses a slightly different discretization, requires only $s\leq 1/L$ , and his convergence rate is $\displaystyle{\left({1+\sqrt{\mu s}}\right)}^{-1}$ , which is slightly better than ours for this special case. In the case $\beta>0$ , a result on a scheme related to but different from (IGAHD-SC) can be found in (WRJ, Theorem 3) (their rate is also slightly worse than ours). Our estimates are also new in the literature.
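The coefficients $a$ , $b$ and $d$ of this remark suggest the following sketch of the iteration. It assumes the update $x_{k+1}=x_{k}+b(x_{k}-x_{k-1})-d(\nabla f(x_{k})-\nabla f(x_{k-1}))-a\nabla f(x_{k})$ , which reduces to the heavy ball method when $\beta=0$ and is our reading of the explicit centered discretization; it is not a transcription of the displayed algorithm. The function name, the quadratic test problem and the numerical values are hypothetical.

```python
import math

def igahd_sc_sketch(grad, x0, s, mu, beta=0.0, n_iter=1000):
    """Heavy-ball-type iteration with a gradient-difference correction (assumed form)."""
    root = math.sqrt(mu * s)
    a = s / (1.0 + root)                       # step size on grad f(x_k)
    b = (1.0 - root) / (1.0 + root)            # momentum coefficient
    d = beta * math.sqrt(s) / (1.0 + root)     # weight of the gradient difference
    x_prev, x = list(x0), list(x0)
    g_prev = grad(x_prev)
    for _ in range(n_iter):
        g = grad(x)
        x_next = [xi + b * (xi - xpi) - d * (gi - gpi) - a * gi
                  for xi, xpi, gi, gpi in zip(x, x_prev, g, g_prev)]
        x_prev, x, g_prev = x, x_next, g
    return x

# Test problem: f(x) = 0.5*(x_1**2 + 10*x_2**2), so mu = 1, L = 10, min f = 0.
def f(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad_f(x):
    return [x[0], 10.0 * x[1]]

x_hbf = igahd_sc_sketch(grad_f, [1.0, 1.0], s=0.0028, mu=1.0, beta=0.0)
x_hess = igahd_sc_sketch(grad_f, [1.0, 1.0], s=0.0028, mu=1.0, beta=0.01)
```

The step size and $\beta$ were chosen so that the two bounds on $L$ appearing at the end of the proof of Theorem 5.3 hold, as far as we read them; both runs drive the objective to zero at a linear rate.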

Proof

The proof is based on Lyapunov analysis, and the decrease property at linear rate of the sequence $\left({E_{k}}\right)_{k\in{\mathbb{N}}}$ defined by

[TABLE]

where $x^{\star}$ is the unique minimizer of $f$ , and

[TABLE]

We have $E_{k+1}-E_{k}=f(x_{k+1})-f(x_{k})+\frac{1}{2}\|v_{k+1}\|^{2}-\frac{1}{2}\|v_{k}\|^{2}.$ Using successively the definition of $v_{k}$ and (30), we obtain

[TABLE]

Since $\frac{1}{2}\|v_{k+1}\|^{2}-\frac{1}{2}\|v_{k}\|^{2}=\left\langle v_{k+1}-v_{k},v_{k+1}\right\rangle-\frac{1}{2}\|v_{k+1}-v_{k}\|^{2}$ , we have

[TABLE]

By strong convexity of $f$ and $L$ -Lipschitz continuity of $\nabla f$ we have

[TABLE]

Combining the results above, and after dividing by $\sqrt{s}$ , we get

[TABLE]

Let us make $E_{k}$ appear

[TABLE]

After developing and simplification, we get

[TABLE]

Let us majorize this last term by using the Lipschitz continuity of $\nabla f$

[TABLE]

Therefore

[TABLE]

According to $0\leq\beta\leq\frac{1}{\sqrt{\mu}}$ , we have $\beta-\frac{\beta^{2}\sqrt{\mu}}{2}\geq\frac{\beta}{2}$ , which gives

[TABLE]

Let us use again the strong convexity of $f$ to write

[TABLE]

Combining the two above relations we get

[TABLE]

Let us examine the sign of the above quantities: Under the condition $L\leq\frac{\sqrt{\mu}}{8\beta}$ we have $\sqrt{\mu}\frac{\mu}{4}-2\beta\mu L\geq 0$ . Under the condition $L\leq\frac{\frac{\sqrt{\mu}}{2s}+\frac{\mu}{\sqrt{s}}}{2\beta\mu+\frac{1}{\sqrt{s}}+\frac{\sqrt{\mu}}{2}}$ we have $\frac{\sqrt{\mu}}{2s}+\frac{\mu}{\sqrt{s}}-L\left(2\beta\mu+\frac{1}{\sqrt{s}}+\frac{\sqrt{\mu}}{2}\right)\geq 0$ . Therefore, under the above conditions

[TABLE]

Set $q=\frac{1}{1+\frac{1}{2}\sqrt{\mu s}}$ , which satisfies $0<q<1$ . By a similar argument as in Theorem 5.1

[TABLE]

Since $E_{k}\geq f(x_{k})-f(x^{\star})$ by definition, we finally obtain

[TABLE]

and the linear convergence of $x_{k}$ to $x^{\star}$ and that of the gradients to zero. ∎

6 Numerical results

Here, we illustrate our results on the composite problem on ${\mathcal{H}}={\mathbb{R}}^{n}$ ,

[TABLE]

where $A$ is a linear operator from ${\mathbb{R}}^{n}$ to ${\mathbb{R}}^{m}$ , $m\leq n$ , $g:{\mathbb{R}}^{n}\to{\mathbb{R}}\cup\{+\infty\}$ is a proper lsc convex function which acts as a regularizer. Problem (RLS) is extremely popular in a variety of fields ranging from inverse problems in signal/image processing, to machine learning and statistics. Typical examples of $g$ include the $\ell_{1}$ norm (Lasso), the $\ell_{1}-\ell_{2}$ norm (group Lasso), the total variation, or the nuclear norm (the $\ell_{1}$ norm of the singular values of $x\in{\mathbb{R}}^{N\times N}$ identified with a vector in ${\mathbb{R}}^{n}$ with $n=N^{2}$ ). To avoid trivialities, we assume that the set of minimizers of (RLS) is non-empty.

Though (RLS) is a composite non-smooth problem, it fits perfectly well into our framework. Indeed, the key idea is to appropriately choose the metric. For a symmetric positive definite matrix $S\in{\mathbb{R}}^{n\times n}$ , denote the scalar product in the metric $S$ as $\langle S\cdot,\,\cdot\rangle$ and the corresponding norm as $\left\|{\cdot}\right\|_{S}$ . When $S=I$ , then we simply use the shorthand notation for the Euclidean scalar product $\langle\cdot,\,\cdot\rangle$ and norm $\left\|{\cdot}\right\|$ . For a proper convex lsc function $h$ , we denote $h_{S}$ and $\operatorname{prox}_{h}^{S}$ its Moreau envelope and proximal mapping in the metric $S$ , i.e.

[TABLE]

Similarly, when $S=I$ , we drop $S$ in the above notation.

Let $M=s^{-1}I-A^{*}A$ . With the proviso that $0<s\left\|{A}\right\|^{2}<1$ , $M$ is a symmetric positive definite matrix. It can be easily shown (we provide a proof in Appendix A.2 for completeness; see also the discussion in (chambollereview, Section 4.6)), that the proximal mapping of $f$ as defined in (RLS) in the metric $M$ is

[TABLE]

which is nothing but the forward-backward fixed-point operator for the objective in (RLS). Moreover, $f_{M}$ is a continuously differentiable convex function whose gradient (again in the metric $M$ ) is given by the standard identity

[TABLE]

and $\left\|{\nabla f_{M}(x)-\nabla f_{M}(z)}\right\|_{M}\leq\left\|{x-z}\right\|_{M}$ , i.e. $\nabla f_{M}$ is Lipschitz continuous in the metric $M$ . In addition, a standard argument shows that

[TABLE]

We are then in position to solve (RLS) by simply applying (IGAHD) (see Section 3.2) to $f_{M}$ . We infer from Theorem 3.3 and properties of $f_{M}$ that

[TABLE]
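To make the construction concrete, here is a small sketch of the forward-backward operator and of the resulting gradient of $f_{M}$ for the Lasso. It assumes that (RLS) has the form $\min_{x}\frac{1}{2}\|y-Ax\|^{2}+g(x)$ with $g=\tau\|\cdot\|_{1}$ , that $\operatorname{prox}_{f}^{M}(x)=\operatorname{prox}_{sg}\big(x-sA^{*}(Ax-y)\big)$ , and that the gradient in the metric $M$ is $x-\operatorname{prox}_{f}^{M}(x)$ ; the names `y`, `tau` and all function names are hypothetical, and plain Python lists are used to keep the example self-contained.

```python
def soft_threshold(v, t):
    """Componentwise proximal mapping of t*||.||_1."""
    return [(1.0 if vi > 0 else -1.0) * max(abs(vi) - t, 0.0) for vi in v]

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def rmatvec(A, r):
    """Product A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]

def forward_backward(x, A, y, s, tau):
    """Assumed prox of f in the metric M: prox_{s g}(x - s A^T (A x - y))."""
    residual = [ri - yi for ri, yi in zip(matvec(A, x), y)]
    g = rmatvec(A, residual)
    return soft_threshold([xi - s * gi for xi, gi in zip(x, g)], s * tau)

def grad_envelope(x, A, y, s, tau):
    """Assumed gradient of f_M in the metric M: x - prox_f^M(x)."""
    return [xi - pi for xi, pi in zip(x, forward_backward(x, A, y, s, tau))]

# Small example with s*||A||^2 < 1 (here ||A||^2 is about 1.64 and s = 0.5).
A = [[1.0, 0.5], [0.0, 1.0]]
y, s, tau = [2.0, 0.3], 0.5, 0.5
x = [0.0, 0.0]
for _ in range(300):
    x = forward_backward(x, A, y, s, tau)   # plain fixed-point iteration
residual_norm = max(abs(gi) for gi in grad_envelope(x, A, y, s, tau))
```

The plain fixed-point iteration is used here only to reach a minimizer, at which the gradient of $f_{M}$ vanishes; the accelerated scheme of Section 3.2 would be applied to the same gradient map.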

(IGAHD) and FISTA (i.e. (IGAHD) with $\beta=0$ ) were applied to $f_{M}$ with four instances of $g$ : $\ell_{1}$ norm, $\ell_{1}-\ell_{2}$ norm, the total variation, and the nuclear norm. The results are depicted in Figure 3. One can clearly see that the convergence profiles observed for both algorithms agree with the predicted rate. Moreover, (IGAHD) exhibits, as expected, fewer oscillations than FISTA, and eventually converges faster.

7 Conclusion, Perspectives

As a guideline to our study, the inertial dynamics with Hessian driven damping give rise to a new class of first-order algorithms for convex optimization. While retaining the fast convergence of the function values reminiscent of the Nesterov accelerated algorithm, they benefit from additional favorable properties among which the most important are:

$\bullet$

fast convergence of gradients towards zero;

$\bullet$

global convergence of the iterates to optimal solutions;

$\bullet$

extension to the non-smooth setting;

$\bullet$

acceleration via time scaling factors.

This article contains the core of our study with a particular focus on the gradient and proximal methods. The results thus obtained pave the way to new research avenues. For instance:

$\bullet$

as initiated in Section 6, apply these results to structured composite optimization problems beyond (RLS) and develop corresponding splitting algorithms;

$\bullet$

with the additional gradient estimates, we can expect the restart method to work better with the presence of the Hessian damping term;

$\bullet$

deepen the link between our study and the Newton and Levenberg-Marquardt dynamics and algorithms (e.g., ASv ), and with the Ravine method GZ ;

$\bullet$

the inertial dynamic with Hessian driven damping goes well with tame analysis and Kurdyka-Lojasiewicz property AABR , suggesting that the corresponding algorithms be developed in a non-convex (or even non-smooth) setting.

Appendix A Auxiliary results

A.1 Extended descent lemma

Lemma 1

Let $f:{\mathcal{H}}\to{\mathbb{R}}$ be a convex function whose gradient is $L$ -Lipschitz continuous. Let $s\in]0,1/L]$ . Then for all $(x,y)\in{\mathcal{H}}^{2}$ , we have

[TABLE]

Proof

Denote $y^{+}=y-s\nabla f(y)$ . By the standard descent lemma applied to $y^{+}$ and $y$ , and since $sL\leq 1$ we have

[TABLE]

We now argue by duality between strong convexity and Lipschitz continuity of the gradient of a convex function. Indeed, using Fenchel identity, we have

[TABLE]

$L$ -Lipschitz continuity of the gradient of $f$ is equivalent to $1/L$ -strong convexity of its conjugate $f^{*}$ . This together with the fact that $(\nabla f)^{-1}=\partial f^{*}$ gives for all $(x,y)\in{\mathcal{H}}^{2}$ ,

[TABLE]

Inserting this inequality into the Fenchel identity above yields

[TABLE]

Inserting the last bound into (35) completes the proof.
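The two steps of this proof (the standard descent lemma at $y^{+}$, then the bound coming from $1/L$-strong convexity of $f^{*}$) combine into an inequality that can be sanity-checked numerically. The sketch below assumes the lemma's conclusion is $f(y-s\nabla f(y))\leq f(x)+\langle\nabla f(y),\,y-x\rangle-\frac{s}{2}\|\nabla f(y)\|^{2}-\frac{s}{2}\|\nabla f(x)-\nabla f(y)\|^{2}$, which is what the steps above yield when $s\leq 1/L$. The test function, the matrix `Q` and all helper names are illustrative choices, not taken from the paper.

```python
import random

# Hypothetical test problem (not from the paper): f(x) = 0.5 * <Qx, x> on R^2
# with Q symmetric positive definite, so grad f(x) = Qx is L-Lipschitz with
# L equal to the largest eigenvalue of Q.
Q = [[3.0, 1.0], [1.0, 2.0]]
_TRACE = Q[0][0] + Q[1][1]
_DET = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]
# Largest eigenvalue of a symmetric 2x2 matrix in closed form.
L = 0.5 * (_TRACE + (_TRACE ** 2 - 4.0 * _DET) ** 0.5)


def dot(u, v):
    """Euclidean inner product of two lists."""
    return sum(a * b for a, b in zip(u, v))


def grad_f(x):
    """Gradient of the quadratic: Qx."""
    return [dot(row, x) for row in Q]


def f(x):
    """Quadratic objective 0.5 * <Qx, x>."""
    return 0.5 * dot(grad_f(x), x)


def descent_gap(x, y, s):
    """Right-hand side minus left-hand side of the assumed inequality."""
    gx, gy = grad_f(x), grad_f(y)
    y_plus = [yi - s * gi for yi, gi in zip(y, gy)]   # gradient step from y
    y_minus_x = [yi - xi for yi, xi in zip(y, x)]
    g_diff = [a - b for a, b in zip(gx, gy)]
    rhs = (f(x) + dot(gy, y_minus_x)
           - 0.5 * s * dot(gy, gy)
           - 0.5 * s * dot(g_diff, g_diff))
    return rhs - f(y_plus)


def smallest_gap(n_trials=2000, seed=0):
    """Smallest gap over random (x, y) and random step sizes s in ]0, 1/L]."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(n_trials):
        x = [rng.uniform(-5.0, 5.0) for _ in range(2)]
        y = [rng.uniform(-5.0, 5.0) for _ in range(2)]
        s = rng.uniform(1e-3, 1.0 / L)
        worst = min(worst, descent_gap(x, y, s))
    return worst
```

A nonnegative value of `smallest_gap()` (up to rounding) is consistent with the lemma. It is only a spot check on one quadratic, not a substitute for the proof.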

A.2 Proof of (33)

Proof

We have

[TABLE]

By the Pythagoras relation, we then get

[TABLE]

A.3 Closed-form solutions of (DIN-AVD)$_{\alpha,\beta,b}$ for quadratic functions

We here provide the closed-form solutions to (DIN-AVD)$_{\alpha,\beta,b}$ for the quadratic objective $f:x\in{\mathbb{R}}^{n}\mapsto\langle Ax,\,x\rangle$ , where $A$ is a symmetric positive definite matrix. The case of a positive semidefinite matrix $A$ can be treated similarly by restricting the analysis to $\ker(A)^{\perp}$ . Projecting (DIN-AVD)$_{\alpha,\beta,b}$ onto an orthonormal eigenbasis of $A$ , one has to solve $n$ independent one-dimensional ODEs of the form

[TABLE]

where $\lambda_{i}>0$ is an eigenvalue of $A$ . In the following, we drop the subscript $i$ .
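The decoupling step can be written out as follows. This is a sketch under two assumptions: that (DIN-AVD)$_{\alpha,\beta,b}$ has the form $\ddot{x}+\frac{\alpha}{t}\dot{x}+\beta(t)\nabla^{2}f(x)\dot{x}+b(t)\nabla f(x)=0$, and that the quadratic is normalized as $f(x)=\frac{1}{2}\langle Ax,\,x\rangle$ so that $\nabla f(x)=Ax$. Without the factor $\frac{1}{2}$, each $\lambda_{i}$ below would be replaced by $2\lambda_{i}$. The vectors $u_{i}$ are notation introduced here for an orthonormal eigenbasis of $A$.

```latex
\begin{align*}
% Assumed dynamics for f(x) = (1/2) <Ax, x>, so that \nabla^2 f = A
&\ddot{x}(t) + \frac{\alpha}{t}\,\dot{x}(t) + \beta(t)\,A\,\dot{x}(t) + b(t)\,A\,x(t) = 0, \\
% Spectral decomposition and coordinates in the eigenbasis
&A = \sum_{i=1}^{n} \lambda_i\, u_i u_i^{\top}, \qquad x_i(t) := \langle x(t),\, u_i \rangle, \\
% Taking the inner product of the dynamics with u_i decouples the system
&\ddot{x}_i(t) + \Big(\frac{\alpha}{t} + \beta(t)\lambda_i\Big)\dot{x}_i(t) + \lambda_i\, b(t)\, x_i(t) = 0,
\qquad i = 1,\dots,n .
\end{align*}
```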

Case $\boldsymbol{\beta(t)\equiv\beta,b(t)=b+\gamma/t,\beta\geq 0,b>0,\gamma\geq 0}$ :

The ODE reads

[TABLE]

$\bullet$ If $\beta^{2}\lambda^{2}\neq 4b\lambda$ : set

[TABLE]

Using the relationship between the Whittaker functions and Kummer’s confluent hypergeometric functions $M$ and $U$ (see Bateman), the solution to (36) can be shown to take the form

[TABLE]

where $c_{1}$ and $c_{2}$ are constants determined by the initial conditions.

$\bullet$ If $\beta^{2}\lambda^{2}=4b\lambda$ : set $\zeta=2\sqrt{\lambda\left({\gamma-\alpha\beta/2}\right)}$ . The solution to (36) takes the form

[TABLE]

where $J_{\nu}$ and $Y_{\nu}$ are the Bessel functions of the first and second kind.
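The reduction behind these two cases can be sketched directly, without passing through the Whittaker functions. It rests on assumptions not spelled out in the text above: that (36) reads $\ddot{x}+(\frac{\alpha}{t}+\beta\lambda)\dot{x}+\lambda(b+\frac{\gamma}{t})x=0$, that $\xi=\sqrt{\beta^{2}\lambda^{2}-4b\lambda}$, and that $\kappa=\lambda(\gamma-\alpha\beta/2)/\xi$. The constants $c_{1},c_{2}$ may absorb normalization factors that appear in the paper's displayed formulas.

```latex
\begin{align*}
% Assumed form of (36)
&\ddot{x}(t) + \Big(\frac{\alpha}{t} + \beta\lambda\Big)\dot{x}(t)
   + \lambda\Big(b + \frac{\gamma}{t}\Big)x(t) = 0 . \\
% Substitute x(t) = e^{-pt} u(t) with p a root of p^2 - \beta\lambda p + \lambda b = 0
&x(t) = e^{-pt}u(t), \qquad p = \tfrac{1}{2}(\beta\lambda + \xi)
 \;\Longrightarrow\;
 t\,\ddot{u}(t) + (\alpha - \xi t)\,\dot{u}(t) - \xi\Big(\frac{\alpha}{2} - \kappa\Big)u(t) = 0 . \\
% Case \xi \neq 0: rescale z = \xi t, u(t) = w(z), which gives Kummer's equation
&z\,w''(z) + (\alpha - z)\,w'(z) - \Big(\frac{\alpha}{2} - \kappa\Big)w(z) = 0, \\
&x(t) = e^{-(\beta\lambda+\xi)t/2}
   \Big[c_1\, M\big(\tfrac{\alpha}{2}-\kappa,\ \alpha,\ \xi t\big)
      + c_2\, U\big(\tfrac{\alpha}{2}-\kappa,\ \alpha,\ \xi t\big)\Big] . \\
% Case \xi = 0: p = \beta\lambda/2 and the equation for u is of Bessel type
&t\,\ddot{u}(t) + \alpha\,\dot{u}(t) + \frac{\zeta^{2}}{4}\,u(t) = 0, \\
&x(t) = e^{-\beta\lambda t/2}\; t^{(1-\alpha)/2}
   \Big[c_1\, J_{\alpha-1}\big(\zeta\sqrt{t}\big) + c_2\, Y_{\alpha-1}\big(\zeta\sqrt{t}\big)\Big] .
\end{align*}
```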

When $\beta>0$ , one can clearly see the exponential decrease forced by the Hessian. From the asymptotic expansions of $M$ , $U$ , $J_{\nu}$ and $Y_{\nu}$ for large $t$ , straightforward computations provide the behaviour of $|x(t)|$ for large $t$ as follows:

$\bullet$ If $\beta^{2}\lambda^{2}>4b\lambda$ , we have

[TABLE]

$\bullet$ If $\beta^{2}\lambda^{2}<4b\lambda$ , whence $\xi\in i{\mathbb{R}}^{+}_{*}$ and $\kappa\in i{\mathbb{R}}$ , we have

[TABLE]

$\bullet$ If $\beta^{2}\lambda^{2}=4b\lambda$ , we have

[TABLE]
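The exponential decrease induced by the Hessian term can also be observed numerically. The sketch below integrates the scalar ODE with a classical Runge-Kutta scheme, assuming it has the form $\ddot{x}+(\frac{\alpha}{t}+\beta\lambda)\dot{x}+\lambda(b+\frac{\gamma}{t})x=0$. It performs two checks. First, for the particular choice $\gamma=\alpha(\beta\lambda+\xi)/(2\lambda)$ with $\xi=\sqrt{\beta^{2}\lambda^{2}-4b\lambda}$, direct substitution shows that $x(t)=e^{-(\beta\lambda+\xi)t/2}$ is an exact solution, which gives a reference for the integrator. Second, it compares the amplitude at a large time with and without Hessian damping. Function names and parameter values are illustrative and not from the paper.

```python
import math


def rk4_scalar_ode(lam, alpha, beta, b, gamma, t0, x0, v0, t_end, n_steps=4000):
    """Integrate x'' + (alpha/t + beta*lam) x' + lam*(b + gamma/t) x = 0 by RK4.

    Returns (x(t_end), x'(t_end)). The ODE form is an assumption of this sketch.
    """
    def rhs(t, x, v):
        # First-order system: (x, v)' = (v, acceleration)
        return v, -(alpha / t + beta * lam) * v - lam * (b + gamma / t) * x

    h = (t_end - t0) / n_steps
    t, x, v = t0, x0, v0
    for _ in range(n_steps):
        k1x, k1v = rhs(t, x, v)
        k2x, k2v = rhs(t + h / 2, x + h / 2 * k1x, v + h / 2 * k1v)
        k3x, k3v = rhs(t + h / 2, x + h / 2 * k2x, v + h / 2 * k2v)
        k4x, k4v = rhs(t + h, x + h * k3x, v + h * k3v)
        x += h / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v += h / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
        t += h
    return x, v


def exact_solution_error():
    """Error of RK4 against the exact solution exp(-p t) for a special gamma."""
    lam, alpha, beta, b = 1.0, 3.0, 3.0, 2.0
    xi = math.sqrt(beta ** 2 * lam ** 2 - 4.0 * b * lam)
    p = 0.5 * (beta * lam + xi)
    gamma = alpha * p / lam          # makes exp(-p t) an exact solution
    t0, t_end = 1.0, 3.0
    x, _ = rk4_scalar_ode(lam, alpha, beta, b, gamma, t0,
                          math.exp(-p * t0), -p * math.exp(-p * t0), t_end)
    return abs(x - math.exp(-p * t_end))


def amplitude(beta, t_end=20.0):
    """Size of (x, x') at t_end, starting from x(1) = 1, x'(1) = 0."""
    x, v = rk4_scalar_ode(lam=1.0, alpha=3.0, beta=beta, b=1.0, gamma=0.0,
                          t0=1.0, x0=1.0, v0=0.0, t_end=t_end, n_steps=20000)
    return math.hypot(x, v)
```

Comparing `amplitude(1.0)` with `amplitude(0.0)` contrasts the exponentially damped regime ($\beta>0$) with the purely viscous one ($\beta=0$), where only a polynomial decay in $t$ is expected.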

Case $\boldsymbol{\beta(t)=t^{\beta},b(t)=ct^{\beta-1},\beta\geq 0,c>0}$ :

The ODE now reads

[TABLE]

Let us make the change of variable $t:=\tau^{\frac{1}{\beta+1}}$ and let $y(\tau):=x\left({\tau^{\frac{1}{\beta+1}}}\right)$ . By the chain rule, it is straightforward to show that $y$ obeys the ODE

[TABLE]

It is clear that this is a special case of (36). Since $\beta\geq 0$ and $\lambda>0$ , set

[TABLE]

It follows from the first case above that

[TABLE]

Asymptotic estimates can be derived as in the first case. We omit the details for the sake of brevity.
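The change of variable used in this second case can be checked with the chain rule. The sketch below assumes that the ODE for $\beta(t)=t^{\beta}$, $b(t)=ct^{\beta-1}$ reads $\ddot{x}+(\frac{\alpha}{t}+\lambda t^{\beta})\dot{x}+c\lambda t^{\beta-1}x=0$. The tilde parameters are notation introduced here to match the assumed form $\ddot{y}+(\frac{\tilde{\alpha}}{\tau}+\tilde{\beta}\lambda)\dot{y}+\lambda(\tilde{b}+\frac{\tilde{\gamma}}{\tau})y=0$ of (36), and may be normalized differently in the paper.

```latex
\begin{align*}
% Chain rule with \tau = t^{\beta+1} and x(t) = y(\tau)
&\dot{x}(t) = (\beta+1)\,t^{\beta}\,y'(\tau), \qquad
 \ddot{x}(t) = (\beta+1)^{2}\,t^{2\beta}\,y''(\tau) + \beta(\beta+1)\,t^{\beta-1}\,y'(\tau), \\
% Insert into the assumed ODE and divide by (\beta+1)^2 t^{2\beta}
&y''(\tau) + \Big(\frac{\alpha+\beta}{(\beta+1)\,\tau} + \frac{\lambda}{\beta+1}\Big)y'(\tau)
   + \frac{c\,\lambda}{(\beta+1)^{2}\,\tau}\,y(\tau) = 0, \\
% Identification with the parameters of the first case
&\tilde{\alpha} = \frac{\alpha+\beta}{\beta+1}, \qquad
 \tilde{\beta} = \frac{1}{\beta+1}, \qquad
 \tilde{b} = 0, \qquad
 \tilde{\gamma} = \frac{c}{(\beta+1)^{2}}, \\
% Resulting discriminant: nonzero because \lambda > 0
&\tilde{\xi} = \sqrt{\tilde{\beta}^{2}\lambda^{2} - 4\tilde{b}\lambda} = \frac{\lambda}{\beta+1} \neq 0 .
\end{align*}
```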

Bibliography


(1) F. Álvarez, On the minimizing property of a second-order dissipative system in Hilbert spaces, SIAM J. Control Optim., 38 (2000), No. 4, pp. 1102–1119.
(2) F. Álvarez, H. Attouch, J. Bolte, P. Redont, A second-order gradient-like dissipative dynamical system with Hessian-driven damping. Application to optimization and mechanics, J. Math. Pures Appl., 81 (2002), No. 8, pp. 747–779.
(3) V. Apidopoulos, J.-F. Aujol, Ch. Dossal, Convergence rate of inertial Forward-Backward algorithm beyond Nesterov’s rule, Math. Program. Ser. B., 180 (2020), pp. 137–156.
(4) H. Attouch, A. Cabot, Asymptotic stabilization of inertial gradient dynamics with time-dependent viscosity, J. Differential Equations, 263 (2017), pp. 5412–5458.
(5) H. Attouch, A. Cabot, Convergence rates of inertial forward-backward algorithms, SIAM J. Optim., 28 (2018), No. 1, pp. 849–874.
(6) H. Attouch, A. Cabot, Z. Chbani, H. Riahi, Rate of convergence of inertial gradient dynamics with time-dependent viscous damping coefficient, Evolution Equations and Control Theory, 7 (2018), No. 3, pp. 353–371.
(7) H. Attouch, Z. Chbani, H. Riahi, Fast proximal methods via time scaling of damped inertial dynamics, SIAM J. Optim., 29 (2019), No. 3, pp. 2227–2256.
(8) H. Attouch, Z. Chbani, J. Peypouquet, P. Redont, Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity, Math. Program. Ser. B., 168 (2018), pp. 123–175.