Algorithms of Inertial Mirror Descent in Convex Problems of Stochastic Optimization

Alexander Nazin

arXiv:1705.01073·math.OC·May 3, 2017·Autom. Remote. Control.

TL;DR

This paper introduces an inertial mirror descent method for convex stochastic optimization problems, extending classical mirror descent with a new approach inspired by the heavy ball method, and provides theoretical error bounds.

Contribution

It proposes a novel inertial mirror descent algorithm that does not require averaging, applicable to convex problems, with proven error bounds and a discrete implementation.

Findings

01. Inertial MD generalizes classical mirror descent.

02. The method achieves a proven upper bound on objective function error.

03. Discrete algorithm implementation is provided.

Abstract

The goal is to modify the known method of mirror descent (MD), proposed by A.S. Nemirovsky and D.B. Yudin in 1979. The paper shows the idea of a new, so-called inertial MD method with the example of a deterministic optimization problem in continuous time. In particular, in the Euclidean case, the heavy ball method by B.T. Polyak is realized. It is noted that the new method does not use additional averaging. A discrete algorithm of inertial MD is described. The theorem on the upper bound on the error in the objective function is proved.

Equations

$$\dot{\zeta}(t)=-\nabla f(x(t)),\quad t\geq 0,\quad\zeta(0)=0, \tag{1}$$

$$\mu_{t}\,\dot{x}(t)+x(t)=\nabla W(\zeta(t)). \tag{2}$$

$$V(x)=\sup_{\zeta\in\mathbb{R}^{N}}\left\{\langle\zeta,x\rangle-W(\zeta)\right\}. \tag{3}$$

$$\dot{x}(t)=-\nabla f(x(t)),\quad t\geq 0.$$

$$\mu\,\ddot{x}(t)+\dot{x}(t)=-\nabla f(x(t)),\quad t\geq 0.$$

$$W_{*}(\zeta)=W(\zeta)-\langle\zeta,x^{*}\rangle,\quad\zeta\in\mathbb{R}^{N},$$

$$\begin{aligned}\frac{d}{dt}W_{*}(\zeta(t))&=\langle\dot{\zeta}(t),\nabla W(\zeta(t))-x^{*}\rangle=-\langle\nabla f(x(t)),\mu_{t}\,\dot{x}(t)+x(t)-x^{*}\rangle\\&\leq-\mu_{t}\frac{d}{dt}[f(x(t))-f^{*}]-[f(x(t))-f^{*}],\end{aligned}$$

$$\int_{0}^{t}f(x(t))\,dt-tf^{*}\leq-W_{*}(\zeta(t))-\mu_{t}[f(x(t))-f^{*}]\Big|_{0}^{t}+\int_{0}^{t}[f(x(t))-f^{*}]\,\dot{\mu}_{t}\,dt, \tag{7}$$

$$\int_{0}^{t}f(x(t))\,dt-tf^{*}\leq V(x^{*})-\mu_{t}[f(x(t))-f^{*}]\Big|_{0}^{t}+\int_{0}^{t}[f(x(t))-f^{*}]\,\dot{\mu}_{t}\,dt. \tag{8}$$

$$\mu_{0}=0,\quad\dot{\mu}_{t}\leq 1\quad\forall t>0, \tag{9}$$

$$f(x(t))-f^{*}\leq V(x^{*})/\mu_{t}.$$

$$\mu_{t}=t,\quad t\geq 0.$$

$$\dot{\zeta}(t)=-\nabla f(x(t)),\quad t\geq 0,\quad\zeta(0)=0, \tag{11}$$

$$t\,\dot{x}(t)+x(t)=\nabla W(\zeta(t)) \tag{12}$$

$$f(x(t))-f^{*}\leq V(x^{*})\,t^{-1},\quad\forall t>0. \tag{13}$$

$$f(x)\triangleq\mathbb{E}\,Q(x,Z)\to\min_{x\in X}, \tag{14}$$

$$u_{k}(x)=\nabla_{x}Q(x,Z_{k}),\quad k=1,2,\dots, \tag{15}$$

$$\mathbb{E}\,u_{k}(x)\in\partial f(x).$$

$$\|\nabla W_{\beta}(\zeta)-\nabla W_{\beta}(\tilde{\zeta})\|\leq\frac{1}{\alpha\beta}\|\zeta-\tilde{\zeta}\|_{*},\quad\forall\zeta,\tilde{\zeta}\in E^{*},\ \beta>0,$$

$$\tau_{t}=\tau_{t-1}+\gamma_{t},\quad\tau_{0}=0, \tag{16}$$

$$\zeta_{t}=\zeta_{t-1}+\gamma_{t}\,u_{t}(x_{t-1}),\quad\zeta_{0}=0, \tag{17}$$

$$\tau_{t}\,\frac{x_{t}-x_{t-1}}{\gamma_{t+1}}+x_{t}=-\nabla W_{\beta_{t}}(\zeta_{t}),\quad x_{0}=-\nabla W_{\beta_{0}}(\zeta_{0}). \tag{18}$$

$$W_{\beta}(\zeta)=\sup_{x\in X}\left\{-\zeta^{T}x-\beta V(x)\right\},\quad\zeta\in E^{*}, \tag{19}$$

$$x_{t}=\frac{\tau_{t}}{\tau_{t}+\gamma_{t+1}}\,x_{t-1}-\frac{\gamma_{t+1}}{\tau_{t}+\gamma_{t+1}}\,\nabla W_{\beta_{t}}(\zeta_{t}). \tag{20}$$

$$\gamma_{i}\equiv 1,\quad\beta_{i}=\beta_{0}\sqrt{i+1},\quad i=1,2,\dots,\quad\beta_{0}>0. \tag{21}$$

$$\zeta_{t}=\zeta_{t-1}+u_{t}(x_{t-1}),\quad\zeta_{0}=0, \tag{22}$$

$$x_{t}=x_{t-1}-\frac{1}{t+1}\left(x_{t-1}+\nabla W_{\beta_{t}}(\zeta_{t})\right),\quad x_{0}=-\nabla W_{\beta_{0}}(\zeta_{0}). \tag{23}$$

$$\sup_{x\in X}\mathbb{E}\,\|\nabla_{x}Q(x,Z)\|_{*}^{2}\leq L_{X,Q}^{2},$$

$$\mathbb{E}f(x_{t})-\min_{x\in X}f(x)\leq\left(\beta_{0}V(x^{*})+\frac{L_{X,Q}^{2}}{\alpha\beta_{0}}\right)\frac{\sqrt{t+2}}{t+1}.$$

$$\mathbb{E}f(x_{t})-\min_{x\in X}f(x)\leq 2L_{X,Q}\left(\alpha^{-1}\overline{V}\right)^{1/2}\frac{\sqrt{t+2}}{t+1}.$$

Full text

Workshop “Optimization and Statistical Learning”

April 10–14, 2017, Les Houches, France

**Algorithms of Inertial Mirror Descent in Convex Problems of Stochastic Optimization**

The full paper has been accepted by the Russian journal Avtomatika i Telemekhanika, which is published in English translation as Automation and Remote Control.

Alexander Nazin

(April 12, 2017)

ICS RAS, Moscow, Russia

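1. The idea of the inertial mirror descent method

Let $f:\mathbb{R}^{N}\to\mathbb{R}$ be a convex, differentiable function with a unique minimum point $x^{*}\in\mathrm{Argmin}f(x)$ and minimal value $f^{*}=f(x^{*})$. Consider the following continuous-time algorithm, which extends the mirror descent method (MDM):

$$\dot{\zeta}(t)=-\nabla f(x(t)),\quad t\geq 0,\quad\zeta(0)=0, \tag{1}$$

$$\mu_{t}\,\dot{x}(t)+x(t)=\nabla W(\zeta(t)). \tag{2}$$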

The functional parameter in (2) is a convex, continuously differentiable function $W:\mathbb{R}^{N}\to\mathbb{R}_{+}$ with conjugate function

$$V(x)=\sup_{\zeta\in\mathbb{R}^{N}}\left\{\langle\zeta,x\rangle-W(\zeta)\right\}. \tag{3}$$

For simplicity, let $W(0)=0$, $V(0)=0$, and $\nabla W(0)=0$.

Remark 1

With $\mu_{t}\equiv 0$ in (2), algorithm (1)–(2) is the MDM in continuous time [1]. In particular, the identity map $\nabla W(\zeta)\equiv\zeta$ together with $\mu_{t}\equiv 0$ gives the standard gradient method

$$\dot{x}(t)=-\nabla f(x(t)),\quad t\geq 0.$$

With $\mu_{t}\equiv\mu>0$ and $\nabla W(\zeta)\equiv\zeta$, algorithm (1)–(2) gives the continuous heavy ball method (MHB) [9]

$$\mu\,\ddot{x}(t)+\dot{x}(t)=-\nabla f(x(t)),\quad t\geq 0.$$

$\square$
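The heavy ball reduction in Remark 1 can be checked in three lines. The derivation below is a sketch that assumes $x(\cdot)$ is twice differentiable, which the text does not state explicitly.

```latex
\begin{aligned}
\mu\,\dot{x}(t) + x(t) &= \zeta(t)
  && \text{(2) with } \nabla W(\zeta)\equiv\zeta,\ \mu_t\equiv\mu, \\
\mu\,\ddot{x}(t) + \dot{x}(t) &= \dot{\zeta}(t)
  && \text{differentiating in } t, \\
&= -\nabla f(x(t))
  && \text{by (1)}.
\end{aligned}
```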

In what follows we assume that the parameter $\mu_{t}\geq 0$ is differentiable in $t$, and we call method (1)–(2) the Method of Inertial Mirror Descent (MIDM).

Assume that a solution $\{x(t),\ t\geq 0\}$ of the system of equations (1)–(2) exists.

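Consider the function

$$W_{*}(\zeta)=W(\zeta)-\langle\zeta,x^{*}\rangle,\quad\zeta\in\mathbb{R}^{N},$$

as a candidate Lyapunov function.

Its derivative along the trajectories of system (1)–(2) is

$$\begin{aligned}\frac{d}{dt}W_{*}(\zeta(t))&=\langle\dot{\zeta}(t),\nabla W(\zeta(t))-x^{*}\rangle=-\langle\nabla f(x(t)),\mu_{t}\,\dot{x}(t)+x(t)-x^{*}\rangle\\&=-\mu_{t}\frac{d}{dt}[f(x(t))-f^{*}]-\langle\nabla f(x(t)),x(t)-x^{*}\rangle\\&\leq-\mu_{t}\frac{d}{dt}[f(x(t))-f^{*}]-[f(x(t))-f^{*}],\end{aligned}$$

where the last inequality follows from the convexity of $f(\cdot)$. Integrating over the interval $[0,t]$ and using $W_{*}(0)=0$, we obtain

$$\int_{0}^{t}f(x(t))\,dt-tf^{*}\leq-W_{*}(\zeta(t))-\mu_{t}[f(x(t))-f^{*}]\Big|_{0}^{t}+\int_{0}^{t}[f(x(t))-f^{*}]\,\dot{\mu}_{t}\,dt, \tag{7}$$

where the last two terms on the right-hand side are obtained by integration by parts. By (3), $-W_{*}(\zeta)=\langle\zeta,x^{*}\rangle-W(\zeta)\leq V(x^{*})$ for every $\zeta$, so we can continue (7):

$$\int_{0}^{t}f(x(t))\,dt-tf^{*}\leq V(x^{*})-\mu_{t}[f(x(t))-f^{*}]\Big|_{0}^{t}+\int_{0}^{t}[f(x(t))-f^{*}]\,\dot{\mu}_{t}\,dt. \tag{8}$$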

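It is therefore reasonable to impose the following constraints on the parameter $\mu_{t}\geq 0$:

$$\mu_{0}=0,\quad\dot{\mu}_{t}\leq 1\quad\forall t>0. \tag{9}$$

Under (9) the boundary term at $0$ vanishes and $\int_{0}^{t}[f(x(t))-f^{*}](1-\dot{\mu}_{t})\,dt\geq 0$, which leads to the inequality

$$f(x(t))-f^{*}\leq V(x^{*})/\mu_{t}.$$

Maximizing $\mu_{t}$ under constraints (9), we get

$$\mu_{t}=t,\quad t\geq 0.$$

The corresponding continuous IMD algorithm

$$\dot{\zeta}(t)=-\nabla f(x(t)),\quad t\geq 0,\quad\zeta(0)=0, \tag{11}$$

$$t\,\dot{x}(t)+x(t)=\nabla W(\zeta(t)) \tag{12}$$

satisfies the upper bound

$$f(x(t))-f^{*}\leq V(x^{*})\,t^{-1},\quad\forall t>0. \tag{13}$$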
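The $1/t$ bound can be checked numerically in the Euclidean case. The sketch below is only an illustration under assumptions not in the paper: $f(x)=\tfrac12\|x-c\|^{2}$ (so $x^{*}=c$, $f^{*}=0$), $W(\zeta)=\tfrac12\|\zeta\|^{2}$ (so $\nabla W(\zeta)=\zeta$ and $V(x)=\tfrac12\|x\|^{2}$), and a plain explicit Euler scheme. It uses the fact that $t\dot{x}+x=\zeta$ is equivalent to $\frac{d}{dt}(t\,x)=\zeta$, which avoids dividing by $t$ at $t=0$. The name `simulate_imd` is hypothetical.

```python
def simulate_imd(c, T=50.0, h=1e-3):
    """Explicit Euler for zeta' = -grad f(x), (t*x)' = zeta, with f(x) = 0.5*||x - c||^2."""
    n = len(c)
    zeta = [0.0] * n  # dual trajectory, zeta(0) = 0
    S = [0.0] * n     # S(t) = t * x(t), i.e. the integral of zeta over [0, t]
    x = [0.0] * n     # x(0) = grad W(zeta(0)) = 0
    v_star = 0.5 * sum(ci * ci for ci in c)  # V(x*) for x* = c
    t, gap, worst_ratio = 0.0, v_star, 0.0
    for _ in range(int(round(T / h))):
        grad = [xi - ci for xi, ci in zip(x, c)]        # grad f at the current x
        S = [si + h * zi for si, zi in zip(S, zeta)]    # uses the old zeta
        zeta = [zi - h * gi for zi, gi in zip(zeta, grad)]
        t += h
        x = [si / t for si in S]
        gap = 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))  # f(x(t)) - f*
        # The continuous-time bound says t * gap / V(x*) should stay below 1.
        worst_ratio = max(worst_ratio, t * gap / v_star)
    return x, gap, worst_ratio


if __name__ == "__main__":
    x_T, gap_T, ratio = simulate_imd([1.0, -2.0])
    print("x(T) =", x_T)
    print("f(x(T)) - f* =", gap_T)
    print("max over t of t*(f - f*)/V(x*) =", ratio)
```

The discretization error of the Euler scheme is not covered by the continuous-time argument, so the printed ratio is evidence for the bound rather than a proof of it.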

2. Stochastic optimization problem

Consider the minimization problem

$$f(x)\triangleq\mathbb{E}\,Q(x,Z)\to\min_{x\in X}, \tag{14}$$

where the loss function $Q:X\times\mathcal{Z}\to\mathbb{R}_{+}$ depends on a random variable $Z$ with unknown distribution on the space $\mathcal{Z}$, $\mathbb{E}$ denotes mathematical expectation, the set $X\subset\mathbb{R}^{N}$ is a given convex compact set in $N$-dimensional space, and the random function $Q(\cdot\,,Z):X\to\mathbb{R}_{+}$ is convex a.s. on $X$.

Let an i.i.d. sample $(Z_{1},\dots,Z_{t-1})$ be given, where all $Z_{i}$ have the same distribution on $\mathcal{Z}$ as $Z$. Introduce the notation for stochastic subgradients

$$u_{k}(x)=\nabla_{x}Q(x,Z_{k}),\quad k=1,2,\dots, \tag{15}$$

such that, for all $x\in X$,

$$\mathbb{E}\,u_{k}(x)\in\partial f(x).$$

The goal is to construct and analyze novel recursive MD algorithms that minimize (14) using the stochastic subgradients $u_{t}(x_{t-1})$ from (15) at the current points $x=x_{t-1}\in X$, $t=1,2,\dots$.
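As a concrete instance of the stochastic subgradient oracle, the sketch below takes a scalar loss $Q(x,Z)=\tfrac12(x-Z)^{2}$ with $Z$ uniform on $[0,1]$. Both choices are assumptions made for illustration only. Then $f(x)=\mathbb{E}\,Q(x,Z)$ has gradient $x-\mathbb{E}Z=x-0.5$, and a Monte Carlo average of $u_{k}(x)$ should approach it. The names `Q`, `u`, and `mean_subgradient` are hypothetical.

```python
import random


def Q(x, z):
    """Illustrative loss: Q(x, z) = 0.5 * (x - z)^2, convex in x for every z."""
    return 0.5 * (x - z) ** 2


def u(x, z):
    """Stochastic gradient of Q with respect to x at the sample z."""
    return x - z


def mean_subgradient(x, n_samples=20000, seed=0):
    """Monte Carlo estimate of E u(x, Z) for Z ~ Uniform[0, 1]."""
    rng = random.Random(seed)
    total = sum(u(x, rng.uniform(0.0, 1.0)) for _ in range(n_samples))
    return total / n_samples


if __name__ == "__main__":
    x = 2.0
    print("Monte Carlo estimate:", mean_subgradient(x))
    print("exact gradient of f :", x - 0.5)
```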

3. Algorithm IMD. Main results.

Let $\|\cdot\|$ be a norm on the primal space $E=\mathbb{R}^{N}$, and let $\|\cdot\|_{*}$ be the corresponding norm on the dual space $E^{*}=\mathbb{R}^{N}$; the set $X\subset E$ is convex and compact.

Assumption (L). The convex function $V:X\to\mathbb{R}_{+}$ is such that its $\beta$-conjugate $W_{\beta}$ is continuously differentiable on $E^{*}$, with gradient $\nabla W_{\beta}$ satisfying the Lipschitz condition

$$\|\nabla W_{\beta}(\zeta)-\nabla W_{\beta}(\tilde{\zeta})\|\leq\frac{1}{\alpha\beta}\|\zeta-\tilde{\zeta}\|_{*},\quad\forall\zeta,\tilde{\zeta}\in E^{*},\ \beta>0,$$

where $\alpha$ is a positive constant independent of $\beta$.

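Now consider discrete time $t\in\mathbb{Z}_{+}$. We write a discrete version of the IMD algorithm (11)–(12), using the stochastic subgradients (15) in place of the gradients $\nabla f(\cdot)$:

$$\tau_{t}=\tau_{t-1}+\gamma_{t},\quad\tau_{0}=0, \tag{16}$$

$$\zeta_{t}=\zeta_{t-1}+\gamma_{t}\,u_{t}(x_{t-1}),\quad\zeta_{0}=0, \tag{17}$$

$$\tau_{t}\,\frac{x_{t}-x_{t-1}}{\gamma_{t+1}}+x_{t}=-\nabla W_{\beta_{t}}(\zeta_{t}),\quad x_{0}=-\nabla W_{\beta_{0}}(\zeta_{0}). \tag{18}$$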

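Here the function $W_{\beta}$ is defined by the proxy function $V:X\to\mathbb{R}_{+}$ via the Legendre–Fenchel transformation, i.e.

$$W_{\beta}(\zeta)=\sup_{x\in X}\left\{-\zeta^{T}x-\beta V(x)\right\},\quad\zeta\in E^{*}. \tag{19}$$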
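A standard example with a closed-form $W_{\beta}$ is the entropic proxy function on the probability simplex. The sketch below assumes $X$ is the simplex in $\mathbb{R}^{N}$ and $V(x)=\ln N+\sum_{j}x_{j}\ln x_{j}$, neither of which is prescribed by the text. In that case $W_{\beta}(\zeta)=\beta\ln\big(\tfrac1N\sum_{j}e^{-\zeta_{j}/\beta}\big)$ and $-\nabla W_{\beta}(\zeta)$ is the softmax of $-\zeta/\beta$, which always lies in $X$. The function names are hypothetical.

```python
import math


def W_beta(zeta, beta):
    """W_beta(zeta) = beta * log((1/N) * sum_j exp(-zeta_j / beta)), entropic proxy on the simplex."""
    n = len(zeta)
    m = max(-z / beta for z in zeta)  # shift exponents for numerical stability
    return beta * (m + math.log(sum(math.exp(-z / beta - m) for z in zeta) / n))


def minus_grad_W_beta(zeta, beta):
    """-grad W_beta(zeta): the softmax of -zeta / beta, a point of the simplex."""
    m = max(-z / beta for z in zeta)
    w = [math.exp(-z / beta - m) for z in zeta]
    s = sum(w)
    return [wi / s for wi in w]


def finite_difference_grad(zeta, beta, eps=1e-6):
    """Central-difference approximation of grad W_beta, used only as a sanity check."""
    g = []
    for j in range(len(zeta)):
        up = list(zeta); up[j] += eps
        dn = list(zeta); dn[j] -= eps
        g.append((W_beta(up, beta) - W_beta(dn, beta)) / (2 * eps))
    return g


if __name__ == "__main__":
    zeta, beta = [0.3, -1.2, 2.0], 0.7
    print("-grad W_beta      :", minus_grad_W_beta(zeta, beta))
    print("finite differences:", [-g for g in finite_difference_grad(zeta, beta)])
```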

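Remark 2

Equation (18) may be written as

$$x_{t}=\frac{\tau_{t}}{\tau_{t}+\gamma_{t+1}}\,x_{t-1}-\frac{\gamma_{t+1}}{\tau_{t}+\gamma_{t+1}}\,\nabla W_{\beta_{t}}(\zeta_{t}). \tag{20}$$

Since the vectors $-\nabla W_{\beta_{t}}(\zeta_{t})\in X$ for each $t\geq 0$, each $x_{t}$ is a convex combination of two points of the convex set $X$, so equations (16)–(17) show by induction that $x_{t}\in X$. $\Box$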

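Further, let the sequences $(\gamma_{i})_{i\geq 1}$ and $(\beta_{i})_{i\geq 1}$ be of the form

$$\gamma_{i}\equiv 1,\quad\beta_{i}=\beta_{0}\sqrt{i+1},\quad i=1,2,\dots,\quad\beta_{0}>0. \tag{21}$$

Then the system of equations (16)–(18) leads to the IMD algorithm:

$$\zeta_{t}=\zeta_{t-1}+u_{t}(x_{t-1}),\quad\zeta_{0}=0, \tag{22}$$

$$x_{t}=x_{t-1}-\frac{1}{t+1}\left(x_{t-1}+\nabla W_{\beta_{t}}(\zeta_{t})\right),\quad x_{0}=-\nabla W_{\beta_{0}}(\zeta_{0}). \tag{23}$$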
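A minimal sketch of the discrete recursion follows, under several assumptions that are not taken from the paper: $\gamma_{i}\equiv 1$ (so $\tau_{t}=t$) and $\beta_{t}=\beta_{0}\sqrt{t+1}$, the usual growth rate for this family of methods; $X=[-1,1]^{N}$ with the Euclidean proxy $V(x)=\tfrac12\|x\|^{2}$, for which $-\nabla W_{\beta}(\zeta)$ is $-\zeta/\beta$ clipped to the box; and the loss $Q(x,Z)=\tfrac12\|x-Z\|^{2}$ with $Z=c+$ uniform noise. The names `minus_grad_W` and `imd` are hypothetical.

```python
import math
import random


def minus_grad_W(zeta, beta):
    """-grad W_beta(zeta) for V(x) = 0.5*||x||^2 on the box [-1, 1]^N: clip(-zeta / beta)."""
    return [max(-1.0, min(1.0, -z / beta)) for z in zeta]


def imd(c, beta0, T, noise=0.2, seed=0):
    """Inertial mirror descent sketch with gamma_i = 1 and beta_t = beta0 * sqrt(t + 1)."""
    rng = random.Random(seed)
    n = len(c)
    zeta = [0.0] * n                 # zeta_0 = 0
    x = minus_grad_W(zeta, beta0)    # x_0 = -grad W_{beta_0}(zeta_0)
    for t in range(1, T + 1):
        sample = [ci + rng.uniform(-noise, noise) for ci in c]  # Z_t with E Z_t = c
        u = [xi - si for xi, si in zip(x, sample)]              # grad_x Q(x_{t-1}, Z_t)
        zeta = [zi + ui for zi, ui in zip(zeta, u)]             # accumulate subgradients
        y = minus_grad_W(zeta, beta0 * math.sqrt(t + 1))
        # Convex combination of x_{t-1} and y with weights t/(t+1) and 1/(t+1).
        x = [(t * xi + yi) / (t + 1) for xi, yi in zip(x, y)]
    return x


if __name__ == "__main__":
    c = [0.5, -0.3]
    x_T = imd(c, beta0=math.sqrt(8.0), T=2000)
    gap = 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x_T, c))
    print("x_T =", x_T, " f(x_T) - f* =", gap)
```

Because every update is a convex combination of points of the box, the iterates stay feasible without any explicit projection of $x_{t}$ itself.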

Theorem 1

Let $X$ be a convex closed set in $\mathbb{R}^{N}$, and let the loss function $Q(\cdot,\cdot)$ satisfy the conditions of Section 2 and, moreover,

$$\sup_{x\in X}\mathbb{E}\,\|\nabla_{x}Q(x,Z)\|_{*}^{2}\leq L_{X,Q}^{2},$$

where the constant $L_{X,Q}\in(0,\infty)$. Let $V$ be a proxy function on $X$ with parameter $\alpha>0$ from Assumption (L), and let there exist a minimum point $x^{*}\in\displaystyle\mathop{\mathrm{Arg}\!\min}_{x\in X}f(x)$. Then for any $t\geq 1$ the estimate $x_{t}$ defined by algorithm (22), (23) with stochastic subgradients (15) and the sequence $(\beta_{i})_{i\geq 1}$ from (21) with arbitrary $\beta_{0}>0$ satisfies the inequality

$$\mathbb{E}f(x_{t})-\min_{x\in X}f(x)\leq\left(\beta_{0}V(x^{*})+\frac{L_{X,Q}^{2}}{\alpha\beta_{0}}\right)\frac{\sqrt{t+2}}{t+1}.$$

$\Box$

Corollary 1

If, under the assumptions of Theorem 1, the constant $\overline{V}$ is such that $V(x^{*})\leq\overline{V}$ and $\beta_{0}=L_{X,Q}\,(\alpha\,\overline{V})^{-1/2}$, then

$$\mathbb{E}f(x_{t})-\min_{x\in X}f(x)\leq 2L_{X,Q}\left(\alpha^{-1}\overline{V}\right)^{1/2}\frac{\sqrt{t+2}}{t+1}.$$

In particular, one may take $\overline{V}=\displaystyle\max_{x\in X}V(x)$. $\Box$
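The choice of $\beta_{0}$ in Corollary 1 is the minimizer of the prefactor $\beta_{0}\overline{V}+L_{X,Q}^{2}/(\alpha\beta_{0})$ from Theorem 1. The short check below illustrates this arithmetic; the numerical values of $\overline{V}$, $L_{X,Q}$ and $\alpha$ are arbitrary assumptions, and the function names are hypothetical.

```python
import math


def bound_factor(beta0, V_bar, L, alpha):
    """Prefactor beta0 * V_bar + L^2 / (alpha * beta0) from the bound of Theorem 1."""
    return beta0 * V_bar + L ** 2 / (alpha * beta0)


def optimal_beta0(V_bar, L, alpha):
    """The choice of Corollary 1: beta0 = L * (alpha * V_bar)^(-1/2)."""
    return L / math.sqrt(alpha * V_bar)


if __name__ == "__main__":
    V_bar, L, alpha = 1.5, 3.0, 0.5   # arbitrary illustrative constants
    b = optimal_beta0(V_bar, L, alpha)
    print("beta0 from Corollary 1 :", b)
    print("prefactor at this beta0:", bound_factor(b, V_bar, L, alpha))
    print("2 * L * sqrt(V_bar / a):", 2 * L * math.sqrt(V_bar / alpha))
```

Both terms of the prefactor are equal at this $\beta_{0}$, which is why the sum collapses to $2L_{X,Q}(\alpha^{-1}\overline{V})^{1/2}$.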

Bibliography

[1] A.S. Nemirovskii, D.B. Yudin. Problem Complexity and Method Efficiency in Optimization. Chichester: Wiley, 1983.
[2] A. Ben-Tal, T. Margalit, A. Nemirovski. The ordered subsets mirror descent optimization method with applications to tomography. SIOPT 12(1), 79–108, 2001.
[3] A. Beck, M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31(3), 167–175, 2003.
[4] Yu. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 2007. DOI: 10.1007/s10107-007-0149-x.
[5] A.B. Juditsky, A.V. Nazin, A.B. Tsybakov, N. Vayatis. Recursive aggregation of estimators by the mirror descent algorithm with averaging. Problems of Information Transmission 41(4), 368–384, 2005.
[6] A. Nemirovski, A. Juditsky, G. Lan, A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609, 2009.
[7] R.T. Rockafellar, R.J.B. Wets. Variational Analysis. New York: Springer, 1998.
[8] B.T. Polyak. Some methods of speeding up the convergence of iteration methods. Zh. Vych. Mat. 4(5), 791–803, 1964.