Adaptive Mirror Descent for Constrained Optimization

Anastasia Bayandina

arXiv:1705.02029·math.OC·May 8, 2017

TL;DR

This paper introduces an adaptive Mirror Descent method for constrained convex optimization that improves convergence rates and can generate dual solutions, especially effective for strongly convex problems.

Contribution

The paper proposes an adaptive stepsize Mirror Descent algorithm with enhanced convergence rates and dual solution generation capabilities for constrained convex optimization.

Findings

01. Improved convergence rate over fixed-stepsize MD

02. Method generates dual solutions for certain constraints

03. Effective restart technique for strongly convex problems

Abstract

This paper addresses how to solve non-smooth convex and strongly convex optimization problems with functional constraints. The introduced Mirror Descent (MD) method with adaptive stepsizes is shown to have a better convergence rate than MD with fixed stepsizes due to the improved constant. For certain types of constraints, the method is proved to generate a dual solution. For the strongly convex case, the restart technique is applied.

Equations

$$\lVert\xi\rVert_{*}=\max_{x}\big\{\langle\xi,x\rangle:\lVert x\rVert\leq 1\big\}.$$

$$\forall x,y\in\mathcal{X}\ \ \exists\,\nabla f(x):\quad f(y)\geq f(x)+\langle\nabla f(x),y-x\rangle,\qquad\lVert\nabla f(x)\rVert_{*}<\infty$$

$$f(x)\to\min_{x\in\mathcal{X}},\tag{1}$$

$$\text{s.t. } g(x)\leq 0.\tag{2}$$

$$\forall x,y\in\mathcal{X}\quad\langle d'(x)-d'(y),x-y\rangle\geq\lVert x-y\rVert^{2},$$

$$\min_{x\in\mathcal{X}}d(x)=d(0).$$

$$d(x_{*})\leq\Theta_{0}^{2}.$$

$$\min_{x_{*}\in\mathcal{X}_{*}}d(x_{*})\leq\Theta_{0}^{2}.$$

$$V(x,y)=d(y)-d(x)-\langle d'(x),y-x\rangle.$$

$$\mathrm{Mirr}_{x}(y)=\arg\min_{u\in\mathcal{X}}\big\{\langle y,u\rangle+V(x,u)\big\}.$$

$$x^{i+1}=\mathrm{Mirr}_{x^{i}}\big(h_{i}\nabla f(x^{i})\big).$$

$$h_{i}\big(f(x^{i})-f(x)\big)\leq\frac{h_{i}^{2}}{2}\lVert\nabla f(x^{i})\rVert_{*}^{2}+V(x^{i},x)-V(x^{i+1},x).\tag{3}$$

$$f(\bar{x}^{N})-f(x_{*})\leq\varepsilon,\qquad g(\bar{x}^{N})\leq\varepsilon$$

$$N=\Big\lceil\frac{2M^{2}\Theta_{0}^{2}}{\varepsilon^{2}}\Big\rceil,$$

$$\frac{N}{M^{2}}=\sum_{i\in[N]}\frac{1}{M_{i}^{2}}.$$

$$\sum_{i\in I}h_{i}\big(f(\bar{x}^{N})-f(x_{*})\big)\leq\sum_{i\in I}h_{i}\big(f(x^{i})-f(x_{*})\big).$$

$$\sum_{i\in I}h_{i}\big(f(x^{i})-f(x_{*})\big)+\sum_{i\in J}h_{i}\big(g(x^{i})-g(x_{*})\big)\leq\sum_{i\in I}\frac{h_{i}^{2}M_{i}^{2}}{2}+\sum_{i\in J}\frac{h_{i}^{2}M_{i}^{2}}{2}+\sum_{i\in[N]}\big(V(x^{i},x_{*})-V(x^{i+1},x_{*})\big)\leq\frac{\varepsilon}{2}\sum_{i\in[N]}h_{i}+\Theta_{0}^{2}.$$

$$g(x^{i})-g(x_{*})\geq g(x^{i})>\varepsilon,$$

$$\sum_{i\in I}h_{i}\big(f(\bar{x}^{N})-f(x_{*})\big)<\frac{\varepsilon}{2}\sum_{i\in[N]}h_{i}-\varepsilon\sum_{i\in J}h_{i}+\Theta_{0}^{2}=\varepsilon\sum_{i\in I}h_{i}-\frac{\varepsilon^{2}}{2}\sum_{i\in[N]}\frac{1}{M_{i}^{2}}+\Theta_{0}^{2}\leq\varepsilon\sum_{i\in I}h_{i}.$$

$$\sum_{i\in I}h_{i}\,g(\bar{x}^{N})\leq\sum_{i\in I}h_{i}\,g(x^{i})\leq\varepsilon\sum_{i\in I}h_{i}.$$

$$\forall x,y\in\mathcal{X}\quad f(y)\geq f(x)+\langle\nabla f(x),y-x\rangle+\frac{\mu}{2}\lVert y-x\rVert^{2},$$

$$\forall x\in\mathcal{X}:\lVert x\rVert\leq 1\quad d(x)\leq\frac{\omega_{n}}{2},$$

$$\lVert x_{0}-x_{*}\rVert^{2}\leq R_{0}^{2},\quad\text{where } x_{0}=\arg\min_{x\in\mathcal{X}}d(x).$$

$$f(x)-f(x_{*})\leq\varepsilon,\qquad g(x)\leq\varepsilon,$$

$$\frac{\mu}{2}\lVert x-x_{*}\rVert^{2}\leq\varepsilon.$$

$$f(x_{K})-f(x_{*})\leq\varepsilon,\qquad g(x_{K})\leq\varepsilon$$

$$N=N_{1}+\dots+N_{K}=\Big\lceil\frac{4M^{2}\omega_{n}}{\mu\varepsilon}\Big\rceil,$$

$$\frac{N}{M^{2}}=\sum_{i\in[N_{1}]}\frac{1}{M_{i}^{2}}+\dots+\sum_{i\in[N_{K}]}\frac{1}{M_{i}^{2}}.$$

Full text

Adaptive Mirror Descent

for Constrained Optimization

Anastasia Bayandina

Moscow Institute of Physics and Technology

Moscow, Russia

Email: [email protected]

Abstract

This paper addresses how to solve non-smooth convex and strongly convex optimization problems with functional constraints. The introduced Mirror Descent (MD) method with adaptive stepsizes is shown to have a better convergence rate than MD with fixed stepsizes due to the improved constant. For certain types of constraints, the method is proved to generate a dual solution. For the strongly convex case, the ’restart’ technique is applied.

I Introduction

Optimization of non-smooth functions under constraints is attracting widespread interest in large-scale optimization and its applications [1], [2]. Several methods address problems of this kind, for example the bundle-level method [3], penalty methods [4], [5], and the method of Lagrange multipliers [6]. Among them, Mirror Descent (MD) [7], [8] is regarded as a simple method for non-smooth convex optimization.

In this paper, we propose a modification of MD in which the stepsizes, and hence the rate of convergence, no longer depend on the global Lipschitz constant [10] but on the norms of the subgradients at the current points. These norms are averaged in a certain sense and take the place of the Lipschitz constant. If the constraint can be represented as the maximum of convex functions, which often arises in applications with many scalar constraints, the proposed method also makes it possible to construct a dual solution. The idea of restarts [11] is adopted to construct the algorithm for the case of a strongly convex objective and constraints. Both proposed methods are optimal in terms of the lower bounds [7].

The paper is organized as follows: in Section II we state the problem and notation; in Section III we describe the MD algorithm with adaptive stepsizes and prove the convergence theorem for it; Section IV is focused on the strongly convex case with restarting MD algorithm and theoretical estimates of its convergence; finally, Section V is about duality of the proposed MD method.

II Preliminaries and Problem Statement

Let $E$ be an $n$-dimensional vector space. Let $\lVert\cdot\rVert$ be an arbitrary norm in $E$ and $\lVert\cdot\rVert_{*}$ be the conjugate norm in $E^{*}$:

$$\lVert\xi\rVert_{*}=\max_{x}\big\{\langle\xi,x\rangle:\lVert x\rVert\leq 1\big\}.$$

Let $\mathcal{X}\subset E$ be a closed convex set. We consider the two convex functions $f:\mathcal{X}\rightarrow\mathbb{R}$ and $g:\mathcal{X}\rightarrow\mathbb{R}$ to be subdifferentiable and Lipschitz continuous, i.e.

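$$\forall x,y\in\mathcal{X}\ \ \exists\,\nabla f(x):\quad f(y)\geq f(x)+\langle\nabla f(x),y-x\rangle,\qquad\lVert\nabla f(x)\rVert_{*}<\infty,$$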

and the same goes for $g$ .

We focus on the problem expressed in the form

$$f(x)\to\min_{x\in\mathcal{X}},\tag{1}$$

$$\text{s.t. } g(x)\leq 0.\tag{2}$$

Denote by $x_{*}$ the genuine solution of problem (1), (2).

Assume that we are equipped with the first-order oracle, which given the point $x\in\mathcal{X}$ returns the values of $\nabla f(x),\nabla g(x),$ and $g(x)$ .

Consider $d:\mathcal{X}\rightarrow\mathbb{R}$ to be a distance generating function (d.g.f.) which is continuously differentiable and strongly convex with modulus 1 w.r.t. the norm $\lVert\cdot\rVert$, i.e.

$$\forall x,y\in\mathcal{X}\quad\langle d'(x)-d'(y),x-y\rangle\geq\lVert x-y\rVert^{2},$$

and assume that

$$\min_{x\in\mathcal{X}}d(x)=d(0).$$

Suppose we are given a constant $\Theta_{0}$ such that

$$d(x_{*})\leq\Theta_{0}^{2}.$$

Note that if there is a set of optimal points $\mathcal{X}_{*}$, then we may assume that

$$\min_{x_{*}\in\mathcal{X}_{*}}d(x_{*})\leq\Theta_{0}^{2}.$$

For all $x,y\in\mathcal{X}$ consider the corresponding Bregman divergence

$$V(x,y)=d(y)-d(x)-\langle d'(x),y-x\rangle.$$

For all $x\in\mathcal{X}$ , $y\in E^{*}$ define the proximal mapping operator

$$\mathrm{Mirr}_{x}(y)=\arg\min_{u\in\mathcal{X}}\big\{\langle y,u\rangle+V(x,u)\big\}.$$

We make the simplicity assumption, which means that $\mathrm{Mirr}_{x}(y)$ is easily computable.
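To illustrate what "easily computable" can mean, the following sketch gives two standard closed forms of the proximal mapping. They are textbook examples rather than setups taken from this paper: the Euclidean d.g.f. $d(u)=\tfrac{1}{2}\lVert u\rVert_{2}^{2}$, for which the mapping is a projected step, and the negative entropy on the probability simplex, for which it is a multiplicative update. The function names and the `project` argument are hypothetical.

```python
import numpy as np

def mirr_euclidean(x, y, project=lambda u: u):
    """Mirr_x(y) for d(u) = 0.5 * ||u||_2^2, i.e. V(x, u) = 0.5 * ||u - x||_2^2.

    `project` is a hypothetical Euclidean projector onto the feasible set;
    the identity default corresponds to the whole space.
    """
    return project(x - y)

def mirr_entropy(x, y):
    """Mirr_x(y) on the probability simplex for d(u) = sum(u * log(u)).

    Here V(x, u) is the KL divergence KL(u || x), and the minimizer of
    <y, u> + KL(u || x) over the simplex is proportional to x * exp(-y).
    Shifting y by its minimum only improves numerical stability.
    """
    w = x * np.exp(-(y - y.min()))
    return w / w.sum()

x = np.array([0.5, 0.3, 0.2])      # current point (on the simplex)
y = np.array([1.0, 0.0, -1.0])     # scaled subgradient h * grad
step_euclid = mirr_euclidean(x, y)
step_entropy = mirr_entropy(x, y)
print(step_euclid)
print(step_entropy)
```

In both cases one call costs $O(n)$ arithmetic operations (plus the projection in the Euclidean case), which is the kind of cost the simplicity assumption has in mind.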

III Mirror Descent for Constrained Optimization

The following algorithm is proposed to solve the problem (1), (2).

Denote $[N]=\{0,\dots,N-1\}$ and $J=[N]\setminus I$, where $I$ is the set of indices $i$ for which $g(x^{i})\leq\varepsilon$.
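The following is a minimal Python sketch of an adaptive MD loop that is consistent with the proof of Theorem 1 below; it is not a reproduction of the paper's Algorithm 1 listing. It assumes the Euclidean setup ($d(x)=\tfrac{1}{2}\lVert x\rVert_{2}^{2}$, so the proximal mapping is a projected step), stepsizes $h_{i}=\varepsilon/M_{i}^{2}$ with $M_{i}$ the norm of the subgradient used at step $i$, and the stopping rule $\sum_{i}1/M_{i}^{2}\geq 2\Theta_{0}^{2}/\varepsilon^{2}$. The function name, its arguments, and the toy problem are hypothetical.

```python
import numpy as np

def adaptive_md(f_grad, g, g_grad, x0, eps, theta0_sq, project=lambda u: u):
    """Adaptive MD sketch (Euclidean setup, assumptions as in the lead-in)."""
    x = np.asarray(x0, dtype=float)
    inv_m_sq_sum = 0.0               # running sum of 1 / M_i^2
    weighted_sum = np.zeros_like(x)  # sum of h_i * x^i over i in I
    weight_total = 0.0               # sum of h_i over i in I
    n_calls = 0
    while inv_m_sq_sum < 2.0 * theta0_sq / eps**2:
        productive = g(x) <= eps     # i in I if true, i in J otherwise
        grad = f_grad(x) if productive else g_grad(x)
        m_sq = float(np.dot(grad, grad))   # M_i^2 (the l2 norm is self-dual)
        if m_sq == 0.0:
            if productive:
                return x, n_calls    # x minimizes f and is eps-feasible
            raise ValueError("zero subgradient of g at an infeasible point")
        h = eps / m_sq               # adaptive stepsize
        if productive:
            weighted_sum += h * x
            weight_total += h
        x = project(x - h * grad)    # Mirr_{x^i}(h_i * grad)
        inv_m_sq_sum += 1.0 / m_sq
        n_calls += 1
    return weighted_sum / weight_total, n_calls

# Toy problem: minimize |x1 - 2| + |x2 - 2| subject to x1 + x2 - 2 <= 0.
# The optimal value is 2; theta0_sq = 2 is a valid upper bound on d(x_*).
f = lambda x: float(np.abs(x - 2.0).sum())
f_grad = lambda x: np.sign(x - 2.0)
g = lambda x: float(x.sum() - 2.0)
g_grad = lambda x: np.ones_like(x)

x_bar, n_calls = adaptive_md(f_grad, g, g_grad, np.zeros(2), eps=0.05, theta0_sq=2.0)
print(x_bar, f(x_bar) - 2.0, g(x_bar), n_calls)
```

Only the productive points enter the weighted average, which is why the output can be nearly feasible even though individual iterates violate the constraint by more than $\varepsilon$.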

We are going to adopt the following lemma [9].

Lemma 1

Let $f$ be some convex subdifferentiable function over the convex set $\mathcal{X}$ . Let the sequence $\{x^{i}\}$ be defined by the update

$$x^{i+1}=\mathrm{Mirr}_{x^{i}}\big(h_{i}\nabla f(x^{i})\big).$$

Then, for any $x\in\mathcal{X}$

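$$h_{i}\big(f(x^{i})-f(x)\big)\leq\frac{h_{i}^{2}}{2}\lVert\nabla f(x^{i})\rVert_{*}^{2}+V(x^{i},x)-V(x^{i+1},x).\tag{3}$$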

Theorem 1

The point $\bar{x}^{N}$ supplied by Algorithm 1 satisfies

$$f(\bar{x}^{N})-f(x_{*})\leq\varepsilon,\qquad g(\bar{x}^{N})\leq\varepsilon$$

for the number of oracle calls equal to

$$N=\Big\lceil\frac{2M^{2}\Theta_{0}^{2}}{\varepsilon^{2}}\Big\rceil,$$

where $M$ is found from

$$\frac{N}{M^{2}}=\sum_{i\in[N]}\frac{1}{M_{i}^{2}}.$$

Proof:

By the definition of $\bar{x}^{N}$ and the convexity of $f$ ,

$$\sum_{i\in I}h_{i}\big(f(\bar{x}^{N})-f(x_{*})\big)\leq\sum_{i\in I}h_{i}\big(f(x^{i})-f(x_{*})\big).$$

Using (3) and the definitions of the stepsizes, consider the summation

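$$\sum_{i\in I}h_{i}\big(f(x^{i})-f(x_{*})\big)+\sum_{i\in J}h_{i}\big(g(x^{i})-g(x_{*})\big)\leq\sum_{i\in I}\frac{h_{i}^{2}M_{i}^{2}}{2}+\sum_{i\in J}\frac{h_{i}^{2}M_{i}^{2}}{2}+\sum_{i\in[N]}\big(V(x^{i},x_{*})-V(x^{i+1},x_{*})\big)\leq\frac{\varepsilon}{2}\sum_{i\in[N]}h_{i}+\Theta_{0}^{2}.$$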

Since for $i\in J$

$$g(x^{i})-g(x_{*})\geq g(x^{i})>\varepsilon,$$

recalling (7), we get

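$$\sum_{i\in I}h_{i}\big(f(\bar{x}^{N})-f(x_{*})\big)<\frac{\varepsilon}{2}\sum_{i\in[N]}h_{i}-\varepsilon\sum_{i\in J}h_{i}+\Theta_{0}^{2}=\varepsilon\sum_{i\in I}h_{i}-\frac{\varepsilon^{2}}{2}\sum_{i\in[N]}\frac{1}{M_{i}^{2}}+\Theta_{0}^{2}\leq\varepsilon\sum_{i\in I}h_{i}.$$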

Since the inequality is strict, the set $I$ cannot be empty.
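The step from the bound $\frac{\varepsilon}{2}\sum_{i\in[N]}h_{i}-\varepsilon\sum_{i\in J}h_{i}+\Theta_{0}^{2}$ to $\varepsilon\sum_{i\in I}h_{i}$ can be spelled out as follows. The derivation assumes the stepsize rule $h_{i}=\varepsilon/M_{i}^{2}$ and a stopping rule $\sum_{i\in[N]}1/M_{i}^{2}\geq 2\Theta_{0}^{2}/\varepsilon^{2}$; both are inferred from the statement of Theorem 1 rather than quoted from the algorithm listing.

```latex
% Assumptions: h_i = \varepsilon / M_i^2 for all i \in [N], and
% \sum_{i \in [N]} 1/M_i^2 \ge 2\Theta_0^2 / \varepsilon^2 at termination.
\begin{align*}
\frac{\varepsilon}{2}\sum_{i\in[N]}h_i-\varepsilon\sum_{i\in J}h_i
  &=\varepsilon\sum_{i\in I}h_i-\frac{\varepsilon}{2}\sum_{i\in[N]}h_i
  && \text{since } [N]=I\cup J,\ I\cap J=\emptyset,\\
\frac{\varepsilon}{2}\sum_{i\in[N]}h_i
  &=\frac{\varepsilon^{2}}{2}\sum_{i\in[N]}\frac{1}{M_i^{2}}
  \;\ge\;\Theta_0^{2}
  && \text{by the stepsize and stopping rules.}
\end{align*}
```

Adding $\Theta_{0}^{2}$ to the first line and using the second shows that the whole expression is at most $\varepsilon\sum_{i\in I}h_{i}$, so dividing by $\sum_{i\in I}h_{i}>0$ gives $f(\bar{x}^{N})-f(x_{*})<\varepsilon$.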

For $i\in I$ we have $g(x^{i})\leq\varepsilon$. Then, by the definition of $\bar{x}^{N}$ and the convexity of $g$,

$$\sum_{i\in I}h_{i}\,g(\bar{x}^{N})\leq\sum_{i\in I}h_{i}\,g(x^{i})\leq\varepsilon\sum_{i\in I}h_{i}.$$

∎

It is worth mentioning that the constant $M$ is a kind of average of the subgradient norms at the points visited by the algorithm, rather than the largest possible Lipschitz constant over the set $\mathcal{X}$.
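A small numerical illustration of this remark, using the relation $N/M^{2}=\sum_{i\in[N]}1/M_{i}^{2}$ from Theorem 1. The norms below are made-up values, and the helper name is hypothetical.

```python
import math

def averaged_constant(grad_norms):
    """Return M defined by N / M^2 = sum_i 1 / M_i^2."""
    n = len(grad_norms)
    return math.sqrt(n / sum(1.0 / m**2 for m in grad_norms))

norms = [1.0, 1.0, 10.0]   # hypothetical subgradient norms M_i
M = averaged_constant(norms)
print(M, max(norms))       # M stays close to the small norms, far below the max
```

Because $M^{2}$ is the harmonic mean of the $M_{i}^{2}$, a few large subgradients have little effect on it, whereas a global Lipschitz constant would have to cover the largest of them.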

IV Restarting Mirror Descent

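In this section we assume that $f$ and $g$ in problem (1), (2) are $\mu$-strongly convex on $\mathcal{X}$, i.e.

$$\forall x,y\in\mathcal{X}\quad f(y)\geq f(x)+\langle\nabla f(x),y-x\rangle+\frac{\mu}{2}\lVert y-x\rVert^{2},$$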

and the same goes for $g$ .

The d.g.f. is also assumed to be bounded on the unit ball, that is,

$$\forall x\in\mathcal{X}:\lVert x\rVert\leq 1\quad d(x)\leq\frac{\omega_{n}}{2},$$

where $\omega_{n}$ is some dimension-dependent constant which in most setups asymptotically behaves as $O\big{(}\log(n)\big{)}$ [9].

Suppose we are given a constant $R_{0}$ such that

$$\lVert x_{0}-x_{*}\rVert^{2}\leq R_{0}^{2},\quad\text{where } x_{0}=\arg\min_{x\in\mathcal{X}}d(x).$$

The following algorithm is proposed to solve the problem (1), (2) in the case of strong convexity [11].

At each iteration $k$ of the loop the algorithm performs a restart: it calls the $\mathrm{MD}$ procedure described in the previous section with an accuracy $\varepsilon_{k}$ that decreases with each restart.

Denote by $N_{1},\dots,N_{K}$ the numbers of oracle calls at each restart in Algorithm 2 and by $[N_{1}],\dots,[N_{K}]$ the corresponding sets of indices.
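The restart idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's Algorithm 2: it uses the Euclidean setup with $\mathcal{X}=\mathbb{R}^{n}$, an inner adaptive MD run with stepsizes $h_{i}=\varepsilon_{k}/M_{i}^{2}$, and the schedule $R_{k}^{2}=R_{k-1}^{2}/2$, $\varepsilon_{k}=\mu R_{k}^{2}/2$, chosen so that an $\varepsilon_{k}$-solution satisfies $\lVert x_{k}-x_{*}\rVert^{2}\leq R_{k}^{2}$ by Lemma 2. All function names and the toy problem are hypothetical.

```python
import math
import numpy as np

def md_run(f_grad, g, g_grad, x_start, eps, theta_sq):
    """One adaptive MD run (Euclidean sketch) started at x_start."""
    x = np.asarray(x_start, dtype=float)
    inv_sum, num, den, calls = 0.0, np.zeros_like(x), 0.0, 0
    while inv_sum < 2.0 * theta_sq / eps**2:
        productive = g(x) <= eps
        grad = f_grad(x) if productive else g_grad(x)
        m_sq = float(grad @ grad)
        if m_sq == 0.0:
            if productive:
                return x, calls          # x minimizes f and is eps-feasible
            raise ValueError("zero subgradient of g at an infeasible point")
        h = eps / m_sq
        if productive:
            num, den = num + h * x, den + h
        x = x - h * grad                 # Mirr step for a shifted quadratic d.g.f.
        inv_sum += 1.0 / m_sq
        calls += 1
    return num / den, calls

def restart_md(f_grad, g, g_grad, x0, r0_sq, mu, eps):
    """Restarted MD with the assumed schedule R_k^2 = R_{k-1}^2 / 2."""
    x, r_sq, total = np.asarray(x0, dtype=float), r0_sq, 0
    n_restarts = max(1, math.ceil(math.log2(mu * r0_sq / (2.0 * eps))))
    for _ in range(n_restarts):
        theta_sq = 0.5 * r_sq            # bounds 0.5 * ||x_{k-1} - x_*||^2
        r_sq *= 0.5                      # R_k^2
        eps_k = 0.5 * mu * r_sq          # accuracy of the k-th restart
        x, calls = md_run(f_grad, g, g_grad, x, eps_k, theta_sq)
        total += calls
    return x, total

# Toy problem (mu = 2): minimize ||x - c||^2 + |x_2| subject to ||x||^2 - 1 <= 0,
# with c = (2, 0); the solution is x_* = (1, 0) with optimal value 1.
c = np.array([2.0, 0.0])
f = lambda x: float((x - c) @ (x - c) + abs(x[1]))
f_grad = lambda x: 2.0 * (x - c) + np.array([0.0, np.sign(x[1])])
g = lambda x: float(x @ x - 1.0)
g_grad = lambda x: 2.0 * x

x_K, total_calls = restart_md(f_grad, g, g_grad, np.zeros(2), r0_sq=1.0, mu=2.0, eps=1e-2)
print(x_K, f(x_K) - 1.0, g(x_K), total_calls)
```

Each restart begins at the output of the previous one, so the distance bound passed to the inner run halves every time while the target accuracy halves with it.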

For the sake of brevity, we state the following lemma without proof.

Lemma 2

Suppose $f$ and $g$ are $\mu$-strongly convex functions w.r.t. the norm $\lVert\cdot\rVert$ and $x_{*}$ is the genuine solution of problem (1), (2). If for some $x\in\mathcal{X}$

$$f(x)-f(x_{*})\leq\varepsilon,\qquad g(x)\leq\varepsilon,$$

then

$$\frac{\mu}{2}\lVert x-x_{*}\rVert^{2}\leq\varepsilon.$$

Now we are ready to prove the following theorem.

Theorem 2

The point $x_{K}$ supplied by Algorithm 2 satisfies

$$f(x_{K})-f(x_{*})\leq\varepsilon,\qquad g(x_{K})\leq\varepsilon$$

for the total number of oracle calls equal to

$$N=N_{1}+\dots+N_{K}=\Big\lceil\frac{4M^{2}\omega_{n}}{\mu\varepsilon}\Big\rceil,$$

where $M$ is found from

$$\frac{N}{M^{2}}=\sum_{i\in[N_{1}]}\frac{1}{M_{i}^{2}}+\dots+\sum_{i\in[N_{K}]}\frac{1}{M_{i}^{2}}.$$

Proof:

Observe [10] that the function $d_{k-1}(x)$ defined in Algorithm 2 is 1-strongly convex w.r.t. the norm $\lVert\cdot\rVert/R_{k-1}$ . The conjugate of this norm is $R_{k-1}\lVert\cdot\rVert_{*}$ . It means that at each restart the actual Lipschitz constants are $M_{i}R_{k-1}$ . Then, by (9) and (10) at the end of the first restart we obtain

[TABLE]

which by Theorem 1 guarantees the $\varepsilon_{1}$ -solution of the problem.

Further, by Lemma 2, after the $(k-1)$ th restart it holds that

[TABLE]

Due to the choice of the d.g.f. $d_{k-1}(x)$ , the starting point of the $k$ th restart is $x_{k-1}$ and

[TABLE]

In that way we have justified the redefinition of the d.g.f. and the ’distance’ argument of the $\mathrm{MD}$ procedure.

After the $k$ th restart by the definition of $R_{k}$ and $\varepsilon_{k}$ we obtain

[TABLE]

Thus, for the whole $\mathrm{RestartMD}$ procedure considering the definition of $K$

[TABLE]

∎

Note that, due to Lemma 2, the argument $x_{k}$ converges to $x_{*}$ along with the function values, which is a typical property of strongly convex optimization.

V Dual Problem Solution

Following [12], in this section we consider problems of type (1), (2) in which the constraint has the form

[TABLE]

Consider the dual problem

[TABLE]

Denote by $\lambda_{*}=(\lambda_{*1},\dots,\lambda_{*\mathcal{M}})$ the genuine solution of (15). Then, by the weak duality property [6] we have

[TABLE]

Assume that Slater’s condition holds, i.e. there exists $x\in\mathcal{X}$ such that $g(x)<0$ . This ensures strong duality $\Delta(x_{*},\lambda_{*})=0$ . It means that if the algorithm is able to generate the dual solution $\bar{\lambda}^{N}=(\bar{\lambda}_{1}^{N},\dots,\bar{\lambda}_{\mathcal{M}}^{N})$ of the problem (1), (2) with (14), the accuracy of this solution can be estimated via the size of the duality gap $\Delta(\bar{x}^{N},\bar{\lambda}^{N})$ .

As long as the constraints are of the form (14), we can define the function

[TABLE]

Theorem 3

Consider Algorithm 1 and define dual Lagrange multipliers as

[TABLE]

where

[TABLE]

Then, the point $\bar{x}^{N}$ supplied by Algorithm 1 satisfies

[TABLE]

for the number of oracle calls equal to

[TABLE]

where $M$ is found from

[TABLE]

Proof:

Combining (7) and (8) together with (16) we obtain

[TABLE]

Recalling (17) and rearranging the terms,

[TABLE]

∎

VI Conclusion

We proved that the MD algorithm with adaptive stepsizes achieves optimal rates in both the convex and strongly convex cases, with an improved constant in place of the global Lipschitz constant. For problems with constraints in the form of a maximum of convex functions, we showed that the method also produces a dual solution. It remains open, however, whether high-probability bounds can be obtained for adaptive stepsizes in the case of a stochastic oracle.

Acknowledgment

The author gratefully acknowledges the help and valuable discussion kindly provided by Dr. Gasnikov.

This research was funded by the Russian Science Foundation (project 17-11-01027).

Bibliography

[1] S. Shpirko and Yu. Nesterov, "Primal-dual Subgradient Methods for Huge-scale Linear Conic Problem", SIAM Journal on Optimization, no. 24, pp. 1444-1457, 2014.
[2] A. Ben-Tal and A. Nemirovski, "Robust Truss Topology Design via Semidefinite Programming", SIAM J. Optim., vol. 7, no. 4, pp. 991-1016, Nov. 1997.
[3] Yu. Nesterov, Introduction to Convex Optimization. Moscow, Russia: MCCME, 2010.
[4] F. Vasilyev, Optimization Methods. Moscow, Russia: FP, 2002.
[5] G. Lan, "Gradient Sliding for Composite Optimization", Math. Program., vol. 159, no. 1-2, pp. 201-235, 2016.
[6] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY: Cambridge University Press, 2004.
[7] A. Nemirovski and D. Yudin, Problem Complexity and Method Efficiency in Optimization. New York, NY: Wiley, 1983.
[8] A. Beck and M. Teboulle, "Mirror Descent and Nonlinear Projected Subgradient Methods for Convex Optimization", Operations Research Letters, vol. 31, pp. 167-175, 2003.