Inexact Model: A Framework for Optimization and Variational Inequalities

Fedor Stonyakin; Alexander Gasnikov; Alexander Tyurin; Dmitry Pasechnyuk; Artem Agafonov; Pavel Dvurechensky; Darina Dvinskikh; Alexey Kroshnin; Victorya Piskunova

arXiv:1902.00990·math.OC·January 7, 2020·Optim. Methods Softw.


TL;DR

This paper introduces a versatile framework for first-order optimization and variational inequalities using inexact models, unifying existing methods and enabling the development of new algorithms with optimal complexity.

Contribution

The paper presents a general inexact model framework that unifies many optimization methods and introduces a universal method for variational inequalities with optimal complexity.

Findings

01 Reproduces known optimization algorithms as special cases

02 Develops a universal method for variational inequalities

03 Achieves optimal complexity without prior smoothness knowledge


Equations

$$f(x)\to\min_{x\in Q}.$$

$$f(y)+\langle\nabla_{\delta}f(y),x-y\rangle-\delta\leq f(x)\leq f(y)+\langle\nabla_{\delta}f(y),x-y\rangle+\frac{L}{2}\|x-y\|_{2}^{2}+\delta,$$

$$\alpha\langle\nabla_{\delta}f(y),x-y\rangle+\frac{1}{2}\|x-y\|_{2}^{2}\to\min_{x\in Q},$$

$$f(x_{N})-f(x_{*})=O\left(\frac{LR^{2}}{N^{p}}+N^{1-p}\tilde{\delta}+N^{p-1}\delta\right),$$

$$f(y)+\psi_{\delta}(x,y)-\delta\leq f(x)\leq f(y)+\psi_{\delta}(x,y)+\frac{L}{2}\|x-y\|_{2}^{2}+\delta,$$

$$\alpha\psi_{\delta}(x,y)+\frac{1}{2}\|x-y\|_{2}^{2}\to\min_{x\in Q},$$

$$0\leq f(x)-\left(f_{\delta}(y)+\psi_{\delta}(x,y)\right)\leq LV[y](x)+\delta$$

$$\|\nabla f(x)-\nabla f(y)\|_{*}\leq L\|x-y\|,\quad\forall x,y\in Q.$$

$$0\leq f(x)-f(y)-\langle\nabla f(y),x-y\rangle\leq\frac{L}{2}\|x-y\|^{2}\quad\forall x,y\in Q.$$

$$\psi_{\delta}(x,y):=\langle\nabla f(y),x-y\rangle\quad\forall x,y\in Q$$

$$f(x):=g(x)+h(x)\to\min_{x\in Q},$$

$$0\leq f(x)-f(y)-\langle\nabla g(y),x-y\rangle-h(x)+h(y)\leq\frac{L}{2}\|x-y\|^{2},\quad\forall x,y\in Q.$$

$$\psi_{\delta}(x,y)=\langle\nabla g(y),x-y\rangle+h(x)-h(y),$$

$$f(x):=g(g_{1}(x),\dots,g_{m}(x))\to\min_{x\in Q}$$

$$0\leq f(x)-g\left(g_{1}(y)+\langle\nabla g_{1}(y),x-y\rangle,\dots,g_{m}(y)+\langle\nabla g_{m}(y),x-y\rangle\right)\leq M\frac{\sum_{i=1}^{m}L_{i}}{2}\|x-y\|^{2}\quad\forall x,y\in Q.$$

$$0\leq f(x)-f(y)-g\left(g_{1}(y)+\langle\nabla g_{1}(y),x-y\rangle,\dots,g_{m}(y)+\langle\nabla g_{m}(y),x-y\rangle\right)+f(y)\leq M\frac{\sum_{i=1}^{m}L_{i}}{2}\|x-y\|^{2}\quad\forall x,y\in Q.$$

$$\psi_{\delta}(x,y)=g\left(g_{1}(y)+\langle\nabla g_{1}(y),x-y\rangle,\dots,g_{m}(y)+\langle\nabla g_{m}(y),x-y\rangle\right)-f(y),$$

$$\psi_{\delta}(x,y)=f(x)-f(y)$$

$$f(x):=\min_{z\in Q}F(z,x)\to\min_{x\in\mathbb{R}^{n}}.$$

$$\|\nabla F(z^{\prime},x^{\prime})-\nabla F(z,x)\|_{2}\leq L\|(z^{\prime},x^{\prime})-(z,x)\|_{2},\quad\forall z,z^{\prime}\in Q,\ x,x^{\prime}\in\mathbb{R}^{n}.$$

$$\langle\nabla_{z}F(z_{\delta}(x),x),z-z_{\delta}(x)\rangle\geq-\delta,\quad\forall z\in Q,$$

$$\psi_{\delta}(x,y)=\langle\nabla_{z}F(z_{\delta}(y),y),x-y\rangle$$

$$f_{\delta_{k}}(x_{k+1})\leq f_{\delta_{k}}(x_{k})+\psi_{\delta_{k}}(x_{k+1},x_{k})+L_{k+1}V[x_{k}](x_{k+1})+\delta_{k},$$

$$\phi_{k+1}(x)=L_{k+1}\cdot\left(V[x_{k}](x)+\alpha_{k+1}\psi_{\delta_{k}}(x,x_{k})\right),\quad x_{k+1}:=\arg\min_{x\in Q}\phi_{k+1}(x).$$

$$f(\bar{x}_{N})-f(x_{*})\leq\frac{R^{2}}{A_{N}}+\frac{2}{A_{N}}\sum_{k=0}^{N-1}\alpha_{k+1}\delta_{k}\leq\frac{2LR^{2}}{N}+\frac{2}{A_{N}}\sum_{k=0}^{N-1}\alpha_{k+1}\delta_{k},$$

$$f_{\delta_{k}}(x_{k+1})\leq f_{\delta_{k}}(y_{k+1})+\psi_{\delta_{k}}(x_{k+1},y_{k+1})+\frac{L_{k+1}}{2}\|x_{k+1}-y_{k+1}\|^{2}+\delta_{k},$$

$$y_{k+1}:=\frac{\alpha_{k+1}u_{k}+A_{k}x_{k}}{A_{k+1}}$$

$$\phi_{k+1}(x)=L_{k+1}\cdot\left(V[u_{k}](x)+\alpha_{k+1}\psi_{\delta_{k}}(x,y_{k+1})\right),\quad u_{k+1}:=\operatorname*{argmin}_{x\in Q}\phi_{k+1}(x)$$

$$x_{k+1}:=\frac{\alpha_{k+1}u_{k+1}+A_{k}x_{k}}{A_{k+1}}$$


Full text

Fedor Stonyakin (V. Vernadsky Crimean Federal University; Moscow Institute of Physics and Technology)

Alexander Gasnikov (Moscow Institute of Physics and Technology; Institute for Information Transmission Problems; National Research University Higher School of Economics)

Alexander Tyurin (National Research University Higher School of Economics)

Dmitry Pasechnyuk (239-th school of St. Petersburg)

Artem Agafonov (Moscow Institute of Physics and Technology)

Pavel Dvurechensky (Weierstrass Institute for Applied Analysis and Stochastics)

Darina Dvinskikh (Weierstrass Institute for Applied Analysis and Stochastics)

Alexey Kroshnin (Institute for Information Transmission Problems; National Research University Higher School of Economics)

Victorya Piskunova (V. Vernadsky Crimean Federal University)

Inexact Model: A Framework for Optimization and Variational Inequalities

(September 2, 2018)

Abstract

In this paper we propose a general algorithmic framework for first-order methods in optimization in a broad sense, including minimization problems, saddle-point problems and variational inequalities. This framework allows to obtain many known methods as a special case, the list including accelerated gradient method, composite optimization methods, level-set methods, proximal methods. The idea of the framework is based on constructing an inexact model of the main problem component, i.e. objective function in optimization or operator in variational inequalities. Besides reproducing known results, our framework allows to construct new methods, which we illustrate by constructing a universal method for variational inequalities with composite structure. This method works for smooth and non-smooth problems with optimal complexity without a priori knowledge of the problem smoothness. We also generalize our framework for strongly convex objectives and strongly monotone variational inequalities.

keywords:

Convex optimization, composite optimization, proximal method, level-set method, variational inequality, universal method, mirror prox, acceleration, relative smoothness

1 Introduction

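We consider the convex optimization problem

$$f(x)\to\min_{x\in Q}.\tag{1}$$

It is well known (see [Devolder et al.(2014)Devolder, Glineur, and Nesterov, Dvurechensky et al.(2017a)Dvurechensky, Gasnikov, and Kamzolov]) that if for all $x,y\in Q$

$$f(y)+\langle\nabla_{\delta}f(y),x-y\rangle-\delta\leq f(x)\leq f(y)+\langle\nabla_{\delta}f(y),x-y\rangle+\frac{L}{2}\|x-y\|_{2}^{2}+\delta,$$

then, assuming that for a proper $\alpha$ we can solve, with ‘precision’ $\tilde{\delta}$ at each iteration, the auxiliary problems

$$\alpha\langle\nabla_{\delta}f(y),x-y\rangle+\frac{1}{2}\|x-y\|_{2}^{2}\to\min_{x\in Q},$$

one can prove that the Gradient Method (GM) and the Fast Gradient Method (FGM) converge as follows:

$$f(x_{N})-f(x_{*})=O\left(\frac{LR^{2}}{N^{p}}+N^{1-p}\tilde{\delta}+N^{p-1}\delta\right),\tag{2}$$

where $p=1$ for GM and $p=2$ for FGM, and $x_{*}$ is a solution of (1).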
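To make the trade-off in this rate concrete, the worked substitution below plugs $p=1$ and $p=2$ into the bound $O\left(\frac{LR^{2}}{N^{p}}+N^{1-p}\tilde{\delta}+N^{p-1}\delta\right)$. It is only an instantiation of that formula, with no new constants or assumptions.

```latex
\begin{align*}
p=1\ (\text{GM}):\quad & f(x_N)-f(x_*) = O\!\left(\frac{LR^{2}}{N}+\tilde{\delta}+\delta\right),\\
p=2\ (\text{FGM}):\quad & f(x_N)-f(x_*) = O\!\left(\frac{LR^{2}}{N^{2}}+\frac{\tilde{\delta}}{N}+N\delta\right).
\end{align*}
```

Read this way, the bound for GM does not accumulate the oracle error $\delta$, while the bound for FGM multiplies it by $N$ in exchange for the faster $1/N^{2}$ leading term.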

The first goal of this paper is to show that if, instead of the function (model) $\langle\nabla_{\delta}f(y),x-y\rangle$, which is linear in $x$, we take an arbitrary function $\psi_{\delta}(x,y)$ (with $\psi_{\delta}(x,x)=0$ ) convex in $x$ such that for arbitrary $x,y\in Q$

$$f(y)+\psi_{\delta}(x,y)-\delta\leq f(x)\leq f(y)+\psi_{\delta}(x,y)+\frac{L}{2}\|x-y\|_{2}^{2}+\delta,$$

then, assuming that for a proper $\alpha$ we can solve, with ‘precision’ $\tilde{\delta}$ at each iteration, the auxiliary problems

$$\alpha\psi_{\delta}(x,y)+\frac{1}{2}\|x-y\|_{2}^{2}\to\min_{x\in Q},$$

one can prove that the corresponding ‘model’ versions of the Gradient Method (GM) and the Fast Gradient Method (FGM) converge with the same rates (2). (Footnote 1: This goal has already been realized in our previous works [Gasnikov(2017), Tyurin and Gasnikov(2017)]. Here we formulate these results for completeness.) It should be noted that not every variant of the fast gradient method is well suited to such a ‘model generalization’. Significantly, the proper variant of FGM is based on the accelerated mirror-descent-type method of [Tseng(2008), Lan(2012), Dvurechensky et al.(2017b)Dvurechensky, Gasnikov, Omelchenko, and Tiurin], which solves only one auxiliary problem of mirror descent type (not dual averaging) at each iteration.

In particular, as simple corollaries, these results yield the standard facts about the convergence rates of composite (accelerated) gradient methods presented in [Beck and Teboulle(2009), Nesterov(2013)] for $f(x):=g(x)+h(x)$ , $\psi_{\delta}(x,y)=\langle\nabla g(y),x-y\rangle+h(x)-h(y)$ and of level (accelerated) gradient methods from [Nemirovskii and Nesterov(1985), Lan(2015)] for $f(x):=g(g_{1}(x),\dots,g_{m}(x))$ , $\psi_{\delta}(x,y)=g(g_{1}(y)+\langle\nabla g_{1}(y),x-y\rangle,\dots,g_{m}(y)+\langle\nabla g_{m}(y),x-y\rangle)-f(y)$ .

The second goal is to generalize the results mentioned above to the non-Euclidean prox set-up. (Footnote 2: The idea was proposed in [Gasnikov(2017)]. Here we realize this idea more generally.) Moreover, for GM we combine our model concept with the concept of relative smoothness from [Bauschke et al.(2016)Bauschke, Bolte, and Teboulle, Lu et al.(2018)Lu, Freund, and Nesterov]. As a byproduct we reproduce the proximal gradient method in the non-Euclidean set-up [Chen and Teboulle(1993)] (choosing $\psi_{\delta}(x,y)=f(x)-f(y)$ ). We demonstrate the value of the reproduced method by applying it to the Wasserstein distance calculation problem with the KL-prox set-up [Dvurechensky et al.(2018a)Dvurechensky, Gasnikov, and Kroshnin, Xie et al.(2018)Xie, Wang, Wang, and Zha, Stonyakin et al.(2019)Stonyakin, Dvinskikh, Dvurechensky, Kroshnin, Kuznetsova, Agafonov, Gasnikov, Tyurin, Uribe, Pasechnyuk, and Artamonov].

The third goal is to supplement the set of examples of inexact gradient oracles from [Devolder et al.(2014)Devolder, Glineur, and Nesterov]. In particular, we consider the set-up $f(x):=\min_{y\in Q}F(y,x)$ (changing $\max$ to $\min$ in [Devolder et al.(2014)Devolder, Glineur, and Nesterov]). (Footnote 3: This example was taken from [Gasnikov et al.(2015)Gasnikov, Dvurechensky, Kamzolov, Nesterov, Spokoiny, Stetsyuk, Suvorikova, and Chernov].) As a byproduct of the Moreau envelope smoothing example from [Devolder et al.(2014)Devolder, Glineur, and Nesterov] we reproduce the Catalyst approach of [Lin et al.(2015)Lin, Mairal, and Harchaoui].

The fourth goal is to generalize the model set-up with relative smoothness to vector fields and monotone variational inequalities (VI). (Footnote 4: We try to implement this goal based on the works [Dvurechensky et al.(2017b)Dvurechensky, Gasnikov, Omelchenko, and Tiurin, Dvurechensky et al.(2018b)Dvurechensky, Gasnikov, Stonyakin, and Titov, Gasnikov(2017), Stonyakin(2019)].) We propose a proper model generalization of the optimal method for VI, Mirror Prox from [Nemirovski(2004)]. As a byproduct, this generalization allows us to partially reproduce the results of [Chambolle and Pock(2011)].

The fifth goal is to propose universal variants (see [Nesterov(2015)]) of the methods described above. To the best of our knowledge, no (optimal) universal method for VI, even without model generality, has been published in English. (Footnote 5: A universal method for VI was first proposed in the Russian-language book [Gasnikov(2017)], which also announces the possibility of a model generalization. The preprint [Stonyakin(2019)] (in Russian) contains the universal model generalization.)

The sixth goal is to generalize the results mentioned above to strongly convex problems and strongly monotone VI. Note that for accelerated methods (FGM) we may use the standard restart scheme (see, e.g., [Dvurechensky et al.(2017b)Dvurechensky, Gasnikov, Omelchenko, and Tiurin]), but for non-accelerated methods (GM) it is possible to eliminate restarts. Moreover, there are different ways to define the model concept in the strongly convex case, which we compare in this paper: i) strongly convex objective $f$ ; ii) function $\psi_{\delta}(y,x)$ strongly convex in $y$ ; iii) as in [Devolder et al.(2013)Devolder, Glineur, Nesterov, et al.].

Although a unified structure of first-order methods is not new (see, e.g., [Nemirovsky and Yudin(1983), Mairal(2013), Ochs et al.(2017)Ochs, Fadili, and Brox]), our approach generalizes only the linear part of the objective function approximation, which allows us to combine more facts and keeps prospects for further generalizations. In particular, our proposed model concept and the corresponding GM and FGM can be considered from a primal-dual point of view as in [Nesterov(2009), Nemirovski et al.(2010)Nemirovski, Onn, and Rothblum] and in block-coordinate generality as in [Dvurechensky et al.(2017c)Dvurechensky, Gasnikov, and Tiurin].

2 Inexact Model for Minimization

2.1 Definitions and Examples

We start with the general notation. Let $E$ be a finite-dimensional real vector space and $E^{*}$ be its dual. We denote the value of a linear function $g\in E^{*}$ at $x\in E$ by $\langle g,x\rangle$ . Let $\|\cdot\|$ be some norm on $E$ , $\|\cdot\|_{*}$ be its dual, defined by $\|g\|_{*}=\max\limits_{x}\big{\{}\langle g,x\rangle,\|x\|\leq 1\big{\}}$ . We use $\nabla f(x)$ to denote any subgradient of a function $f$ at a point $x\in{\rm dom}f$ .
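As a small illustration of the dual-norm definition, the sketch below evaluates $\|g\|_{*}$ when $\|\cdot\|$ is the $\ell_1$ norm. It relies on the standard fact that a linear function attains its maximum over a polytope at a vertex; the function name `dual_norm_l1` is ours and purely illustrative.

```python
import numpy as np

def dual_norm_l1(g):
    """Dual of the l1 norm: max of <g, x> over the unit l1 ball.

    The vertices of the l1 ball are +/- e_i, so the maximum of the
    linear function <g, x> is attained at one of them and equals max_i |g_i|.
    """
    g = np.asarray(g, dtype=float)
    n = g.size
    vertices = np.vstack([np.eye(n), -np.eye(n)])  # rows are +/- e_i
    return float(np.max(vertices @ g))

g = np.array([3.0, -4.0, 1.0])
value = dual_norm_l1(g)  # coincides with the l-infinity norm of g
```

The result agrees with `np.linalg.norm(g, np.inf)`, the familiar pairing of the $\ell_1$ and $\ell_\infty$ norms.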

Consider the convex optimization problem (1).

Definition 2.1.

Suppose that for a given point $y\in Q$ and for all $x\in Q$ the inequality

$$0\leq f(x)-\left(f_{\delta}(y)+\psi_{\delta}(x,y)\right)\leq LV[y](x)+\delta$$

holds for some $\psi_{\delta}(x,y)$ , $f_{\delta}(y)\in[f(y)-\delta;f(y)]$ , $L$ , $\delta>0$ and $V[y](x)=d(x)-d(y)-\langle\nabla d(y),x-y\rangle$ , where $d(x)$ is a convex function on $Q$ . Let $\psi_{\delta}(x,y)$ be convex in $x\in Q$ and satisfy $\psi_{\delta}(x,x)=0$ for all $x\in Q$ . Then we say that $\psi_{\delta}(x,y)$ is a $(\delta,L)$ -model of the function $f$ at the given point $y$ with respect to (w.r.t.) $V[y](x)$ .

Remark 2.2.

The function $V[y](x)$ , defined above as $V[y](x)=d(x)-d(y)-\langle\nabla d(y),x-y\rangle$ , is often called the Bregman divergence [Ben-Tal and Nemirovski(2015)]. Typically, its definition also includes the (1-SC) assumption: $d(x)$ is $1$ -strongly convex on $Q$ w.r.t. the $\|\cdot\|$ -norm. Note that Definition 2.1 does not need this assumption. However, we sometimes also use $V[y](x)$ separately, in the descriptions of the algorithms below and in the corresponding convergence rate theorems. Whenever the (1-SC) condition is additionally required, we state it explicitly, see, e.g., Section 2.3.
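The formula $V[y](x)=d(x)-d(y)-\langle\nabla d(y),x-y\rangle$ can be evaluated directly. The sketch below does so for two standard choices of $d$, picked by us for illustration: the squared Euclidean norm, which gives $\frac{1}{2}\|x-y\|_{2}^{2}$, and the negative entropy on the simplex, which gives the Kullback–Leibler divergence. The helper names are hypothetical.

```python
import numpy as np

def bregman(d, grad_d, x, y):
    """V[y](x) = d(x) - d(y) - <grad d(y), x - y>."""
    return d(x) - d(y) - np.dot(grad_d(y), x - y)

# Choice 1: d(x) = 0.5 * ||x||_2^2  ->  V[y](x) = 0.5 * ||x - y||_2^2
d_euc = lambda x: 0.5 * np.dot(x, x)
grad_euc = lambda x: x

# Choice 2: negative entropy d(x) = sum_i x_i log x_i  ->  V[y](x) = KL(x || y) on the simplex
d_ent = lambda x: np.sum(x * np.log(x))
grad_ent = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])  # both points lie in the interior of the simplex
y = np.array([0.4, 0.4, 0.2])
v_euc = bregman(d_euc, grad_euc, x, y)
v_ent = bregman(d_ent, grad_ent, x, y)
```

For points on the simplex the linear terms cancel, which is why the entropy-based divergence reduces to $\sum_i x_i\log(x_i/y_i)$.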

Remark 2.3.

We replace ‘w.r.t. $V[y](x)$ ’ by ‘w.r.t. $\|\cdot\|$ -norm’ in Definition 2.1 if we use $\frac{1}{2}\|x-y\|^{2}$ instead of $V[y](x)$ . Typically, when one deals with a model w.r.t. the $\|\cdot\|$ -norm, the (1-SC) condition (see Remark 2.2) on $V[y](x)$ is required in the algorithm descriptions and theorem statements below.

Remark 2.4.

Note that model definition from Remark 2.3 is close to the definition from [Devolder et al.(2014)Devolder, Glineur, and Nesterov]: function $f$ has $(\delta,L)$ -oracle at a given point $y$ if there exists a pair $(f_{\delta}(y),\nabla f_{\delta}(y))$ such that for all $x\in Q$ : $0\leq f(x)-f_{\delta}(y)-\langle\nabla f_{\delta}(y),x-y\rangle\leq\frac{L}{2}\left\lVert x-y\right\rVert^{2}+\delta$ .

Now we consider some examples in which the concept of $(\delta,L)$ -model of objective function is useful. Let us start with some standard examples.

Example 2.5.

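Convex optimization problem with Lipschitz continuous gradient, [Nesterov(2004)]

If a convex function $f$ has Lipschitz continuous gradient:

$$\|\nabla f(x)-\nabla f(y)\|_{*}\leq L\|x-y\|,\quad\forall x,y\in Q,$$

then

$$0\leq f(x)-f(y)-\langle\nabla f(y),x-y\rangle\leq\frac{L}{2}\|x-y\|^{2}\quad\forall x,y\in Q.$$

In this case

$$\psi_{\delta}(x,y):=\langle\nabla f(y),x-y\rangle\quad\forall x,y\in Q$$

is a $\left(0,L\right)$ -model of $f$ with $f_{\delta}(y)=f(y)$ at a given point $y$ w.r.t. the $\|\cdot\|$ -norm.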
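The two-sided inequality behind this example can be checked numerically. The sketch below does this for a convex quadratic in the Euclidean norm, where $L$ is the largest eigenvalue of the Hessian; the test function and the random data are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B.T @ B                      # positive semidefinite Hessian
b = rng.standard_normal(5)
L = float(np.linalg.eigvalsh(A).max())  # Lipschitz constant of the gradient (2-norm)

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    return A @ x - b

def psi(x, y):
    """Linear model <grad f(y), x - y>."""
    return grad_f(y) @ (x - y)

def model_gap(x, y):
    """f(x) - f(y) - psi(x, y); should lie in [0, L/2 * ||x - y||^2]."""
    return f(x) - f(y) - psi(x, y)

pairs = [(rng.standard_normal(5), rng.standard_normal(5)) for _ in range(100)]
ok = all(
    -1e-9 <= model_gap(x, y) <= 0.5 * L * np.sum((x - y) ** 2) + 1e-9
    for x, y in pairs
)
```

For a quadratic the gap equals $\frac{1}{2}(x-y)^{\top}A(x-y)$, so both bounds follow from $0\preceq A\preceq L I$.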

Example 2.6.

Composite optimization, [Beck and Teboulle(2009), Nesterov(2013)]

Let us consider the composite convex optimization problem:

$$f(x):=g(x)+h(x)\to\min_{x\in Q},$$

where $g$ is a smooth convex function and the gradient of $g$ is Lipschitz continuous with parameter $L$ . The function $h$ is a simple convex function. One can show

$$0\leq f(x)-f(y)-\langle\nabla g(y),x-y\rangle-h(x)+h(y)\leq\frac{L}{2}\|x-y\|^{2},\quad\forall x,y\in Q.$$

Therefore

$$\psi_{\delta}(x,y)=\langle\nabla g(y),x-y\rangle+h(x)-h(y)$$

is a $\left(0,L\right)$ -model of $f$ with $f_{\delta}(y)=f(y)$ at a given point $y$ w.r.t. the $\|\cdot\|$ -norm.
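With the composite model, the auxiliary problem $\alpha\psi_{\delta}(x,y)+\frac{1}{2}\|x-y\|_{2}^{2}\to\min_{x}$ becomes a proximal gradient step. The sketch below assumes, for illustration only, $Q=\mathbb{R}^{n}$ and $h(x)=\lambda\|x\|_{1}$, in which case the minimizer is given by soft thresholding; `lam`, `alpha` and the function names are our own.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (componentwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_model_step(y, grad_g_y, alpha, lam):
    """Minimizer over R^n of
    alpha * (<grad g(y), x - y> + lam*||x||_1 - lam*||y||_1) + 0.5 * ||x - y||_2^2."""
    return soft_threshold(y - alpha * grad_g_y, alpha * lam)

def aux_objective(x, y, grad_g_y, alpha, lam):
    """Value of the auxiliary problem, used only to sanity-check the step."""
    psi = grad_g_y @ (x - y) + lam * (np.abs(x).sum() - np.abs(y).sum())
    return alpha * psi + 0.5 * np.sum((x - y) ** 2)

y = np.array([1.0, -2.0, 0.5])
c = np.array([2.0, -0.1, 0.4])
grad_g_y = y - c                 # gradient of g(x) = 0.5 * ||x - c||^2 at y
alpha, lam = 0.5, 0.3
x_next = composite_model_step(y, grad_g_y, alpha, lam)
```

The non-smooth term $h$ is kept exactly inside the model, so only $g$ needs a Lipschitz gradient.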

Example 2.7.

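Superposition of functions, [Nemirovskii and Nesterov(1985)]

Let us consider the following optimization problem [Lan(2015)]:

$$f(x):=g(g_{1}(x),\dots,g_{m}(x))\to\min_{x\in Q}$$

where each function $g_{k}(x)$ is a smooth convex function with $L_{k}$ -Lipschitz gradient w.r.t. the $\|\cdot\|$ -norm for all $k$ . The function $g(x)$ is an $M$ -Lipschitz convex function w.r.t. the 1-norm, non-decreasing in each of its arguments. From these assumptions we have ([Boyd and Vandenberghe(2004), Lan(2015)]) that the function $f(x)$ is also convex and the following inequality holds (see [Lan(2015)]):

$$0\leq f(x)-g\left(g_{1}(y)+\langle\nabla g_{1}(y),x-y\rangle,\dots,g_{m}(y)+\langle\nabla g_{m}(y),x-y\rangle\right)\leq M\frac{\sum_{i=1}^{m}L_{i}}{2}\|x-y\|^{2}\quad\forall x,y\in Q.$$

Also

$$0\leq f(x)-f(y)-g\left(g_{1}(y)+\langle\nabla g_{1}(y),x-y\rangle,\dots,g_{m}(y)+\langle\nabla g_{m}(y),x-y\rangle\right)+f(y)\leq M\frac{\sum_{i=1}^{m}L_{i}}{2}\|x-y\|^{2}\quad\forall x,y\in Q.$$

Therefore

$$\psi_{\delta}(x,y)=g\left(g_{1}(y)+\langle\nabla g_{1}(y),x-y\rangle,\dots,g_{m}(y)+\langle\nabla g_{m}(y),x-y\rangle\right)-f(y)$$

is a $\left(0,M\cdot\left(\sum_{i=1}^{m}L_{i}\right)\right)$ -model of $f$ with $f_{\delta}(y)=f(y)$ at a given point $y$ w.r.t. the $\|\cdot\|$ -norm. It should be noted that problems (9) and (13) can be more complicated than in the traditional case, when we solve a smooth convex optimization problem with Lipschitz gradient.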
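A concrete instance helps to see what this model looks like. The sketch below takes $g(u)=\max_i u_i$ (convex, non-decreasing, and 1-Lipschitz w.r.t. the 1-norm, so $M=1$) and quadratic inner functions $g_i(x)=\frac{w_i}{2}\|x-c_i\|_2^2$ with $L_i=w_i$. These choices, and the names `inner`, `centers`, `weights`, are ours and serve only to check the model inequality numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
centers = rng.standard_normal((m, n))
weights = np.array([1.0, 2.0, 0.5])   # L_i: Lipschitz constants of grad g_i

def inner(x):
    """Vector (g_1(x), ..., g_m(x)) with g_i(x) = 0.5 * w_i * ||x - c_i||^2."""
    return 0.5 * weights * np.sum((x - centers) ** 2, axis=1)

def inner_grads(x):
    """Matrix whose i-th row is grad g_i(x)."""
    return weights[:, None] * (x - centers)

def f(x):
    """f(x) = max_i g_i(x), i.e. the outer function is the maximum."""
    return float(np.max(inner(x)))

def psi(x, y):
    """Model: outer function applied to the linearized inner functions, minus f(y)."""
    linearized = inner(y) + inner_grads(y) @ (x - y)
    return float(np.max(linearized)) - f(y)

x_test, y_test = rng.standard_normal(n), rng.standard_normal(n)
gap = f(x_test) - f(y_test) - psi(x_test, y_test)
bound = 0.5 * weights.sum() * np.sum((x_test - y_test) ** 2)
```

Here `gap` should fall in `[0, bound]`: each $g_i$ dominates its linearization, and the maximum of the differences is at most $\frac{\max_i w_i}{2}\|x-y\|^2$.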

Example 2.8.

Proximal method, [Chen and Teboulle(1993)]

Let us consider optimization problem (1), where $f$ is an arbitrary convex function (not necessarily smooth). Then for arbitrary $L\geq 0$

$$\psi_{\delta}(x,y)=f(x)-f(y)$$

is a $(0,L)$ -model of $f$ with $f_{\delta}(y)=f(y)$ at a given point $y$ w.r.t. $V[y](x)$ , see Definition 2.1 and Remark 2.2. The gradient method (see Algorithm 1) with the proposed model is equivalent to the proximal method with a general Bregman divergence instead of the Euclidean one [Parikh and Boyd(2014)]. (Footnote 6: More precisely, when dealing with the proximal model (see also Remark A.4 and Examples A.5, A.6) it is worth using the non-adaptive algorithm, with fixed $L$ .) We discuss this model in more detail in Appendix A. In particular, based on this model (with the Bregman divergence being the Kullback–Leibler divergence) and Algorithm 1, we propose a proximal Sinkhorn algorithm for the Wasserstein distance calculation problem (see [Stonyakin et al.(2019)Stonyakin, Dvinskikh, Dvurechensky, Kroshnin, Kuznetsova, Agafonov, Gasnikov, Tyurin, Uribe, Pasechnyuk, and Artamonov]). We also explain what difficulties arise in an attempt to propose an accelerated method for this model: the complexity of the auxiliary problems grows with the iteration number. We therefore introduce another model and, based on it, construct an accelerated proximal method and show that the Catalyst approach [Lin et al.(2015)Lin, Mairal, and Harchaoui] for generic acceleration can be derived using this model.
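To illustrate the proximal model with a KL-type Bregman divergence, the sketch below performs one step $x^{+}=\arg\min_{x}\{f(x)+L\,V[y](x)\}$ on the probability simplex. It assumes, purely for the sake of a closed form, that $f(x)=\langle c,x\rangle$ is linear and that $V[y](x)$ is the Kullback–Leibler divergence; the step then becomes a multiplicative (softmax-like) update. It is not the proximal Sinkhorn algorithm of the paper, and the function names are ours.

```python
import numpy as np

def kl(x, y):
    """Kullback-Leibler divergence between points of the simplex interior."""
    return float(np.sum(x * np.log(x / y)))

def kl_prox_linear(c, y, L):
    """argmin over the simplex of <c, x> + L * KL(x || y).

    Stationarity of the Lagrangian gives x_i proportional to y_i * exp(-c_i / L).
    """
    logits = np.log(y) - c / L
    logits -= logits.max()          # stabilise the exponentials
    w = np.exp(logits)
    return w / w.sum()

c = np.array([1.0, 0.2, -0.5])
y = np.array([0.2, 0.5, 0.3])
L = 2.0
x_next = kl_prox_linear(c, y, L)
```

A larger $L$ keeps the new point closer to $y$; as $L$ decreases, the mass concentrates on the coordinates with the smallest cost $c_i$.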

Example 2.9.

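Min-min problem

Consider the optimization problem:

$$f(x):=\min_{z\in Q}F(z,x)\to\min_{x\in\mathbb{R}^{n}}.$$

The set $Q$ is convex and bounded. The function $F$ is smooth and convex w.r.t. all variables. Moreover,

$$\|\nabla F(z^{\prime},x^{\prime})-\nabla F(z,x)\|_{2}\leq L\|(z^{\prime},x^{\prime})-(z,x)\|_{2},\quad\forall z,z^{\prime}\in Q,\ x,x^{\prime}\in\mathbb{R}^{n}.$$

If we can find a point $\widetilde{z}_{\delta}(x)\in Q$ such that

$$\langle\nabla_{z}F(\widetilde{z}_{\delta}(x),x),z-\widetilde{z}_{\delta}(x)\rangle\geq-\delta,\quad\forall z\in Q,$$

then $F(\widetilde{z}_{\delta}(x),x)-f(x)\leq\delta$ , $\left\lVert\nabla f(x^{\prime})-\nabla f(x)\right\rVert_{2}\leq L\left\lVert x^{\prime}-x\right\rVert_{2}$ and

$$\psi_{\delta}(x,y)=\langle\nabla_{z}F(\widetilde{z}_{\delta}(y),y),x-y\rangle$$

is a $(6\delta,2L)$ -model of $f$ with $f_{\delta}(y)=F(\widetilde{z}_{\delta}(y),y)-2\delta$ at a given point $y$ w.r.t. the 2-norm.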

2.2 Gradient Method with Inexact Model

In this section we consider a simple non-accelerated method for optimization problems with $(\delta,L)$ -model. This method is a variant of the standard gradient method [Polyak(1987)] with adaptive tuning to the Lipschitz constant of the gradient of the objective function [Nesterov(2013)].

We assume that at each iteration $k$ the method has access to a $(\delta_{k},L)$ -model of $f$ w.r.t. $V[y](x)$ (see Definition 2.1). Depending on the problem, $\delta_{k}$ can be zero, a constant, or change from iteration to iteration.

Theorem 2.10.

Let $V[x_{0}](x_{*})\leq R^{2}$ , where $x_{0}$ is the starting point and $x_{*}$ is the minimum point nearest to $x_{0}$ in the sense of Bregman divergence (see Remark 2.2). Then for the sequence generated by Algorithm 1 the following holds:

$$f(\bar{x}_{N})-f(x_{*})\leq\frac{R^{2}}{A_{N}}+\frac{2}{A_{N}}\sum_{k=0}^{N-1}\alpha_{k+1}\delta_{k}\leq\frac{2LR^{2}}{N}+\frac{2}{A_{N}}\sum_{k=0}^{N-1}\alpha_{k+1}\delta_{k},$$

where $A_{N}\geq\frac{N}{2L}$ . Moreover, the total number of attempts to solve (9) is bounded by $2N+\log_{2}\frac{L}{L_{0}}$ .
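A minimal sketch of an adaptive gradient method of this kind is given below. It is not a transcription of Algorithm 1: it assumes the Euclidean set-up $V[y](x)=\frac{1}{2}\|x-y\|_{2}^{2}$, $Q=\mathbb{R}^{n}$, the linear model $\psi(x,y)=\langle\nabla f(y),x-y\rangle$, a constant $\delta$, step sizes $\alpha_{k+1}=1/L_{k+1}$, a halve-then-double search over $L_{k+1}$, and an output $\bar{x}_{N}$ equal to the $\alpha$-weighted average of the iterates. The acceptance test is the inequality $f(x_{k+1})\leq f(x_{k})+\psi(x_{k+1},x_{k})+L_{k+1}V[x_{k}](x_{k+1})+\delta$ from this paper's equation list; everything else, including the function name, is our assumption.

```python
import numpy as np

def adaptive_gradient_method(f, grad_f, x0, L0, n_iters, delta=0.0):
    """Gradient method with backtracking on L (Euclidean, unconstrained sketch)."""
    x = np.asarray(x0, dtype=float)
    L = float(L0)
    A = 0.0                              # running sum of step sizes alpha_k
    weighted_sum = np.zeros_like(x)
    for _ in range(n_iters):
        L = L / 2.0                      # optimistic guess: try a smaller constant first
        g = grad_f(x)
        fx = f(x)
        while True:
            alpha = 1.0 / L
            x_new = x - alpha * g        # argmin of V[x_k](x) + alpha * psi(x, x_k)
            step = x_new - x
            upper = fx + g @ step + L * 0.5 * (step @ step) + delta
            if f(x_new) <= upper:        # model-based acceptance test
                break
            L *= 2.0                     # guess was too small: double and retry
        A += alpha
        weighted_sum += alpha * x_new
        x = x_new
    return weighted_sum / A, L

a = np.array([1.0, 10.0])
quad = lambda x: 0.5 * np.sum(a * x * x)      # minimum value 0 at the origin
quad_grad = lambda x: a * x
x_bar, L_last = adaptive_gradient_method(quad, quad_grad, np.array([1.0, 1.0]), L0=1.0, n_iters=50)
```

The search never needs the true Lipschitz constant: any trial value at least as large as it passes the test, so accepted values stay below twice that constant.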

Remark 2.11.

If the $(\delta_{k},L)$ -model of $f$ is given w.r.t. $\|\cdot\|$ -norm (see Remark 2.3), then the chosen $V[y](x)$ in Algorithm 1 and Theorem 2.10 has to satisfy (1-SC) condition w.r.t. this norm (see Remarks 2.2, 2.3).

2.3 Fast Gradient Method with Inexact Model

In this section we consider an accelerated method for problems with a $(\delta,L)$ -model. The method is close to the accelerated mirror-descent-type methods of [Tseng(2008), Lan(2012), Dvurechensky et al.(2018a)Dvurechensky, Gasnikov, and Kroshnin]. At each iteration, the inexact model is used to make a mirror-descent-type step. In this section, we assume that the $(\delta_{k},L)$ -model of $f$ is given w.r.t. the $\|\cdot\|$ -norm and that $V[u](x)$ satisfies the (1-SC) condition w.r.t. this norm (see Remarks 2.2, 2.3, 2.11).

Theorem 2.12.

Let $V[x_{0}](x_{*})\leq R^{2}$ , where $x_{0}$ is the starting point and $x_{*}$ is the minimum point nearest to $x_{0}$ in the sense of Bregman divergence. Then, for the sequence generated by Algorithm 2,

[TABLE]

where $A_{k}\geq\frac{(k+1)^{2}}{8L}$ . Moreover, the total number of attempts to solve (13) is bounded by $4N+\log_{2}\frac{L}{L_{0}}$ .
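The accelerated scheme can be sketched in the same spirit. The updates $y_{k+1}=\frac{\alpha_{k+1}u_{k}+A_{k}x_{k}}{A_{k+1}}$, $u_{k+1}=\arg\min_{x}\{V[u_{k}](x)+\alpha_{k+1}\psi(x,y_{k+1})\}$, $x_{k+1}=\frac{\alpha_{k+1}u_{k+1}+A_{k}x_{k}}{A_{k+1}}$ and the acceptance test are taken from this paper's equation list. The rest is our assumption and should not be read as Algorithm 2 itself: Euclidean prox, $Q=\mathbb{R}^{n}$, linear model, constant $\delta$, the rule that $\alpha_{k+1}$ is the largest root of $A_{k}+\alpha_{k+1}=L_{k+1}\alpha_{k+1}^{2}$, and the halve-then-double search over $L_{k+1}$.

```python
import numpy as np

def fast_gradient_method(f, grad_f, x0, L0, n_iters, delta=0.0):
    """Accelerated method with one mirror-descent-type step per trial (sketch)."""
    x = np.asarray(x0, dtype=float)
    u = x.copy()
    L = float(L0)
    A = 0.0
    for _ in range(n_iters):
        L = L / 2.0
        while True:
            # Largest root of L * alpha^2 - alpha - A = 0 (assumed step-size rule).
            alpha = (1.0 + np.sqrt(1.0 + 4.0 * L * A)) / (2.0 * L)
            A_new = A + alpha
            y = (alpha * u + A * x) / A_new
            g = grad_f(y)
            u_new = u - alpha * g        # argmin of V[u_k](x) + alpha * <g, x - y>
            x_new = (alpha * u_new + A * x) / A_new
            step = x_new - y
            upper = f(y) + g @ step + 0.5 * L * (step @ step) + delta
            if f(x_new) <= upper:        # model-based acceptance test
                break
            L *= 2.0
        x, u, A = x_new, u_new, A_new
    return x, A

a = np.array([1.0, 10.0])
quad = lambda x: 0.5 * np.sum(a * x * x)      # minimum value 0 at the origin
quad_grad = lambda x: a * x
x_N, A_N = fast_gradient_method(quad, quad_grad, np.array([1.0, 1.0]), L0=1.0, n_iters=30)
```

Only one auxiliary minimization (the update of `u`) is solved per trial, which is the feature the text singles out as making this variant suitable for general models.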

3 Inexact Model for Variational Inequalities

In this section, we go beyond minimization problems and propose an abstract inexact model counterpart for variational inequalities. Using this model we propose a new universal method for variational inequalities with complexity $O\left(\inf_{\nu\in[0,1]}\left(\frac{1}{\varepsilon}\right)^{\frac{2}{1+\nu}}\right)$ , where $\varepsilon$ is the desired accuracy of the solution. According to the lower bounds in [Ouyang and Xu(2018)], this algorithm is optimal for $\nu=0$ and $\nu=1$ . Based on the model for VI and functions, we extend $(\delta,L)$ -model for saddle-point problems (see Appendix F). We are also motivated by mixed variational inequalities [I. V. Konnov(2017), T. Q. Bao(2006)] and composite saddle-point problems [Chambolle and Pock(2011)].
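For orientation, the exponent $\frac{2}{1+\nu}$ in this complexity bound can be evaluated at a few values of $\nu$. The display below only substitutes numbers into the stated formula; the sample accuracy $\varepsilon=10^{-2}$ is our choice.

```latex
\begin{align*}
\nu = 1:\quad & \left(\tfrac{1}{\varepsilon}\right)^{2/(1+1)} = \varepsilon^{-1}
  && (\varepsilon = 10^{-2}:\ 10^{2}),\\
\nu = \tfrac{1}{2}:\quad & \left(\tfrac{1}{\varepsilon}\right)^{4/3}
  && (\varepsilon = 10^{-2}:\ \approx 4.6\cdot 10^{2}),\\
\nu = 0:\quad & \left(\tfrac{1}{\varepsilon}\right)^{2} = \varepsilon^{-2}
  && (\varepsilon = 10^{-2}:\ 10^{4}).
\end{align*}
```

Since the bound takes the infimum over $\nu\in[0,1]$, the method is charged according to the most favourable smoothness level that the operator actually satisfies.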

We consider the problem of finding the solution $x_{*}\in Q$ for VI in the following abstract form

[TABLE]

for some convex compact set $Q\subset\mathbb{R}^{n}$ and some function $\psi:Q\times Q\rightarrow\mathbb{R}$ . Assuming the abstract monotonicity of the function $\psi$

[TABLE]

any solution of (15) is also a solution of the following inequality

[TABLE]

In the general case, we make an assumption about the existence of a solution $x_{*}$ of the problem (15). As a particular case, if for some operator $g:Q\rightarrow\mathbb{R}^{n}$ we set $\psi(x,y)=\langle g(y),x-y\rangle\;\;\forall x,y\in Q$ , then (15) and (17) are equivalent, respectively, to a standard strong and weak variational inequality with the operator $g$ .

Example 3.1.

For some operator $g:Q\rightarrow\mathbb{R}^{n}$ and a convex functional $h:Q\rightarrow\mathbb{R}$ , the choice

[TABLE]

leads to a mixed variational inequality from [I. V. Konnov(2017), T. Q. Bao(2006)]

[TABLE]

which in the case of the monotonicity of the operator $g$ implies

[TABLE]

We propose an adaptive proximal method for the problems (15) and (17). We start with a concept of $(\delta,L)$ -model for such problems.

Definition 3.2.

We say that functional $\psi$ has $(\delta,L)$ -model $\psi_{\delta}(x,y)$ at a given point $y$ w.r.t. $V[y](x)$ if the following properties hold for each $x,y,z\in Q$ :

(i) $\psi_{\delta}(x,y)$ is convex in the first variable;

(ii) $\psi_{\delta}(x,x)=0$ ;

(iii) (abstract $\delta$ -monotonicity)

[TABLE]

(iv) (generalized relative smoothness)

[TABLE]

for some fixed values $L>0$ , $\delta>0$ .

Remark 3.3.

Similarly to Definition 2.1 above, in general case, we do not need the (1-SC) assumption in Definition 3.2 for $V[y](x)$ . In some situations we assume that (1-SC) assumption holds (see Examples 3.5, 3.6 and Appendix G).

Remark 3.4.

In Definition 3.2 we change ‘w.r.t. $V[y](x)$ ’ to ‘w.r.t. $\|\cdot\|$ -norm’ if we use $\frac{1}{2}\|x-y\|^{2}$ instead of $V[y](x)$ .

Note that for $\delta=0$ the following analogue of (22) for some fixed $a,b>0$

[TABLE]

was introduced in [Mastroeni(2000)]. Condition (23) is used in many works on equilibrium programming. Our approach allows us to work in the non-Euclidean set-up without the (1-SC) assumption and with inexactness $\delta$ , which is important for the ideology of universal methods [Nesterov(2015)] (see Example 3.6 below).

One can directly verify that if $\psi_{\delta}(x,y)$ is $(\delta/5,L)$ -model of the function $f$ at a given point $y$ w.r.t. $V[y](x)$ then $\psi_{\delta}(x,y)$ is $(\delta,L)$ -model in the sense of Definition 3.2 w.r.t. $V[y](x)$ .

Let us consider some examples.

Example 3.5.

Variational inequalities with monotone Lipschitz continuous operator. Consider the variational inequality of finding $x\in Q$ such that $\langle g(y),x-y\rangle\leq 0$ , $\forall y\in Q$ , where the operator $g:Q\rightarrow\mathbb{R}^{n}$ is monotone and Lipschitz continuous, i.e. $\left\lVert g(x)-g(y)\right\rVert_{*}\leq L\left\lVert x-y\right\rVert,\,\,\,\forall x,y\in Q.$ In this case $\psi_{\delta}(x,y):=\langle g(y),x-y\rangle$ ( $\forall x,y\in Q$ ) is a $(\delta,L)$ -model in the sense of Definition 3.2 w.r.t. the $\|\cdot\|$ -norm.
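The reason a Lipschitz operator fits a quadratic smoothness bound can be seen from a short standard chain of inequalities, written out here as our own filling-in: the definition of the dual norm, the Lipschitz condition, and $ab\leq\frac{a^{2}}{2}+\frac{b^{2}}{2}$, for arbitrary $x,y,z\in Q$.

```latex
\begin{align*}
\langle g(z)-g(y),\,z-x\rangle
 &\le \|g(z)-g(y)\|_{*}\,\|z-x\| \\
 &\le L\,\|z-y\|\,\|z-x\| \\
 &\le \frac{L}{2}\|z-y\|^{2}+\frac{L}{2}\|z-x\|^{2}.
\end{align*}
```

No additive error term appears in this chain, so in the Lipschitz case the inexactness $\delta$ is not needed to absorb any slack.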

Example 3.6.

Variational inequalities with monotone Hölder continuous operator.

Assume that for a monotone operator $g$ there exists $\nu\in[0,1]$ such that

[TABLE]

Then we have: $\langle g(z)-g(y),z-x\rangle\leq\|g(z)-g(y)\|_{*}\|z-x\|\leq$

[TABLE]

for

[TABLE]

and uncontrolled parameter $\delta>0$ . In this case the following function

[TABLE]

is a $(\delta,L)$ -model w.r.t. the $\|\cdot\|$ -norm.

Note that for the previous two examples in Algorithm 3 and Theorem 3.7 we need $V[z](x)$ to satisfy (1-SC) condition.

Next, we introduce our novel adaptive method (Algorithm 3) for abstract variational inequalities with an inexact $(\delta,L)$ -model w.r.t. $V[y](x)$ . (Footnote 7: Here we assume that $\delta$ does not change over the iterations. Earlier (e.g. in Section 2.3) we allowed $\delta$ to change in order to make it possible to build a universal fast gradient method, see Example A.7. For non-accelerated methods this is not necessary. In Section 2.2 we let $\delta$ change with the iteration only for convenience of comparing the results of Sections 2.2 and 2.3.) If $V[y](x)$ satisfies the (1-SC) condition, then we can consider an inexact $(\delta,L)$ -model w.r.t. the $\|\cdot\|$ -norm. This method adapts to the local values of $L$ and, similarly to [Nesterov(2015)], allows us to construct a universal method for variational inequalities. Applying the adaptive Algorithm 3 to VI with Hölder interpolation (25) for $\delta=\frac{\varepsilon}{2}$ and $L=L\left(\frac{\varepsilon}{2}\right)$ leads to a universal method for VI.
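To give a feel for an adaptive proximal method of this type, the sketch below implements a generic adaptive extragradient (Mirror Prox style) iteration for the operator model $\psi(x,y)=\langle g(y),x-y\rangle$. It is our own illustration, not a transcription of Algorithm 3: the Euclidean set-up, the projection onto a box, the halve-then-double search over $L$, the acceptance test $\langle g(w)-g(x_{k}),w-x_{k+1}\rangle\leq L\left(\frac{1}{2}\|w-x_{k}\|^{2}+\frac{1}{2}\|w-x_{k+1}\|^{2}\right)+\delta$, and the $1/L$-weighted averaging of the intermediate points are all assumptions.

```python
import numpy as np

def adaptive_extragradient(g, project, x0, L0, n_iters, delta=0.0):
    """Adaptive extragradient sketch for a monotone operator g on a convex set."""
    x = np.asarray(x0, dtype=float)
    L = float(L0)
    total_weight = 0.0
    weighted_sum = np.zeros_like(x)
    for _ in range(n_iters):
        L = L / 2.0                          # try a smaller constant first
        gx = g(x)
        while True:
            w = project(x - gx / L)          # extrapolation step from x_k
            gw = g(w)
            x_new = project(x - gw / L)      # update step, again from x_k
            lhs = (gw - gx) @ (w - x_new)
            rhs = L * (0.5 * np.sum((w - x) ** 2)
                       + 0.5 * np.sum((w - x_new) ** 2)) + delta
            if lhs <= rhs:                   # smoothness-type acceptance test
                break
            L *= 2.0
        total_weight += 1.0 / L
        weighted_sum += w / L
        x = x_new
    return weighted_sum / total_weight, L

# Bilinear saddle point min_u max_v u*v on the box [-1, 1]^2; its operator is
# g(u, v) = (v, -u), which is monotone and 1-Lipschitz, with solution (0, 0).
J = np.array([[0.0, 1.0], [-1.0, 0.0]])
saddle_op = lambda z: J @ z
box = lambda z: np.clip(z, -1.0, 1.0)
z_avg, L_final = adaptive_extragradient(
    saddle_op, box, np.array([0.5, -0.3]), L0=1.0, n_iters=200, delta=1e-9
)
```

The returned point is the weighted average of the intermediate points `w`, which is the kind of output that guarantees for Mirror Prox type methods usually refer to.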

For a given accuracy $\varepsilon$ we can consider the following stopping criterion for Algorithm 3:

[TABLE]

Let us formulate the following result.

Theorem 3.7.

For Algorithm 3 the following inequalities hold

[TABLE]

Proof 3.8.

After $(k+1)$ -th iteration ( $k=0,1,2,\ldots$ ) we have for each $u\in Q$ :

[TABLE]

and

[TABLE]

Taking into account (28), we obtain

[TABLE]

So, the following inequality

[TABLE]

holds. By virtue of (22) and the choice of $L_{0}\leqslant 2L$ , it is guaranteed that

[TABLE]

and we have

[TABLE]

Remark 3.9.

To obtain precision $\varepsilon+\delta$ , Algorithm 3 requires no more than

[TABLE]

iterations. Note that estimate (32) is optimal for variational inequalities and saddle-point problems [Ouyang and Xu(2018)].

For the universal method, to obtain precision $\varepsilon$ we can choose $\delta=\frac{\varepsilon}{2}$ and $L=L\left(\frac{\varepsilon}{2}\right)$ according to (25) and (26), and the estimate (32) reduces to

[TABLE]

Note that similarly to Algorithms 1 and 2, the total number of attempts to solve (29) and (30) is bounded by $4N+\log_{2}\frac{L}{L_{0}}$ .
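To make the adaptive scheme concrete, here is a minimal sketch in the Euclidean setup, assuming the linear model $\psi_{\delta}(x,y)=\langle g(y),x-y\rangle$ and $V[y](x)=\frac{1}{2}\|x-y\|_{2}^{2}$. It is an illustration in the spirit of Algorithm 3, not a transcription of it: the function name `adaptive_extragradient`, the test problem, and the exact form of the backtracking test are assumptions. Each iteration first halves $L$, then doubles it until the smoothness-type inequality holds, and the output is the average of the extrapolation points weighted by $1/L_{k+1}$.

```python
import numpy as np

def adaptive_extragradient(g, project, z0, L0, delta, n_iter):
    """Euclidean extragradient with backtracking on L (a sketch, not the paper's pseudocode)."""
    z = np.asarray(z0, dtype=float)
    L = L0
    weighted_sum = np.zeros_like(z)
    S = 0.0  # S_N = sum of 1 / L_{k+1}
    for _ in range(n_iter):
        L = L / 2.0  # optimistic guess: try a smaller constant first
        while True:
            w = project(z - g(z) / L)        # extrapolation step
            z_next = project(z - g(w) / L)   # main step
            lhs = np.dot(g(w) - g(z), w - z_next)
            rhs = 0.5 * L * np.sum((w - z) ** 2) + 0.5 * L * np.sum((z_next - w) ** 2) + delta
            if lhs <= rhs:
                break
            L *= 2.0  # the local constant was underestimated
        weighted_sum += w / L
        S += 1.0 / L
        z = z_next
    return weighted_sum / S, S

# Bilinear saddle point min_u max_v u*v on the box [-1, 1]^2, so g(u, v) = (v, -u).
M = np.array([[0.0, 1.0], [-1.0, 0.0]])
g = lambda x: M @ x
project = lambda x: np.clip(x, -1.0, 1.0)
delta = 1e-3
w_hat, S = adaptive_extragradient(g, project, z0=[1.0, 1.0], L0=1.0, delta=delta, n_iter=200)
# For this skew-symmetric operator, max over the box of <g(u), w_hat - u> is an l1-norm.
gap = np.sum(np.abs(M.T @ w_hat))
```

In this Euclidean setting the standard extragradient analysis gives $\max_{u\in Q}\langle g(u),\widehat{w}-u\rangle\leq\frac{\max_{u\in Q}\frac{1}{2}\|u-z_{0}\|_{2}^{2}}{S_{N}}+\delta$, and for the box and starting point above the numerator equals 4.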

Thus, the introduced concept of the function model for variational inequalities allows us to extend the previously proposed universal method for VI to a wider class of problems, including mixed variational inequalities [I. V. Konnov(2017), T. Q. Bao(2006)] and composite saddle-point problems [Chambolle and Pock(2011)]. We extend $(\delta,L)$ -model for saddle-point problems in Appendix F further.

4 Concluding remarks

Firstly, note that for all considered methods we may also take into account inexactness for auxiliary problems using the following

Definition 4.1.

For a convex optimization problem

[TABLE]

we denote by $\text{Arg}\min_{x\in Q}^{\widetilde{\delta}}\Psi(x)$ a collection of $\widetilde{x}$ :

[TABLE]

Let us denote by $\arg\min_{x\in Q}^{\widetilde{\delta}}\Psi(x)$ some element of $\text{Arg}\min_{x\in Q}^{\widetilde{\delta}}\Psi(x)$ .
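As an illustration of Definition 4.1, the sketch below assumes that the defining condition has the usual form: there exists $h\in\partial\Psi(\widetilde{x})$ with $\langle h,x-\widetilde{x}\rangle\geq-\widetilde{\delta}$ for all $x\in Q$. This is an assumption about the displayed formula. For a box $Q$ and a fixed $h$, the smallest admissible $\widetilde{\delta}$ is available in closed form; the helper name `inexactness_on_box` and the test function are hypothetical.

```python
import numpy as np

def inexactness_on_box(h, x_tilde, lower, upper):
    """Smallest delta with <h, x - x_tilde> >= -delta for all x in the box [lower, upper]."""
    h = np.asarray(h, dtype=float)
    # The linear function <h, x> is minimized coordinate-wise at a vertex of the box.
    x_min = np.where(h > 0, lower, upper)
    return float(max(0.0, np.dot(h, x_tilde - x_min)))

# Psi(x) = 0.5 * ||x - c||^2 on the box [0, 1]^2, with gradient x - c.
c = np.array([0.3, 1.5])
lower, upper = np.zeros(2), np.ones(2)
x_star = np.clip(c, lower, upper)            # exact minimizer
x_near = x_star + np.array([0.01, -0.02])    # slightly perturbed feasible point
delta_star = inexactness_on_box(x_star - c, x_star, lower, upper)
delta_near = inexactness_on_box(x_near - c, x_near, lower, upper)
```

The exact minimizer gives $\widetilde{\delta}=0$, while the perturbed point gives a small positive value, which is the quantity that enters the convergence estimates as an additive term.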

Note that if $\Psi(x)$ is $\mu$-strongly convex, has an $L$-Lipschitz continuous gradient in the $\|\cdot\|$ norm (Footnote 8: more precisely, $L=\max_{\|h\|\leq 1,x\in[\widetilde{x},x_{*}]}\langle h,\nabla^{2}\Psi(x)h\rangle$) and $R=\max_{x,y\in Q}\|x-y\|$, then $\Psi(\widetilde{x})-\Psi(x_{*})\leq\widetilde{\epsilon}$ entails that [Stonyakin et al.(2019)]

[TABLE]

where $x_{*}=\operatorname*{argmin}_{x\in Q}\Psi(x)$ . If one can guarantee that $\nabla\Psi(x_{*})=0$ , then (36) can be improved

[TABLE]

Clearly, for the case $\widetilde{\delta}=0$ equation (35) means that $\widetilde{x}$ is an exact solution of (34). In Appendices B, C, D, G we show that inexactness for the auxiliary problems (9), (13), (29), (30) according to Definition 4.1 changes the estimates of the rate of convergence in all the methods by no more than an additive term $O(\widetilde{\delta})$, e.g. see (2) for problem (1). Similarly, in Appendices E and F for variational inequalities (VI) with a monotone Lipschitz continuous operator we obtain

[TABLE]

and for convex-concave saddle-point problems of finding $\min_{u\in Q_{1}}\max_{v\in Q_{2}}f(u,v)$ we have

[TABLE]

Secondly, note that in the case of $\mu$ -strongly convex objective (model) the estimates for the proposed minimization methods can be improved. In the same way, this also applies to the method for (VI) in the case of the strong monotonicity of the operator (model). Details are described in appendices D and G. In all the cases by restart procedure from (2), one can obtain a linear rate of convergence, e.g. for problem (1) we get the following improved variant of (2) ( $\Delta f=f(x^{0})-f(x_{*})$ ):

[TABLE]

where $p=1$ for GM and $p=2$ for restarted FGM.

Finally, all the methods considered in this paper have universal (see [Nesterov(2015)]) extensions which allow one to solve smooth and non-smooth problems without prior knowledge of the smoothness level of the problem (Example A.7).

This paper is a full English version of our results, which were originally written in Russian [Gasnikov(2017), Tyurin and Gasnikov(2017)]. In this paper we also add new results concerning the ‘model’ generalization of VI and the generalization of all the results to the strongly convex case (in [Gasnikov(2017)] such a possibility was only announced). We also add some examples.

Acknowledgments

The authors are grateful to Yurii Nesterov for fruitful discussions.

The work of F. Stonyakin on the model of a vector field and Universal Mirror Prox for this field was supported by the Russian Science Foundation according to the research project 18-71-00048, the work of A. Gasnikov on the conception of the model of a function at a given point and GM in the relative smoothness context was supported by RFBR 18-31-20005 mol_a_ved, the work of P. Dvurechensky on the literature survey and general structure of the paper was supported by RFBR 18-29-03071 mk, the work of A. Tyurin on the model’s FGM was prepared within the framework of the HSE University Basic Research Program and funded by the Russian Academic Excellence Project ’5-100’, the work of D. Pasechnyuk and D. Dvinskikh on the proximal Sinkhorn method was carried out in July 2018 in Sirius (Sochi).

Appendix A Model examples

In this appendix we present different examples of a $(\delta,L)$ -model of objective $f$ .

Example A.1.

Saddle point problem, [Devolder et al.(2014)]

Let us consider

[TABLE]

where $\phi(z)$ is a $\mu$-strongly convex function w.r.t. the $p$-norm ($1\leq p\leq 2$). Then $f$ is a smooth convex function and the gradient of $f$ is Lipschitz continuous with parameter

[TABLE]

If $z_{\delta}(y)\in Q$ is a solution of auxiliary max-problem in the following sense

[TABLE]

then

[TABLE]

is $(\delta,2L)$ -model of $f$ with

[TABLE]

at the point $y$ w.r.t 2-norm.

Example A.2.

Augmented Lagrangians, [Devolder et al.(2014)]

Let us consider

[TABLE]

and its dual problem

[TABLE]

If $z_{\delta}(y)$ is a solution of auxiliary max-problem in the following sense

[TABLE]

then

[TABLE]

is $(\delta,\mu^{-1})$ -model of $f$ with

[TABLE]

at the point $y$ w.r.t 2-norm.

Example A.3.

Moreau envelope of target function, [Devolder et al.(2014)]

Let us consider optimization problem:

[TABLE]

Assume that function $f$ is a convex function and

[TABLE]

Then

[TABLE]

is $(\delta,L)$ -model of $f$ with

[TABLE]

at the point $y$ w.r.t 2-norm.
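As a small illustration of the Moreau envelope, the sketch below takes $Q=\mathbb{R}$ and $f(z)=|z|$, for which the inner minimization in $f_{L}(x)=\min_{z}\{f(z)+\frac{L}{2}(z-x)^{2}\}$ has a closed form (soft-thresholding). The choice of $f$ and the helper names are assumptions made for the example; the inner problem is solved exactly here, so the inexactness $\delta$ is zero.

```python
import numpy as np

def prox_abs(x, L):
    """argmin_z |z| + (L/2) * (z - x)^2, i.e. soft-thresholding with threshold 1/L."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / L, 0.0)

def moreau_envelope_abs(x, L):
    """Value and gradient of f_L(x) = min_z |z| + (L/2) * (z - x)^2."""
    z = prox_abs(x, L)
    value = np.abs(z) + 0.5 * L * (z - x) ** 2
    grad = L * (x - z)  # gradient of the envelope; it is L-Lipschitz
    return value, grad

L = 4.0
xs = np.linspace(-2.0, 2.0, 9)
values, grads = moreau_envelope_abs(xs, L)
# Closed form of the envelope of |x|: the Huber function.
huber = np.where(np.abs(xs) <= 1.0 / L, 0.5 * L * xs**2, np.abs(xs) - 0.5 / L)
```

Although $f$ itself is non-smooth, the envelope has an $L$-Lipschitz gradient, which is what makes it usable as a smooth surrogate.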

Remark A.4.

In the paper [Lin et al.(2015)] the authors propose a generic acceleration scheme (Catalyst) for a large class of optimization problems. They replace the function $f$ from optimization problem (1) with a better-behaved function (the Moreau envelope of $f$, see Example A.3) and apply an accelerated proximal method. In our approach with a $(\delta,L)$-model we could try to use the proximal model from Example 2.8. However, due to the linear growth $\alpha_{k}\sim k$ in a fast gradient method, our auxiliary optimization problems would be ill-conditioned. We can overcome this problem using a different approach, which combines naturally with the $(\delta,L)$-model concept. In Example A.5 we demonstrate this approach, which relies heavily on Example A.3.

Example A.5.

Catalyst acceleration, [Lin et al.(2015)]

Let us assume that the function $f$ is $\mu_{f}$-strongly convex with an $L_{f}$-Lipschitz gradient w.r.t. the 2-norm. Let us replace optimization problem (1) with problem (38). This replacement gives us the following:

1. There is a ‘closed-form’ solution of the auxiliary optimization problems (9) and (13). For instance, using the $(\delta,L)$-model from Example A.3 we can show for the auxiliary optimization problem from (13) that (assuming that $V[u_{k}](x)=\frac{1}{2}\|x-y_{k}\|_{2}^{2}$)

[TABLE]

is equivalent to

[TABLE]

2. In order to find $z_{L}(y_{k+1})$ we should solve a ‘new’ auxiliary optimization problem

[TABLE]

which is well-defined, with a $(\mu_{f}+L)$-strongly convex objective that has an $(L_{f}+L)$-Lipschitz gradient w.r.t. the 2-norm.

Philosophically this approach is very close to the approach from [Lin et al.(2015)]. The problem is that instead of the function $f$ we minimize the function $f_{L}$. However, we can use the strong convexity of $f$ to get around this. For simplicity, let us take $Q={\mathbb{R}}^{n}$. It can be shown [Polyak(1987)] that

[TABLE]

where $x_{*}$ is an optimal solution of optimization problem (38). Using the fact ([Lemarechal C.(1997)]) that function $f_{L}$ has strong convexity parameter equal to

[TABLE]

we can show that

[TABLE]

We should also note that the function $f_{L}$ has an $L$-Lipschitz gradient; we will need this below. We obtain that an $\varepsilon$-solution of optimization problem (38) is an $\varepsilon$-solution of optimization problem (1) with the same accuracy up to a constant multiplier:

[TABLE]

Let us assume that we solve the auxiliary optimization problem with a non-accelerated gradient method for strongly convex functions (e.g. the standard gradient method) with accuracy $O(\varepsilon^{2})$, where $\varepsilon$ is the desired relative accuracy in function value for the original problem. As the external optimization method we can take FGM for smooth $\mu$-strongly convex functions with $L$-Lipschitz gradient (Footnote 9: restarted Algorithm 2 (see Appendix D) in the model environment of Example A.3). We know that for this method the number of steps is $O(\sqrt{L/\mu}\ln(1/\varepsilon))$ (this follows from Example A.3). The total number of gradient calculations equals the number of steps of the external optimization method multiplied by the number of steps of the non-accelerated gradient method. Therefore, the total number of gradient calculations equals

[TABLE]

where constant $L$ is a free parameter. Let us take $L=L_{f}$ . Using (40) we have that the total number of gradient calculations equals to

[TABLE]

This means that we have an accelerated convergence rate for optimization problem (1). In general, this approach, based on Example A.3, allows one to accelerate various non-accelerated methods.

Example A.6.

Proximal Sinkhorn method

Optimal transport (OT) [Monge(1781), Kantorovich(1942)] is currently attracting increasing attention in the statistics and machine learning communities [Bigot et al.(2012), Del Barrio et al.(2015), Ebert et al.(2017), Le Gouic and Loubes(2017), Arjovsky et al.(2017), Solomon et al.(2014)]. The most popular approach is entropic regularization and application of Sinkhorn’s algorithm [Cuturi(2013)]. As shown in [Gasnikov et al.(2015), Altschuler et al.(2017)], the regularization parameter needs to be chosen small. This can lead to instability of the algorithm. The situation is somewhat better for accelerated gradient descent [Dvurechensky et al.(2018a)], but this method can be slow for a small regularization parameter.

We show how our framework can be used to construct an alternative which does not require using Sinkhorn’s method with a small regularization parameter. (Footnote 10: After we finished our derivations, we found that a close idea was considered in [Xie et al.(2018)]. Moreover, as far as we know, in practice the KL-proximal envelope of Sinkhorn’s algorithm was used even earlier (M. Cuturi, G. Peyré, private communication in Les Houches, 2016).)

Optimal transport problem for calculating the Monge–Kantorovich–Wasserstein distance (MKW-distance) for discrete measures $l,w$ from the standard unit simplex is a linear programming (LP) problem

[TABLE]

where $\sum\limits_{i=1}^{n}l_{i}=\sum\limits_{j=1}^{n}w_{j}=1$ . We consider non-accelerated proximal-method with Bregman divergence $V[y](x)=\sum\limits_{i,j=1}^{n}x_{ij}\ln(x_{ij}/y_{ij})$ (see non adaptive variant of Algorithm 1 and Example 2.8). The step of this method reads as

[TABLE]

This $k$-th auxiliary minimization problem is exactly the one that is usually solved by Sinkhorn’s algorithm. The idea of the method is alternating minimization for the dual problem [Cuturi(2013)]. The complexity of this method is [Franklin and Lorenz(1989), Beck(2015), Dvurechensky et al.(2018a), Stonyakin et al.(2019)]

[TABLE]

where

[TABLE]

and $\tilde{\varepsilon}$ is a relative accuracy (by function value). (Footnote 11: By proper rounding of $x^{k}$ one can guarantee, without loss of generality, that $x^{k}_{ij}\geq\varepsilon/(2n^{2})$, which provides $\frac{\bar{c}_{k}}{\gamma}=\frac{\max_{i,j}c_{ij}}{\gamma}+\ln\left(\frac{2n^{2}}{\varepsilon}\right)$.) When $\gamma$ is small, the complexity is given by the second component and vice versa. At the same time, from Theorem 2.10 and Example 2.8 with the inexact model w.r.t. $V[y](x)$ chosen as the KL-divergence, it follows that, for any chosen $\gamma$, the number of proximal iterations to obtain accuracy $\varepsilon$ is bounded by $\widetilde{O}\left(\gamma/\varepsilon\right)$. Thus, we can trade off the number of outer iterations against the complexity of the inner problem on each iteration by choosing an appropriate $\gamma$. It can be shown that for a special choice of $\gamma=O(\max_{k}\bar{c}_{k})$, the resulting complexity of the whole method can be estimated as $\widetilde{O}\left(n^{4}/\varepsilon^{2}\right)$ to obtain accuracy $\varepsilon$ in approximating the non-regularized MKW-distance (see Footnote 12 below). In practice this method (Prox Sinkhorn) works significantly better. Note that the best known (at the moment) theoretical bound for the transport problem is $\widetilde{O}\left(n^{2}/\varepsilon\right)$ [Blanchet et al.(2018)], whereas Sinkhorn’s algorithm has the complexity $\widetilde{O}\left(n^{2}/\varepsilon^{2}\right)$.

Footnote 12: Based on Definition 4.1 and estimate (2) one can show the following dependence: $\tilde{\varepsilon}=\widetilde{O}(\varepsilon^{4}/(\gamma n^{4}))$, where $\varepsilon$ is a given accuracy (in function value) for the initial problem. To prove this fact one should use relation (36) with $\|\cdot\|=\|\cdot\|_{1}$, $R=2$, $\mu=\gamma$. To bound $L$ we should modify $Q$ (the transport polyhedron) by adding the constraints $x_{ij}\geq\varepsilon/(4n^{2})$, $i,j=1,...,n$. Without loss of generality (see Algorithm 2 in [Dvurechensky et al.(2018a)]) we can consider $l$ and $w$ to be such that $\min_{i}l_{i}\geq\varepsilon/(2n)$ and $\min_{j}w_{j}\geq\varepsilon/(2n)$. Hence, the new polyhedron is well defined and the solution of the modified problem is an $O(\varepsilon)$-solution (by function value) of the initial problem. For the modified problem one can guarantee that $L=4\gamma n^{2}/\varepsilon$. According to (36) and Theorem B.1 one should solve the auxiliary problem with an accuracy $\tilde{\varepsilon}$ that guarantees $O(\varepsilon)=\tilde{\delta}=(5\gamma n^{2}/\varepsilon)R\sqrt{2\tilde{\varepsilon}/\gamma}$. The only problem is that now we cannot directly apply Sinkhorn’s algorithm. This problem can be solved by a trivial affine transformation of the $x$-space. This transformation reduces the modified polyhedron to the standard one, and we can use Sinkhorn’s algorithm. Such a transformation does not change (in terms of $O(\cdot)$) the requirements on the accuracy. One should note, however, that all these ‘modifications’ are not necessary in practice. Since entropy is a highly smooth function in the positive orthant and zero $x$-components are impossible due to the specifics of Sinkhorn’s algorithm, we can consider a simpler variant of the stopping rule for Sinkhorn’s method in practice. We do $\bar{N}$ iterations of Sinkhorn’s algorithm for the inner problem at each outer iteration. Then we restart the whole procedure from the very beginning with $\bar{N}:=2\bar{N}$, etc. At some moment we detect that a further step $\bar{N}:=2\bar{N}$ does not significantly change the quality of the solution, and we stop there. One can easily show that all these restarts increase the total complexity of the procedure by no more than 4 times in comparison with the procedure with the (unknown) optimal value of $\bar{N}$.
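The proximal Sinkhorn idea can be sketched as follows, assuming the proximal step has the form $x^{k+1}=\operatorname*{argmin}_{x}\{\langle c,x\rangle+\gamma V[x^{k}](x)\}$ over the transport polytope with $V$ the KL-divergence. Under this assumption the optimality conditions give $x^{k+1}_{ij}\propto x^{k}_{ij}e^{-c_{ij}/\gamma}$ up to row and column scalings, so each outer step is one Sinkhorn run with kernel $x^{k}\odot e^{-c/\gamma}$. The function names, the fixed inner iteration count, and the toy data are assumptions for illustration, not the experimental setup of the paper.

```python
import numpy as np

def sinkhorn(K, l, w, n_iter=200):
    """Scale the positive kernel K to have row sums l and column sums w."""
    u = np.ones_like(l)
    v = np.ones_like(w)
    for _ in range(n_iter):
        u = l / (K @ v)      # match row marginals
        v = w / (K.T @ u)    # match column marginals
    return u[:, None] * K * v[None, :]

def proximal_sinkhorn(c, l, w, gamma=1.0, n_outer=20, n_inner=200):
    """KL-proximal point method for the transport LP; each step is a Sinkhorn solve."""
    x = np.outer(l, w)  # strictly positive feasible starting plan
    for _ in range(n_outer):
        K = x * np.exp(-c / gamma)  # kernel of the k-th entropy-linear subproblem
        x = sinkhorn(K, l, w, n_inner)
    return x

n = 3
l = np.ones(n) / n
w = np.ones(n) / n
idx = np.arange(n)
c = np.abs(idx[:, None] - idx[None, :]).astype(float)  # cost |i - j|, optimal value 0
x = proximal_sinkhorn(c, l, w, gamma=1.0, n_outer=20)
cost = float(np.sum(c * x))
```

Here $\gamma=1$ is not small, yet the outer iterations still drive the transport cost to the unregularized optimum, which is the point of the construction.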

Figure 1 shows experimental comparison of Sinkhorn’s method and proximal Sinkhorn’s method. For the Sinkhorn’s method $\gamma$ was chosen in accordance with the theoretical bound $\widetilde{O}\left(\varepsilon\right)$ . For the proximal Sinkhorn’s algorithm, we used the following idea of adaptivity to the parameter $\gamma$ . In the first iteration of the proximal method, the problem is solved with overestimated $\gamma$ parameter value. Then we set $\gamma:=\gamma/2$ and the problem is solved with the updated value of the parameter, and so on, until a significant increase (for example, 10 times) in the complexity of the auxiliary entropy-linear programming problem in comparison with the initial complexity is detected. The found value of parameter $\gamma$ can be used in next iterations of the proximal method. Also the starting point for the Sinkhorn’s method on the next outer iteration can be chosen as the solution of the auxiliary problem from the previous iteration.

In the experiments we use a standard MNIST dataset with images scaled to a size $10\times 10$ . The vectors $l$ and $w$ contain the pixel intensities of the first and second images respectively ( $n=(width)^{2}=100$ ). The value of $c_{ij}$ is equal to the Euclidean distance between the $i$ -th pixel from the vector $l$ and the $j$ -th pixel from the vector $w$ on the image pixel grid.

It seems that the described example admits various further generalizations, e.g. to the Greenkhorn algorithm (instead of Sinkhorn) [Lin et al.(2019)], or it can be extended to the Wasserstein barycenter calculation problem [Kroshnin et al.(2019)].

Example A.7.

Universal method, [Nesterov(2015)]

In this example we present a special case of a $(\delta,L)$-model which is closely related to the universal method (see [Nesterov(2015)]). We show that for some choice of the $(\delta,L)$-model w.r.t. $\|\cdot\|$ and of $\delta_{k}$ our fast gradient method has the same rate of convergence as the accelerated version of the standard universal method. Let $f$ be a convex function with Hölder continuous (sub)gradient w.r.t. $\|\cdot\|$:

[TABLE]

For functions with Hölder continuous (sub)gradient we can write the following inequality ([Nesterov(2015)]):

[TABLE]

where

[TABLE]

and $\delta>0$ is a free parameter.

From the last inequality one can see that we can take $\psi_{\delta_{k}}(x,y)=\langle\nabla f(y),x-y\rangle$ and $f_{\delta_{k}}(y)=f(y)$ .
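The inequality behind this model can be checked numerically. The sketch below assumes the constant has the form $L(\delta)=L_{\nu}\left[\frac{L_{\nu}}{2\delta}\frac{1-\nu}{1+\nu}\right]^{\frac{1-\nu}{1+\nu}}$ and the inequality reads $f(x)\leq f(y)+\langle\nabla f(y),x-y\rangle+\frac{L(\delta)}{2}\|x-y\|^{2}+\delta$; both are assumptions about the displayed formulas. The one-dimensional test function $f(x)=|x|^{1+\nu}/(1+\nu)$, whose derivative is Hölder continuous with constant $2^{1-\nu}$, and the helper names are also chosen only for illustration.

```python
import numpy as np

def universal_L(delta, L_nu, nu):
    """Assumed constant L(delta) = L_nu * [L_nu/(2 delta) * (1-nu)/(1+nu)]^((1-nu)/(1+nu))."""
    if nu == 1.0:
        return L_nu  # convention 0^0 = 1: the smooth case needs no inexactness
    base = L_nu / (2.0 * delta) * (1.0 - nu) / (1.0 + nu)
    return L_nu * base ** ((1.0 - nu) / (1.0 + nu))

def f(x, nu):
    return np.abs(x) ** (1.0 + nu) / (1.0 + nu)

def grad_f(x, nu):
    return np.sign(x) * np.abs(x) ** nu

rng = np.random.default_rng(1)
min_slack = np.inf
for _ in range(10000):
    nu = rng.uniform(0.0, 1.0)
    delta = 10.0 ** rng.uniform(-4, 0)
    L_nu = 2.0 ** (1.0 - nu)  # Holder constant of grad_f for this test function
    x, y = rng.uniform(-3.0, 3.0, size=2)
    L = universal_L(delta, L_nu, nu)
    upper = f(y, nu) + grad_f(y, nu) * (x - y) + 0.5 * L * (x - y) ** 2 + delta
    # Slack of the inexact quadratic upper bound; it should stay nonnegative.
    min_slack = min(min_slack, upper - f(x, nu))
```

The exact gradient and function value are used throughout; only the pair $(\delta,L(\delta))$ changes, which is why no knowledge of $\nu$ is needed once $L$ is found adaptively.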

Let us take

[TABLE]

where $\varepsilon$ is the required accuracy of the solution by function. From theorem 2.12 with our assumptions we have the following convergence rate:

[TABLE]

As in [Nesterov(2015)] we can show that

[TABLE]

Using (44) we can show the following upper bound for the number of steps for getting $\epsilon$ -solution:

[TABLE]

This estimate is optimal (see [Guzmán and Nemirovski(2015)]).

Example A.8.

Universal conditional gradient (Frank–Wolfe) method

Let us consider the convex problem (1), where $f$ has a Hölder continuous (sub)gradient w.r.t. $\|\cdot\|$. Assume that $V[y](x)\leq R_{Q}^{2}$ for all $x,y\in Q$. Sometimes in practice the auxiliary problem (13) can be hard ([Ben-Tal and Nemirovski(2015), Nesterov(2018)]). In [Jaggi M.(2013)] it was shown that the conditional gradient (Frank–Wolfe) method can be useful for some of these problems. (Footnote 13: For details see also [Bubeck et al.(2015), Ben-Tal and Nemirovski(2015), Harchaoui Z., Juditsky A., Nemirovski A.(2015), Anikin et al.(2015), Nesterov(2018)].) In Algorithms 1 and 2 from Sections 2.2 and 2.3 we have the auxiliary optimization problems (9) and (13). Instead of the functions in the auxiliary optimization problems (9) and (13) let us take

[TABLE]

and

[TABLE]

respectively. With this substitution our method from Section 2.2 becomes the Frank–Wolfe method. Further we show that Frank–Wolfe is a special case of the methods from Sections 2.2 and 2.3. Moreover, we provide a universal Frank–Wolfe method combining ideas from the Frank–Wolfe method and the universal method [Nesterov(2015)]. Let us look at this substitution from the point of view of an error $\widetilde{\delta}_{k}$, where $\widetilde{\delta}_{k}$ is an error in terms of Definition 4.1. We can show that it is enough to take $\widetilde{\delta}_{k}=2L_{k+1}R^{2}_{Q}$ for all $k\geq 0$, where $R_{Q}$ is a diameter of the set $Q$. Also let us take

[TABLE]

where $\varepsilon$ is the accuracy of the solution by function. It is enough to do

[TABLE]

steps in order to find an $\varepsilon$-solution of the optimization problem. Constants $L_{\nu}$ and $\nu$ are defined in Example A.7. Let us prove it. Let us first show that it is enough to take $\widetilde{\delta}_{k}=2L_{k+1}R^{2}_{Q}$ for all $k\geq 0$:

[TABLE]

Thus the point $u_{k+1}$ is a $\widetilde{\delta}_{k}$ -solution in sense of Definition 4.1.

It remains to prove inequality (45). Using Theorem C.1 we can show:

[TABLE]

Using (44) we can show the following upper bound for the number of steps for getting $\epsilon$ -solution:

[TABLE]
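For orientation, here is a minimal sketch of the classical (non-universal) Frank–Wolfe method on the unit simplex with the standard step size $2/(k+2)$. It only illustrates the key substitution used in this example, namely that the auxiliary problem is replaced by a linear minimization over $Q$; it does not implement the adaptive choice of $L_{k+1}$ or the $\delta_{k}$ schedule discussed above, and the test problem and function name are assumptions.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, n_iter):
    """Classical Frank-Wolfe on the unit simplex with step size 2 / (k + 2)."""
    x = np.asarray(x0, dtype=float)
    for k in range(n_iter):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0  # linear minimization oracle: the best simplex vertex
        gamma = 2.0 / (k + 2.0)
        x = (1.0 - gamma) * x + gamma * s  # convex combination keeps x feasible
    return x

b = np.array([0.2, 0.5, 0.3])  # target inside the simplex, so the optimal value is 0
f = lambda x: 0.5 * np.sum((x - b) ** 2)
grad = lambda x: x - b
x_fw = frank_wolfe_simplex(grad, x0=np.array([1.0, 0.0, 0.0]), n_iter=500)
```

Each iterate is a convex combination of simplex vertices, so no projection or Bregman step is ever computed.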

Appendix B Proof of Theorem 2.10

Let us propose a generalization of Theorem 2.12 in which we take into account the inaccuracies arising from the inexact solution of the auxiliary problems. The first sequence $\{\delta_{k}\}_{k=0}^{N-1}$ is such that for any $k$ there is a $(\delta_{k},L)$-model for $f$ (w.r.t. $V[y](x)$, and w.r.t. $\|\cdot\|$ in Appendix C). The numbers in the second sequence $\{\widetilde{\delta}_{k}\}_{k=0}^{N-1}$ are the accuracies of the solutions of the auxiliary problems in terms of Definition 4.1.

Theorem B.1.

Let $V[x_{0}](x_{*})\leq R^{2}$ , where $x_{0}$ is the starting point, and $x_{*}$ is the closest point of the minimum to the point $x_{0}$ in the sense of Bregman divergence, and

[TABLE]

For the proposed algorithm we have the following convergence rate:

[TABLE]

The full proof of this theorem includes two lemmas. Let us formulate and prove lemmas.

Lemma B.2.

Let $\psi(x)$ be a convex function and

[TABLE]

where $\beta\geq 0$ . Then

[TABLE]

Proof B.3.

By definition 4.1:

[TABLE]

Then inequality

[TABLE]

and equality

[TABLE]

complete the proof.

Lemma B.4.

For all $x\in Q$ we have

[TABLE]

Proof B.5.

[TABLE]

Here step 1 follows from Lemma B.2 with $\psi(x)=\psi_{\delta_{k}}(x,x_{k})$ and $\beta=1/\alpha_{k+1}$.

Remark B.6.

Let us show that $L_{k}\leq 2L\quad\forall k\geq 0$. For $k=0$ this is true from the fact that $L_{0}\leq L$. For $k\geq 1$ this follows from the fact that we leave the inner cycle before $L_{k}$ becomes greater than $2L$. The exit from the cycle is guaranteed by the condition that there is a $(\delta_{k},L)$-model for $f(x)$ at any point $x\in Q$.
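The bound $L_{k}\leq 2L$ can be observed on a small example. The sketch below is a Euclidean gradient method with the halve-then-double backtracking described in the remark, using the linear model $\psi_{\delta}(x,y)=\langle\nabla f(y),x-y\rangle$; it is an illustration under these assumptions rather than a transcription of Algorithm 1, and the quadratic test function and the name `adaptive_gradient` are hypothetical.

```python
import numpy as np

def adaptive_gradient(f, grad, x0, L0, delta, n_iter):
    """Gradient method with backtracking on L; returns the last point and accepted constants."""
    x = np.asarray(x0, dtype=float)
    L = L0
    accepted = []
    for _ in range(n_iter):
        L = L / 2.0  # start each iteration with an optimistic guess
        g = grad(x)
        while True:
            x_new = x - g / L
            model = f(x) + np.dot(g, x_new - x) + 0.5 * L * np.sum((x_new - x) ** 2) + delta
            if f(x_new) <= model:
                break  # the (delta, L)-model inequality holds at x_new
            L *= 2.0
        accepted.append(L)
        x = x_new
    return x, accepted

H = np.diag([1.0, 4.0])  # f has a 4-Lipschitz gradient
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x
L_true = 4.0
x_last, accepted = adaptive_gradient(f, grad, x0=[1.0, 1.0], L0=1.0, delta=1e-10, n_iter=200)
```

The inner loop always stops once the trial value reaches the true constant, so doubling can overshoot it by at most a factor of two.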

We are ready to prove the theorem.

Proof B.7.

Let us sum up the inequality from Lemma B.4 at $k=0,...,N-1$

[TABLE]

With $x=x_{*}$ we have that

[TABLE]

Since $V[x_{N}](x_{*})\geq 0$ we obtain inequality

[TABLE]

Let us divide both parts by $A_{N}$ .

[TABLE]

Using the convexity of $f(x)$ we can show that

[TABLE]

It remains only to prove that

[TABLE]

As follows from Definition 2.1 and Remark B.6, $L_{k}\leq 2L$ for all $k\geq 0$. Thus, we have that

[TABLE]

and

[TABLE]

The total number of oracle calls is estimated in the same way as in [Nesterov and Polyak(2006)].

Appendix C Proof of Theorem 2.12

Theorem C.1.

Let $V[x_{0}](x_{*})\leq R^{2}$ , where $x_{0}$ is the starting point and $x_{*}$ is the nearest minimum point to $x_{0}$ in the sense of Bregman divergence. For the proposed algorithm the following inequality holds:

[TABLE]

Let us prove auxiliary lemmas.

Lemma C.2.

Suppose that the sequence $\alpha_{k}$ satisfies

[TABLE]

where $L_{k}\leq 2L\,\forall k\geq 0$ (see Remark B.6). Then $\forall k\geq 1$ the following inequality holds:

[TABLE]

Proof C.3.

Let $k=1$ .

[TABLE]

and

[TABLE]

Let $k\geq 2$ , then

[TABLE]

Solving this quadratic equation we will take the largest root, therefore

[TABLE]

By induction, let the inequality (48) be true for $k$ , then:

[TABLE]

The last inequality follows from the induction hypothesis. Finally, we obtain, that

[TABLE]

and

[TABLE]
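The growth of $A_{k}$ established in this lemma can be checked numerically. The sketch below assumes that the sequence is defined by $A_{0}=0$ and $A_{k+1}=A_{k}+\alpha_{k+1}=L_{k+1}\alpha_{k+1}^{2}$, and that the bound (48) has the form $A_{k}\geq\frac{(k+1)^{2}}{8L}$; both are assumptions about the displayed formulas. It uses the worst case $L_{k}=2L$ allowed by Remark B.6, and the function name is hypothetical.

```python
import math

def fgm_weights(L_seq):
    """Sequences alpha_k, A_k with A_0 = 0 and A_{k+1} = A_k + alpha_{k+1} = L_{k+1} * alpha_{k+1}^2."""
    A = [0.0]
    alphas = []
    for L_k in L_seq:
        # Largest root of the quadratic L_k * alpha^2 - alpha - A_k = 0.
        alpha = (1.0 + math.sqrt(1.0 + 4.0 * L_k * A[-1])) / (2.0 * L_k)
        alphas.append(alpha)
        A.append(A[-1] + alpha)
    return alphas, A

L = 3.0
N = 50
alphas, A = fgm_weights([2.0 * L] * N)  # worst case allowed by L_k <= 2L
lower_bounds = [(k + 1) ** 2 / (8.0 * L) for k in range(N + 1)]
```

The quadratic growth of $A_{k}$ is what produces the $O(1/N^{2})$ rate of the fast gradient method.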

Lemma C.4.

For each $x\in Q$ we have:

[TABLE]

Proof C.5.

[TABLE]

Here step 1 follows from $A_{k}=L_{k}\alpha^{2}_{k}$, and step 2 follows from Lemma B.2 and (3).

We are ready to prove the theorem.

Proof C.6.

Let us sum up the inequality of lemma C.4 for $k=0,...,N-1$

[TABLE]

and

[TABLE]

Let us take $x=x_{*}$ :

[TABLE]

We divide both sides of the inequality by $A_{N}$ and finally we get, that

[TABLE]

Here step 1 follows from Lemma C.2.

Appendix D The Case of Strongly Convex Objective

Now we consider the case of a strongly convex objective. The following assumption allows us to prove a linear rate of convergence for Algorithm 1.

Definition D.1.

We say that the function $f$ is right relative $\mu$-strongly convex if the following inequality

[TABLE]

holds.

Recall that for a functional $f$ that is strongly convex in the usual sense the following inequality holds

[TABLE]

Remark D.2.

Let us recall that if $d(x-y)\leq C_{n}\left\lVert x-y\right\rVert^{2}$ for $C_{n}=O(\log n)$ (where $n$ is the dimension of vectors from $Q$), then $V[y](x)\leq C_{n}\left\lVert x-y\right\rVert^{2}$. This assumption is true for many standard proximal setups. In this case the condition of $(\mu C_{n})$-strong convexity

[TABLE]

entails right relative strong convexity:

[TABLE]

After $k$ iterations of the non-adaptive version of Algorithm 1 with a constant step $\alpha_{i}=\frac{1}{L}$ ($i=1,...,k$), using Lemma B.2, we have:

[TABLE]

therefore,

[TABLE]

Further, $\psi_{\delta}(x,y)$ is a $(\delta,L)$-model w.r.t. $V[y](x)$ and from

[TABLE]

we get

[TABLE]

Now (49) means

[TABLE]

Using right relative strong convexity, we have:

[TABLE]

or

[TABLE]

Considering (50), we obtain:

[TABLE]

For $x=x_{*}$ we have:

[TABLE]

Therefore, we have

[TABLE]

Let $y_{k}=\operatorname*{argmin}_{i=1,...,k}(f(x_{i}))$ . Then using this definition and the fact that

[TABLE]

we obtain

[TABLE]

and, using the fact that $e^{-x}\geq 1-x$ for all $x\geq 0$, we conclude that

[TABLE]

Let $x=x_{*}$ in (51), from which $f(x_{*})\leq f(x_{k+1})$ and

[TABLE]

i.e.

[TABLE]

Further,

[TABLE]

Therefore, taking into account the following fact $\sum\limits_{i=0}^{k}\left(1-\frac{\mu}{L}\right)^{i}<\frac{1}{1-\left(1-\frac{\mu}{L}\right)}=\frac{L}{\mu}$ , we obtain

[TABLE]

Thus, we have the following result

Theorem D.3.

Assume that the function $f$ is right relatively strongly convex and $\psi_{\delta}(x,y)$ is a $(\delta,L)$-model w.r.t. $V[y](x)$. Then, after $k$ iterations of the non-adaptive version of Algorithm 1, the estimates (52) and (53) hold.

In other words, if a function satisfies right relative strong convexity and relative smoothness, then after performing $O(\log(\frac{1}{\varepsilon}))$ iterations we can achieve an accuracy of $\varepsilon$, up to a term $O(\delta+\widetilde{\delta})$.

Let us consider the case of a strongly convex functional $f$ and show how to accelerate the work of Algorithms 1 and 2 using the restart technique. Let us assume that

[TABLE]

Note that this assumption is natural, e.g. $\psi_{\delta}(x,y):=\langle\nabla f(y),x-y\rangle\,\,\,\forall x,y\in Q$. We also modify the concept of relative $\mu$-strong convexity in the following way

Definition D.4.

We say that the function $f$ is left relative $\mu$-strongly convex if the following inequality

[TABLE]

holds.

Note that the concepts of right and left relative strong convexity from Definitions D.1 and D.4 are equivalent under the assumption from Remark D.2 ($V[x](y)\leq C_{n}\|x-y\|^{2}$ for each $x,y\in Q$).

Theorem D.5.

Let $f$ be a left relative $\mu$-strongly convex function and let $\psi_{\delta}(x,y)$ be a $(\delta,L)$-model w.r.t. $V[y](x)$. Then, using the restarts of Algorithm 1, we obtain the estimate

[TABLE]

for a given $\varepsilon>0$. The total number of iterations of Algorithm 1 does not exceed

[TABLE]

Proof D.6.

By Definition D.4 and Theorem B.1 we have

[TABLE]

Further, due to the following inequality

[TABLE]

let’s choose the smallest number of steps $N_{1}$ :

[TABLE]

Similarly, after the $2$nd restart ($N_{2}$ operations)

[TABLE]

After the $p$-th restart ($N_{p}$ operations)

[TABLE]

Choose $p$ such that

[TABLE]

After $p=\left\lceil\log_{2}{\frac{R^{2}}{\varepsilon}}\right\rceil$ restarts we have

[TABLE]

The number of iterations $N_{k}\ (k=\overline{1,p})$ on the k-th restart of Algorithm 1 is estimated from (57):

[TABLE]

So, we can put $N_{k}=\left\lceil{\cfrac{4L}{\mu}}\right\rceil$ and (56) holds.
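The resulting restart budget is simple arithmetic. The sketch below combines the two quantities stated in the proof, $p=\left\lceil\log_{2}\frac{R^{2}}{\varepsilon}\right\rceil$ restarts and $N_{k}=\left\lceil\frac{4L}{\mu}\right\rceil$ iterations per restart, and takes their product as the total. The function name and the sample values are assumptions, and the additive terms depending on $\delta$ and $\widetilde{\delta}$ in the accuracy estimate are ignored here.

```python
import math

def restart_schedule(L, mu, R2, eps):
    """Number of restarts and iterations per restart for the restarted gradient method."""
    p = max(0, math.ceil(math.log2(R2 / eps)))  # number of restarts
    n_per_restart = math.ceil(4.0 * L / mu)     # iterations within each restart
    return p, n_per_restart, p * n_per_restart

p, n_per_restart, total = restart_schedule(L=10.0, mu=1.0, R2=1.0, eps=1e-3)
```

The total grows linearly in the condition number $L/\mu$ and only logarithmically in $1/\varepsilon$, which is the linear rate of convergence claimed for the restarted method.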

We show that the restart technique can also accelerate the non-adaptive version of Algorithm 2 ($L_{k+1}=L$) for a $(\delta,L)$-model $\psi_{\delta}(x,y)$ w.r.t. the norm $\|\cdot\|$ and a relatively $\mu$-strongly convex function $f$ in the sense of Definition D.4:

[TABLE]

for each $x,y\in Q$ . By Theorem C.1:

[TABLE]

Consider the case of a relatively $\mu$-strongly convex function $f$. We will use the restart technique to obtain a method for strongly convex functions. By the preceding estimate (from Theorem C.1) and Definition D.4:

[TABLE]

Let’s choose $N_{1}$ so that the following inequality holds:

[TABLE]

We restart method as

[TABLE]

From (60):

[TABLE]

Let’s choose

[TABLE]

Then after $N_{1}$ iterations we restart method. Similarly, we restart after $N_{2}$ iterations, such that $V[x_{N_{2}}](x_{*})\leq\frac{V[x_{N_{1}}](x_{*})}{4}$ . We obtain

[TABLE]

So, after $p$ -th restart the total number of iterations:

[TABLE]

Now let’s consider how many iterations are needed to achieve accuracy $\varepsilon=f(x_{N_{p}})-f(x_{*})$. From (59) and (62) we take

[TABLE]

and total number of iterations:

[TABLE]

Let’s estimate the accuracy $\varepsilon$ we can achieve. For each $k=\overline{1,p}$ we need to enforce the following inequality:

[TABLE]

where $N_{k}=\left\lceil 6\sqrt{\frac{L}{\mu}}\right\rceil$ . So, we can achieve the following accuracy:

[TABLE]

Appendix E A proof of Theorem 3.7 for the case of inexactness for auxiliary problem

For Algorithm 3 we may also take into account inexactness for auxiliary problems on iterations (see Definition 4.1).

Theorem E.1.

For Algorithm 4 the following inequalities hold

[TABLE]

The method works no more than

[TABLE]

iterations.

Proof E.2.

After $(k+1)$ -th iteration ( $k=0,1,2\ldots$ ) we have for each $u\in Q$ :

[TABLE]

and

[TABLE]

Taking into account (28), we obtain

[TABLE]

So, the following inequality

[TABLE]

holds. By virtue of (22) and the choice of $L_{0}\leqslant 2L$ , it is guaranteed that

[TABLE]

and we have

[TABLE]

Appendix F On the concept of a $(\delta,L)$-model for saddle point problems

Closely related to variational inequalities are the so-called saddle-point problems, in which for a functional $f(u,v):\mathbb{R}^{n_{1}+n_{2}}\rightarrow\mathbb{R}$ that is convex in $u$ and concave in $v$ ($u\in Q_{1}\subset\mathbb{R}^{n_{1}}$ and $v\in Q_{2}\subset\mathbb{R}^{n_{2}}$) a saddle point needs to be found, i.e. a point of $Q_{1}\times Q_{2}$ such that:

[TABLE]

for arbitrary $u\in Q_{1}$ and $v\in Q_{2}$ . Let $Q=Q_{1}\times Q_{2}\subset\mathbb{R}^{n_{1}+n_{2}}$ . For $x=(u,v)\in Q$ , we assume that $||x||=\sqrt{||u||_{1}^{2}+||v||_{2}^{2}}$ ( $||\cdot||_{1}$ and $||\cdot||_{2}$ are the norms in the spaces $\mathbb{R}^{n_{1}}$ and $\mathbb{R}^{n_{2}}$ ). We agree to denote $x=(u_{x},v_{x}),\;y=(u_{y},v_{y})\in Q$ .

It is well known that, for a function $f$ that is sufficiently smooth with respect to $u$ and $v$, the problem (68) reduces to a VI with the operator

[TABLE]
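The reduction can be sketched numerically. For differentiable $f$, the standard choice of operator is $g(x)=\left(\nabla_{u}f(u,v),\,-\nabla_{v}f(u,v)\right)$, which is monotone when $f$ is convex in $u$ and concave in $v$. The toy function $f(u,v)=\frac{1}{2}u^{2}+uv-\frac{1}{2}v^{2}$ and the helper names `saddle_operator` and `monotonicity_gap` are assumptions made for illustration only; they do not come from the paper.

```python
def saddle_operator(grad_u, grad_v):
    """Build g(u, v) = (df/du, -df/dv) from the two partial derivatives."""
    def g(u, v):
        return (grad_u(u, v), -grad_v(u, v))
    return g


# Toy scalar example (an assumption): f(u, v) = 0.5*u**2 + u*v - 0.5*v**2,
# convex in u, concave in v, with saddle point (0, 0).
def grad_u(u, v):
    return u + v


def grad_v(u, v):
    return u - v


g = saddle_operator(grad_u, grad_v)


def monotonicity_gap(op, x, y):
    """Return <op(x) - op(y), x - y>; nonnegative for a monotone operator."""
    gx, gy = op(*x), op(*y)
    return (gx[0] - gy[0]) * (x[0] - y[0]) + (gx[1] - gy[1]) * (x[1] - y[1])


if __name__ == "__main__":
    print(g(0.0, 0.0))                                  # operator at the saddle point
    print(monotonicity_gap(g, (1.0, 2.0), (-0.5, 0.25)))
```

For this toy function the gap equals $\|x-y\|_{2}^{2}$, so the operator is in fact strongly monotone; a purely bilinear $f(u,v)=uv$ would give a gap of exactly zero.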

For saddle-point problems we propose an adaptation of the concept of the $(\delta,L)$-model for abstract variational inequalities (w.r.t. $V[y](x)$ or $\|\cdot\|$).

Definition F.1.

We say that the function $\psi_{\delta}(x,y)$ $(\psi_{\delta}:\mathbb{R}^{n_{1}+n_{2}}\times\mathbb{R}^{n_{1}+n_{2}}\rightarrow\mathbb{R})$ is a $(\delta,L)$-model w.r.t. $V[y](x)$ for the saddle-point problem (68) if the following properties hold for each $x,y,z\in Q$:

(i) $\psi_{\delta}(x,y)$ is convex in the first variable;

(ii) $\psi_{\delta}(x,x)=0$;

(iii) (abstract $\delta$-monotonicity)

[TABLE]

(iv) (generalized relative smoothness)

[TABLE]

for some fixed values $L>0$, $\delta>0$;

(v)

[TABLE]

Example F.2.

The proposed concept of the $(\delta,L)$-model for saddle-point problems applies, for example, to composite saddle-point problems of the form considered in [Chambolle and Pock(2011)]:

[TABLE]

for some subdifferentiable function $\tilde{f}$ that is convex in $u$ and concave in $v$, and for convex functions $h$ and $\varphi$. In this case, one can set

[TABLE]

where

[TABLE]

Indeed, from subgradient inequalities:

[TABLE]

Therefore, we have

[TABLE]

whence

[TABLE]

Theorem E.1 implies the following result.

Theorem F.3.

If for the saddle-point problem (68) there is a $(\delta,L)$-model $\psi(x,y)$ w.r.t. $V[y](x)$, then after the algorithm stops we get a point

[TABLE]

for which the inequality is true:

[TABLE]

Appendix G Modelling for Strongly Monotone VI

We can also consider a $\mu$-strongly monotone $(\delta,L)$-model for VI with the following stronger version of (16):

[TABLE]

for some fixed number $\mu>0$ (here we put $\delta=0$). We also assume that $\psi_{\delta}(x,y)$ is continuous in $x$ and $y$. We slightly modify the assumptions on the prox-function $d(x)$. Namely, we assume that $0=\arg\min_{x\in Q}d(x)$ and that $d$ is bounded on the unit ball in the chosen norm $\|\cdot\|$, that is

[TABLE]

where $\Omega$ is some known constant. Note that for standard proximal setups, $\Omega=O(\ln\dim E)$. Finally, we assume that we are given a starting point $x_{0}\in Q$ and a number $R_{0}>0$ such that $\|x_{0}-x_{*}\|^{2}\leq R_{0}^{2}$, where $x_{*}$ is the solution to the abstract VI. The restart procedure for Algorithm 3 is applicable to abstract strongly monotone variational inequalities.

Theorem G.1.

Assume that $\psi$ satisfies (77). Assume also that the prox-function $d(x)$ satisfies (78), and that the starting point $x_{0}\in Q$ and a number $R_{0}>0$ are such that $\|x_{0}-x_{*}\|^{2}\leq R_{0}^{2}$, where $x_{*}$ is the solution to (17). Then, for $p\geq 0$

[TABLE]

and the point $x_{p}$ returned by the natural analogue of Algorithm 5 with restarts of Algorithm 3 satisfies $\|x_{p}-x_{*}\|^{2}\leq\varepsilon$. The total number of iterations of the inner Algorithm 3 does not exceed

[TABLE]

where $\Omega$ is the constant from (78).

Proof G.2.

We show by induction that, for $p\geq 0$ ,

[TABLE]

which leads to the statement of the theorem. For $p=0$ this inequality holds by the assumption of the theorem. Assuming that it holds for some $p\geq 0$, our goal is to prove it for $p+1$ by considering the outer iteration $p+1$. Observe that the function $d_{p}(x)$ defined in Algorithm 5 is 1-strongly convex w.r.t. the norm $\|\cdot\|/R_{p}$.

This means that, at each step $k$ of inner Algorithm 3, $L_{N_{p}}$ changes to $L_{N_{p}}\cdot R_{p}^{2}$ . Using the definition of $d_{p}(\cdot)$ and (78), we have, since $x_{p}=\arg\min_{x\in Q}d_{p}(x)$

[TABLE]
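The strong convexity of $d_{p}$ and the rescaling of the smoothness constant used above can be checked by a short calculation. The sketch below assumes that $d_{p}$ is the rescaled prox-function $d_{p}(x)=d\left(\frac{x-x_{p}}{R_{p}}\right)$ and that $d$ is differentiable and 1-strongly convex w.r.t. $\|\cdot\|$. Both are assumptions made for illustration, since the definition in Algorithm 5 is not restated in this appendix; the symbol $L$ stands for a generic smoothness constant.

```latex
% Assumption: d_p(x) = d((x - x_p)/R_p), with d 1-strongly convex w.r.t. \|\cdot\|.
\begin{align*}
\nabla d_p(x) &= \frac{1}{R_p}\,\nabla d\!\left(\frac{x-x_p}{R_p}\right), \\
d_p(y) - d_p(x) - \langle \nabla d_p(x),\, y-x\rangle
  &= d(y') - d(x') - \langle \nabla d(x'),\, y'-x'\rangle,
  \qquad x' = \frac{x-x_p}{R_p},\quad y' = \frac{y-x_p}{R_p}, \\
  &\geq \frac{1}{2}\,\|y'-x'\|^2
   = \frac{1}{2}\left(\frac{\|y-x\|}{R_p}\right)^{2}.
\end{align*}
% Consequence for the smoothness constant measured in the rescaled norm:
\[
\frac{L}{2}\,\|y-x\|^2 \;=\; \frac{L R_p^2}{2}\left(\frac{\|y-x\|}{R_p}\right)^{2}.
\]
```

The last identity shows why a constant $L$ measured in $\|\cdot\|$ becomes $L\cdot R_{p}^{2}$ when the norm is replaced by $\|\cdot\|/R_{p}$.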

Denote by

[TABLE]

Thus, by Theorem 3.7, taking $u=x_{*}$ , we obtain

[TABLE]

Since the operator $\psi_{\delta}$ is continuous and abstract monotone, we can assume that the solution to weak VI (15) is also a strong solution and

[TABLE]

This and (77) give that, for each $k=0,\ldots,N_{p}-1$,

[TABLE]

Thus, by convexity of the squared norm, we obtain

[TABLE]
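The convexity step is an instance of Jensen's inequality for the squared norm. As a generic sketch, assume the returned point is a weighted average of inner iterates, written here as $\sum_{k}\lambda_{k}w_{k}$; the weights $\lambda_{k}$ and the iterate names $w_{k}$ are placeholders and may differ from the notation of Algorithm 3.

```latex
% Placeholders: \lambda_k \ge 0 with \sum_k \lambda_k = 1, and inner iterates w_k.
\[
\Big\| \sum_{k=0}^{N_p-1} \lambda_k w_k - x_* \Big\|^2
 = \Big\| \sum_{k=0}^{N_p-1} \lambda_k \,(w_k - x_*) \Big\|^2
 \;\leq\; \sum_{k=0}^{N_p-1} \lambda_k \,\| w_k - x_* \|^2 .
\]
```

A bound that holds for every term $\|w_{k}-x_{*}\|^{2}$ therefore carries over to the averaged point.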

Using the stopping criterion $S_{N_{p}}\geq\frac{\Omega}{\mu}$ , we obtain

[TABLE]

which finishes the induction proof.

Bibliography

[Altschuler et al.(2017)] Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1961–1971. Curran Associates, Inc., 2017. arXiv:1705.09634.
[Anikin et al.(2015)] Anton Anikin, Alexander Gasnikov, Alexander Gornov, Dmitry Kamzolov, Yury Maximov, and Yurii Nesterov. Efficient numerical methods to solve sparse linear equations with application to PageRank. arXiv preprint arXiv:1508.07607, 2015.
[Arjovsky et al.(2017)] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.
[Bauschke et al.(2016)] Heinz H Bauschke, Jérôme Bolte, and Marc Teboulle. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330–348, 2016.
[Beck(2015)] Amir Beck. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1):185–209, 2015.
[Beck and Teboulle(2009)] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009. doi:10.1137/080716542. URL https://doi.org/10.1137/080716542.
[Ben-Tal and Nemirovski(2015)] Aaron Ben-Tal and Arkadi Nemirovski. Lectures on Modern Convex Optimization (Lecture Notes). Personal web-page of A. Nemirovski, 2015. URL http://www2.isye.gatech.edu/~nemirovs/Lect_ModConvOpt.pdf.
[Bigot et al.(2012)] Jérémie Bigot, Thierry Klein, et al. Consistent estimation of a population barycenter in the Wasserstein space. ArXiv e-prints, 2012.
