On Modification of an Adaptive Stochastic Mirror Descent Algorithm for   Convex Optimization Problems with Functional Constraints

Mohammad S. Alkousa

arXiv:1904.09513·math.OC·January 22, 2020


TL;DR

This paper introduces a modified adaptive stochastic mirror descent algorithm for constrained convex optimization, improving efficiency by selectively considering constraints and providing convergence analysis and numerical validation.

Contribution

It proposes a new modification to existing algorithms that reduces computational time by not evaluating all constraints at every step, while maintaining optimal convergence rates.

Findings

1. The modified algorithm achieves the same optimal complexity of $O(\varepsilon^{-2})$.

2. Numerical experiments demonstrate improved efficiency over standard methods.

3. The approach effectively handles multiple convex functional constraints in stochastic settings.



Tables

Table 1: Results of Algorithms 1 and 2 for Examples 1 and 2 in $\mathbb{R}^{1500}$.

Example 1
$N$	Algorithm 1 iterations	Algorithm 1 time (sec)	Algorithm 2 iterations	Algorithm 2 time (sec)
75	30 157	618.79	27 007	22.47
100	12 827	254.34	11 071	10.04
150	7 452	139.99	5 713	4.62

Example 2
$N$	Algorithm 1 iterations	Algorithm 1 time (sec)	Algorithm 2 iterations	Algorithm 2 time (sec)
75	104 513	2008.12	90 154	82.38
100	18 814	358.02	17 584	15.3
150	5 451	115.47	4 834	5.45

Table 2: Results of Algorithms 1 and 2 for Examples 1 and 2 with different values of $N$.

Example 1
$N$	Algorithm 1 iterations	Algorithm 1 time (sec)	Algorithm 2 iterations	Algorithm 2 time (sec)
1 000	6 717	9.770	5 366	0.476
5 000	5 726	7.975	5 334	0.452
10 000	8 017	11.076	5 574	0.500
15 000	6 427	8.890	5 243	0.445
25 000	6 775	9.530	5 348	0.474
50 000	7 339	10.232	6 187	0.582
75 000	6 599	9.160	5 287	0.452
100 000	6 235	8.665	5 400	0.456
125 000	6 709	9.175	6 095	0.512
150 000	6 928	9.671	5 360	0.471

Example 2
$N$	Algorithm 1 iterations	Algorithm 1 time (sec)	Algorithm 2 iterations	Algorithm 2 time (sec)
1 000	6 519	10.496	5 178	0.656
2 500	6 238	9.750	4 634	0.523
5 000	5 364	8.287	4 615	0.679
7 500	5 862	9.255	5 029	0.677
10 000	6 025	9.331	4 506	0.569
12 500	5 341	10.687	4 688	0.672
15 000	6 227	12.981	4 995	0.576
17 500	5 847	9.509	4 616	0.603
20 000	5 486	8.515	4 760	0.620
22 500	6 294	10.140	4 551	0.585
25 000	6 055	11.598	4 534	0.596

Equations

$$\|h\|_{*}=\max\limits_{x}\big\{\langle h,x\rangle,\ \|x\|\leq 1\big\},$$

$$|f(x)-f(y)|\leq M_{f}\|x-y\|\quad\forall x,y\in Q,$$

$$|g_{j}(x)-g_{j}(y)|\leq M_{g}\|x-y\|\quad\forall x,y\in Q,\ j=\overline{1,m}.$$

$$g(x)=\max\limits_{j=\overline{1,m}}\{g_{j}(x)\},\qquad |g(x)-g(y)|\leq M_{g}\|x-y\|\quad\forall x,y\in Q.$$

$$f(x)\to\min\limits_{x\in Q,\ g(x)\leq 0}.$$

$$\mathbb{E}[\nabla f(x,\xi)]=\nabla f(x)\in\partial f(x)\quad\text{and}\quad\mathbb{E}[\nabla g(x,\zeta)]=\nabla g(x)\in\partial g(x),$$

$$\|\nabla f(x,\xi)\|_{*}\leq M_{f}\quad\text{and}\quad\|\nabla g(x,\zeta)\|_{*}\leq M_{g}\quad\text{a.s. in }\xi,\zeta.$$

$$\begin{cases}f(x)=\frac{1}{2}\langle Ax,x\rangle\to\min\limits_{x\in S_{n}(1)},\\ \text{s.t. } g(x)=\max\limits_{i=\overline{1,m}}\{\langle c_{i},x\rangle\}\leq 0,\end{cases}$$

$$\mathbb{E}\big[A^{\langle\xi\rangle}\big]=A^{\langle 1\rangle}\mathbb{P}(\xi=1)+\dots+A^{\langle n\rangle}\mathbb{P}(\xi=n)=A^{\langle 1\rangle}x_{1}+\dots+A^{\langle n\rangle}x_{n}=Ax,$$

$$d(y)\geq d(x)+\langle\nabla d(x),y-x\rangle+\frac{1}{2}\|y-x\|^{2}\quad\forall x,y\in Q,$$

$$\min\limits_{x_{*}\in X_{*}}d(x_{*})\leq\Theta_{0}^{2}.$$

$$V_{x}(y)=d(y)-d(x)-\langle\nabla d(x),y-x\rangle.$$

$$\sup\limits_{x,y\in Q}V_{x}(y)\leq\Theta_{0}^{2}.$$

$$\mathrm{Mirr}_{x}(p)=\arg\min\limits_{u\in Q}\big\{\langle p,u\rangle+V_{x}(u)\big\}.$$

$$\mathbb{E}[f(\hat{x})]-f(x_{*})\leq\varepsilon\quad\text{and}\quad g(\hat{x})\leq\varepsilon.$$

$$h\big(f(y)-f(x)\big)\leq\frac{h^{2}}{2}\|\nabla f(y,\xi)\|_{*}^{2}+V_{y}(x)-V_{z}(x)+h\langle\nabla f(y,\xi)-\nabla f(y),y-x\rangle.$$

$$N=\left\lceil\frac{4\max\{M_{f}^{2},M_{g}^{2}\}\Theta_{0}^{2}}{\varepsilon^{2}}\right\rceil$$

$$\delta_{k}=\begin{cases}\langle\nabla f(x^{k},\xi^{k})-\nabla f(x^{k}),x^{k}-x_{*}\rangle & \text{if } k\in I,\\ \langle\nabla g(x^{k},\zeta^{k})-\nabla g(x^{k}),x^{k}-x_{*}\rangle & \text{if } k\in J.\end{cases}$$

$$f(x^{k})-f(x_{*})\leq\frac{h_{k}}{2}\|\nabla f(x^{k},\xi^{k})\|_{*}^{2}+\frac{V_{x^{k}}(x_{*})-V_{x^{k+1}}(x_{*})}{h_{k}}+\langle\nabla f(x^{k},\xi^{k})-\nabla f(x^{k}),x^{k}-x_{*}\rangle,$$

$$g_{j(k)}(x^{k})-g_{j(k)}(x_{*})\leq\frac{h_{k}}{2}\|\nabla g_{j(k)}(x^{k},\zeta^{k})\|_{*}^{2}+\frac{V_{x^{k}}(x_{*})-V_{x^{k+1}}(x_{*})}{h_{k}}+\langle\nabla g_{j(k)}(x^{k},\zeta^{k})-\nabla g_{j(k)}(x^{k}),x^{k}-x_{*}\rangle.$$

$$\begin{aligned}\sum\limits_{k\in I}\big(f(x^{k})-f(x_{*})\big)+\sum\limits_{k\in J}\big(g_{j(k)}(x^{k})-g_{j(k)}(x_{*})\big)&\leq\sum\limits_{k\in I}\frac{h_{k}}{2}\|\nabla f(x^{k},\xi^{k})\|_{*}^{2}+\sum\limits_{k\in J}\frac{h_{k}}{2}\|\nabla g_{j(k)}(x^{k},\zeta^{k})\|_{*}^{2}\\ &\quad+\sum\limits_{k=0}^{N-1}\frac{1}{h_{k}}\big(V_{x^{k}}(x_{*})-V_{x^{k+1}}(x_{*})\big)+\sum\limits_{k=0}^{N-1}\delta_{k}.\end{aligned}$$

$$\begin{aligned}\sum\limits_{k=0}^{N-1}\frac{1}{h_{k}}\big(V_{x^{k}}(x_{*})-V_{x^{k+1}}(x_{*})\big)&=\frac{1}{h_{0}}V_{x^{0}}(x_{*})+\sum\limits_{k=0}^{N-2}\Big(\frac{1}{h_{k+1}}-\frac{1}{h_{k}}\Big)V_{x^{k+1}}(x_{*})-\frac{1}{h_{N-1}}V_{x^{N}}(x_{*})\\ &\leq\frac{\Theta_{0}^{2}}{h_{0}}+\Theta_{0}^{2}\sum\limits_{k=0}^{N-2}\Big(\frac{1}{h_{k+1}}-\frac{1}{h_{k}}\Big)=\frac{\Theta_{0}^{2}}{h_{N-1}}.\end{aligned}$$

$$\begin{aligned}\sum\limits_{k\in I}\big(f(x^{k})-f(x_{*})\big)+\sum\limits_{k\in J}\big(g_{j(k)}(x^{k})-g_{j(k)}(x_{*})\big)&\leq\frac{\Theta_{0}}{2}\sum\limits_{k=0}^{N-1}\frac{M_{k}^{2}}{\big(\sum_{i=0}^{k}M_{i}^{2}\big)^{1/2}}+\Theta_{0}\Big(\sum\limits_{k=0}^{N-1}M_{k}^{2}\Big)^{1/2}+\sum\limits_{k=0}^{N-1}\delta_{k}\\ &\leq 2\Theta_{0}\Big(\sum\limits_{k=0}^{N-1}M_{k}^{2}\Big)^{1/2}+\sum\limits_{k=0}^{N-1}\delta_{k},\end{aligned}$$

$$\sum\limits_{k=0}^{N-1}\frac{M_{k}^{2}}{\big(\sum_{i=0}^{k}M_{i}^{2}\big)^{1/2}}\leq 2\Big(\sum\limits_{k=0}^{N-1}M_{k}^{2}\Big)^{1/2},$$

$$\sum\limits_{k\in J}\big(g_{j(k)}(x^{k})-g_{j(k)}(x_{*})\big)>\sum\limits_{k\in J}\varepsilon=\varepsilon N_{J}.$$

$$\sum\limits_{k\in I}\big(f(x^{k})-f(x_{*})\big)\ \dots$$


Topics: Optimization and Variational Analysis

Full text

On Modification of an Adaptive Stochastic Mirror Descent Algorithm for Convex Optimization Problems with Functional Constraints

Mohammad S. Alkousa (ORCID: 0000-0001-5470-0182)

Moscow Institute of Physics and Technology, Moscow, Russia

This paper has been accepted for publication as a chapter in the forthcoming book: Communications in Mathematical Computations and Applications, IACMC2019, Springer.

Abstract

This paper is devoted to a new modification of a recently proposed adaptive stochastic mirror descent algorithm for constrained convex optimization problems with several convex functional constraints. Both the standard algorithm and its proposed modification are considered for problems with a non-smooth Lipschitz-continuous convex objective function and convex functional constraints. Both algorithms, for a given accuracy $\varepsilon$ of the approximate solution to the problem, are optimal in terms of lower bounds and have complexity $O\left(\varepsilon^{-2}\right)$. In both algorithms, the exact first-order information, namely the (sub)gradients of the objective function and of the functional constraints, is replaced with unbiased stochastic estimates. This means that at each iteration we can still use the values of the objective function and functional constraints at the current point, but instead of their (sub)gradients we calculate their stochastic (sub)gradients. Because not all functional constraints are considered on non-productive steps, the proposed modification reduces the running time of the algorithm. Estimates for the rate of convergence of the proposed modified algorithm are obtained. The results of numerical experiments demonstrating the advantages and the efficiency of the proposed modification on some examples are also given.

Keywords:

Lipschitz-continuous function, non-smooth constrained optimization, adaptive stochastic mirror descent, stochastic (sub)gradient.

1 Introduction

Large-scale non-smooth convex optimization is a common problem in a range of computational areas, including statistics, computer vision, general inverse problems, machine learning and data science, and in many applications arising in the applied sciences and engineering. Since what matters most in practice is the overall computational time needed to solve the problem, first-order methods with computationally low-cost iterations become a viable choice for large-scale optimization problems.

Generally, first-order methods have simple structures with a low memory requirement. Thanks to these features, they have received much attention during the last decade. There are many first-order methods for solving optimization problems with a non-smooth objective function. Examples of these methods, to name but a few, are subgradient methods [23, 26, 30], subgradient projection methods [23, 26, 30], OSGA [22], the bundle-level method [23], and the Lagrange multipliers method [9].

There is a long history of studies on continuous optimization with functional constraints. Recent works on first-order methods for convex optimization with convex functional constraints include [5, 14, 16, 32, 33, 34] for deterministic constraints and [1, 2, 15, 35] for stochastic constraints. However, the parallel development for problems with non-convex objective functions and non-convex constraints, especially for theoretically provable algorithms, remains limited; see [17] and references therein.

The mirror descent algorithm, which originated in [20, 21] and was later analyzed in [7], is regarded as a non-Euclidean extension of subgradient methods. The standard subgradient methods employ the Euclidean distance function with a suitable step-size in the projection step. Mirror descent extends the standard projected subgradient methods by employing a nonlinear distance function with an optimal step-size in the nonlinear projection step [18]. The mirror descent method not only generalizes the standard gradient descent method, but also achieves a better convergence rate [12]. In addition, the mirror descent method is applicable to optimization problems in Banach spaces, where gradient descent is not [12]. An extension of the mirror descent method for constrained problems was proposed in [6, 20].

Usually, the step-size and stopping rule for mirror descent algorithms require knowledge of the Lipschitz constants of the objective function and of the constraint, if any. Adaptive step-sizes, which do not require this information, are considered for unconstrained problems in [8] and for constrained problems in [6]. Some optimal mirror descent algorithms for convex optimization problems with a non-smooth convex functional constraint, with both adaptive step-sizes and adaptive stopping rules, are proposed in [5]. Some modifications of these algorithms for problems with many functional constraints were considered in [31].

If we focus on the minimization of an objective function consisting of a large number of component functionals, such as $f(x)=\sum\limits_{j=1}^{N}f_{j}(x)$, where $f_{j}:\mathbb{R}^{n}\rightarrow\mathbb{R}$, $j=\overline{1,N}$, are convex, then computing a single (sub)gradient $\nabla f(x)=\sum\limits_{j=1}^{N}\nabla f_{j}(x)$ in each iteration of any iterative minimization procedure becomes very expensive. Therefore, there is an incentive to calculate a stochastic (sub)gradient $\nabla f(x,\zeta)$, where $\zeta$ is a random variable taking its values in $\{1,\ldots,N\}$. This means that $\nabla f(x,\zeta)=\nabla f_{i}(x)$, where $i$ is chosen randomly in each iteration from the set $\{1,\ldots,N\}$. Alternatively, one can employ a mini-batch approach, in which a small subset $S\subset\{1,\ldots,N\}$ is chosen randomly and $\nabla f(x,\zeta)=\sum\limits_{i\in S}\nabla f_{i}(x)$. A (sub)gradient calculated in this randomized way is known as a stochastic (sub)gradient.

In the stochastic version of an optimization method, the exact first-order information is replaced with its unbiased stochastic estimates when the exact first-order information is unavailable. This permits accelerating the solution process, with the gain from randomization growing progressively with the problem size. A different approach to solving stochastic optimization problems is called stochastic approximation (SA), which was initially proposed in a seminal paper by Robbins and Monro in 1951 [28]. An important improvement of this algorithm was developed by Polyak and Juditsky [24, 25]. More recently, Nemirovski et al. [19] presented a modified stochastic approximation method and demonstrated its superior numerical performance for solving a general class of non-smooth convex problems.

This paper is devoted to a new modification of an adaptive stochastic mirror descent algorithm (Algorithm 4 in [5], listed below as Algorithm 1), which was proposed to solve the stochastic setup (randomized version) of convex minimization problems with several convex functional constraints. This means that we can still use the values of the objective function and functional constraints at the current point, but instead of their (sub)gradients we use their stochastic (sub)gradients. Namely, we consider a first-order unbiased oracle that produces stochastic (sub)gradients of the objective function and functional constraints; see for example [13, 29]. We consider an arbitrary proximal structure and problems with a non-smooth Lipschitz-continuous objective function. Furthermore, we prove a theorem estimating the rate of convergence of the proposed modification. From this theorem it follows that the modified algorithm achieves the optimal complexity of order $O\left(\varepsilon^{-2}\right)$ for the class of problems under consideration (see [20]).

The rest of the paper is organized as follows. In Section 2 we give some basic notation and summarize the problem statement and the basics of standard mirror descent. In Section 3 we present the adaptive stochastic mirror descent algorithm (Algorithm 4 in [5]). Section 4 is devoted to the proposed modified algorithm and to the proof of a theorem on the rate of convergence of this algorithm and its optimal complexity estimate. In the last section, we consider some numerical experiments that allow us to compare the performance of the standard algorithm and its proposed modification on certain examples.

2 Problem Statement and Standard Mirror Descent Basics

Let $\mathbb{V}$ be a finite-dimensional vector space endowed with the norm $\|\cdot\|$, and let $\mathbb{V}^{*}$ be the conjugate space of $\mathbb{V}$ with the norm

$$\|h\|_{*}=\max\limits_{x}\big\{\langle h,x\rangle,\ \|x\|\leq 1\big\},$$

where $\langle h,x\rangle$ is the value of the continuous linear functional $h$ at $x\in\mathbb{V}$.

Let $Q\subset\mathbb{V}$ be a closed convex set, and let $f$ and $g_{j}:Q\rightarrow\mathbb{R}\,(j=\overline{1,m})$ be convex subdifferentiable functionals. We assume that $f$ and $g_{j}\,(j=\overline{1,m})$ are Lipschitz-continuous, i.e. there exist $M_{f}>0$ and $M_{g}>0$ such that

$$|f(x)-f(y)|\leq M_{f}\|x-y\|\quad\forall x,y\in Q,$$

$$|g_{j}(x)-g_{j}(y)|\leq M_{g}\|x-y\|\quad\forall x,y\in Q,\ j=\overline{1,m}.$$

It is clear that instead of a set of functionals $\{g_{j}(\cdot)\}_{j=1}^{m}$ we can consider a single functional $g:Q\rightarrow\mathbb{R}$ such that

$$g(x)=\max\limits_{j=\overline{1,m}}\{g_{j}(x)\},\qquad |g(x)-g(y)|\leq M_{g}\|x-y\|\quad\forall x,y\in Q.$$

This means that at every point $x\in Q$ there is a subgradient $\nabla g(x)$ with $\|\nabla g(x)\|_{*}\leq M_{g}$. Recall that for a differentiable functional $g$, the subgradient $\nabla g(x)$ coincides with the usual gradient.

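In this paper, we consider the stochastic setup of the following convex constrained optimization problem

$$f(x)\to\min\limits_{x\in Q,\ g(x)\leq 0}.\tag{3}$$

For the stochastic setup of problem (3), we introduce the following assumptions (see [4, 5]). Given a point $x\in Q$, we can calculate the stochastic (sub)gradients $\nabla f(x,\xi)$ and $\nabla g(x,\zeta)$, where $\xi$ and $\zeta$ are random vectors. These stochastic (sub)gradients satisfy

$$\mathbb{E}[\nabla f(x,\xi)]=\nabla f(x)\in\partial f(x)\quad\text{and}\quad\mathbb{E}[\nabla g(x,\zeta)]=\nabla g(x)\in\partial g(x),\tag{4}$$

where $\mathbb{E}$ denotes the expectation, and

$$\|\nabla f(x,\xi)\|_{*}\leq M_{f}\quad\text{and}\quad\|\nabla g(x,\zeta)\|_{*}\leq M_{g}\quad\text{a.s. in }\xi,\zeta.\tag{5}$$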

To motivate these assumptions, let $S_{n}(1)=\left\{x\in\mathbb{R}_{+}^{n}\;|\;\displaystyle\sum_{i=1}^{n}x_{i}=1\right\}$ be the standard unit simplex in $\mathbb{R}^{n}$, and consider the following optimization problem

$$\begin{cases}f(x)=\frac{1}{2}\langle Ax,x\rangle\to\min\limits_{x\in S_{n}(1)},\\ \text{s.t. } g(x)=\max\limits_{i=\overline{1,m}}\{\langle c_{i},x\rangle\}\leq 0,\end{cases}$$

where $A$ is a given $n\times n$ matrix and $c_{i}\;(i=\overline{1,m})$ are given vectors in $\mathbb{R}^{n}$ (see [5]).

The exact computation of the gradient $\nabla f(x)=Ax$ takes $O(n^{2})$ arithmetic operations, which is expensive for huge-scale optimization problems, where $n$ is very large. In this setting, it is natural to use randomization to construct a stochastic approximation of $\nabla f(x)$. Let $\xi$ be a random variable taking the values $1,\ldots,n$ with probabilities $x_{1},\ldots,x_{n}$, respectively. Let $A^{\langle i\rangle}$ denote the $i$-th column of the matrix $A$. Since $x\in S_{n}(1)$,

$$\mathbb{E}\big[A^{\langle\xi\rangle}\big]=A^{\langle 1\rangle}\mathbb{P}(\xi=1)+\dots+A^{\langle n\rangle}\mathbb{P}(\xi=n)=A^{\langle 1\rangle}x_{1}+\dots+A^{\langle n\rangle}x_{n}=Ax,$$

where $\mathbb{P}$ denotes the probability of an event.

Thus, we can use $A^{\langle\xi\rangle}$ as a stochastic gradient of $f$ (i.e. $\nabla f(x,\xi)=A^{\langle\xi\rangle}$ ), which can be calculated in $O(n)$ arithmetic operations.
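The column-sampling construction above can be illustrated with a minimal sketch in pure Python. The $3\times 3$ matrix `A`, the point `x` on the simplex, and all function names are illustrative assumptions, not taken from the paper. The sketch draws $\xi$ with $\mathbb{P}(\xi=i)=x_{i}$, returns the column $A^{\langle\xi\rangle}$, and compares the exact expectation and an empirical average with $Ax$.

```python
import random


def column(A, i):
    """Return the i-th column of the row-major matrix A (O(n) work)."""
    return [row[i] for row in A]


def matvec(A, x):
    """Exact product A x (O(n^2) work)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def stochastic_gradient(A, x, rng):
    """Draw xi with P(xi = i) = x_i and return the column A^<xi>."""
    xi = rng.choices(range(len(x)), weights=x, k=1)[0]
    return column(A, xi)


def expected_stochastic_gradient(A, x):
    """Exact expectation sum_i x_i * A^<i>, computed by enumeration."""
    n = len(x)
    return [sum(x[i] * A[r][i] for i in range(n)) for r in range(len(A))]


# Hypothetical data: a small symmetric matrix and a point on the simplex.
A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
x = [0.2, 0.5, 0.3]

exact = matvec(A, x)
expected = expected_stochastic_gradient(A, x)

rng = random.Random(0)
samples = [stochastic_gradient(A, x, rng) for _ in range(20000)]
empirical = [sum(s[r] for s in samples) / len(samples) for r in range(len(A))]
```

Each call to `stochastic_gradient` touches one column only, whereas `matvec` touches the whole matrix, which is the source of the saving described in the text.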

Let $d:Q\rightarrow\mathbb{R}$ be a distance generating function, which is continuously differentiable and $1$-strongly convex with respect to the norm $\|\cdot\|$, i.e.

$$d(y)\geq d(x)+\langle\nabla d(x),y-x\rangle+\frac{1}{2}\|y-x\|^{2}\quad\forall x,y\in Q,$$

and assume that $\min\limits_{x\in Q}d(x)=d(0).$ Suppose we have a constant $\Theta_{0}>0$ such that $d(x_{*})\leq\Theta_{0}^{2},$ where $x_{*}$ is a solution to problem (3).

Note that if there is a set of optimal points $X_{*}\subset Q$ for (3), we may assume that

$$\min\limits_{x_{*}\in X_{*}}d(x_{*})\leq\Theta_{0}^{2}.$$

For all $x,y\in Q\subset\mathbb{V}$, we consider the corresponding Bregman divergence, which was initially studied by Bregman [10] and later by many others (see [3]),

$$V_{x}(y)=d(y)-d(x)-\langle\nabla d(x),y-x\rangle.$$

In particular, in the standard proximal setup (i.e. the Euclidean setup) we can choose $d(x)=\frac{1}{2}\|x\|_{2}^{2}$; then $V_{x}(y)=\frac{1}{2}\|x-y\|_{2}^{2}$. Other setups, for example entropy, $\ell_{1}/\ell_{2}$, simplex, spectahedron and many others, can be found in [8].
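As a small illustration of the definition $V_{x}(y)=d(y)-d(x)-\langle\nabla d(x),y-x\rangle$, the following sketch evaluates the Bregman divergence for two distance generating functions. The Euclidean choice is the one stated in the text. The entropy choice $d(x)=\sum_{i}x_{i}\ln x_{i}$ is the usual negative-entropy function and is an assumption here, since the text only names the entropy setup. All function names and the test points are hypothetical.

```python
import math


def bregman(d, grad_d, x, y):
    """V_x(y) = d(y) - d(x) - <grad d(x), y - x>."""
    inner = sum(g * (yi - xi) for g, xi, yi in zip(grad_d(x), x, y))
    return d(y) - d(x) - inner


# Euclidean setup: d(x) = 0.5 * ||x||_2^2.
def d_euclid(x):
    return 0.5 * sum(c * c for c in x)


def grad_euclid(x):
    return list(x)


# Entropy setup (assumed form): d(x) = sum_i x_i ln x_i on the simplex.
def d_entropy(x):
    return sum(c * math.log(c) for c in x)


def grad_entropy(x):
    return [math.log(c) + 1.0 for c in x]


# Two hypothetical points in the interior of the unit simplex.
x = [0.2, 0.3, 0.5]
y = [0.6, 0.1, 0.3]

v_euclid = bregman(d_euclid, grad_euclid, x, y)
v_entropy = bregman(d_entropy, grad_entropy, x, y)

# Closed forms to compare against.
half_sq_dist = 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))
kl = sum(b * math.log(b / a) for a, b in zip(x, y))
```

For points on the simplex the entropy divergence coincides with the Kullback-Leibler divergence `kl`, and the Euclidean one with `half_sq_dist`.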

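We also assume that a constant $\Theta_{0}>0$ is known such that

$$\sup\limits_{x,y\in Q}V_{x}(y)\leq\Theta_{0}^{2}.\tag{6}$$

For all $x\in Q$ and $p\in\mathbb{V}^{*}$, the proximal mapping operator (mirror descent step) is defined as

$$\mathrm{Mirr}_{x}(p)=\arg\min\limits_{u\in Q}\big\{\langle p,u\rangle+V_{x}(u)\big\}.$$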

We make the simplicity assumption, which means that $\mathrm{Mirr}_{x}(p)$ is easily computable.
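For instance, in the Euclidean setup with $Q$ the unit ball (the setting used later in the experiments), minimizing $\langle p,u\rangle+\frac{1}{2}\|u-x\|_{2}^{2}$ over $Q$ amounts to projecting $x-p$ onto $Q$. The sketch below illustrates this special case only; the function names and the sample points are hypothetical.

```python
import math


def project_onto_unit_ball(v):
    """Euclidean projection of v onto {u : ||u||_2 <= 1}."""
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= 1.0 else [c / norm for c in v]


def mirr_euclidean(x, p):
    """Mirr_x(p) for d(x) = 0.5*||x||_2^2 and Q the unit ball."""
    return project_onto_unit_ball([xi - pi for xi, pi in zip(x, p)])


def prox_objective(x, p, u):
    """<p, u> + V_x(u) with V_x(u) = 0.5*||u - x||_2^2."""
    linear = sum(pi * ui for pi, ui in zip(p, u))
    divergence = 0.5 * sum((ui - xi) ** 2 for ui, xi in zip(u, x))
    return linear + divergence


# Hypothetical example: the unconstrained minimizer x - p = (2, 0) lies
# outside the ball, so the mirror step returns its projection.
x = [0.5, 0.0]
p = [-1.5, 0.0]
u_star = mirr_euclidean(x, p)
```

Comparing `prox_objective(x, p, u_star)` with the value at any other feasible `u` gives a quick sanity check of the minimization property.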

Let $x_{*}$ be a solution to (3) and let $\varepsilon>0$ be given. We say that a (random) point $\hat{x}\in Q$ is an expected $\varepsilon$-solution to (3) if

$$\mathbb{E}[f(\hat{x})]-f(x_{*})\leq\varepsilon\quad\text{and}\quad g(\hat{x})\leq\varepsilon.\tag{7}$$

The following well-known lemma describes the main property of the proximal mapping operator (see [5, 8]).

Lemma 1

Let $f:Q\rightarrow\mathbb{R}$ be a convex subdifferentiable function over the convex set $Q$ and let $z=\mathrm{Mirr}_{y}\left(h\nabla f(y,\xi)\right)$ for some $h>0$, $y,z\in Q$ and a random vector $\xi$. Then for each $x\in Q$ we have

$$h\big(f(y)-f(x)\big)\leq\frac{h^{2}}{2}\|\nabla f(y,\xi)\|_{*}^{2}+V_{y}(x)-V_{z}(x)+h\langle\nabla f(y,\xi)-\nabla f(y),y-x\rangle.$$

3 Adaptive Stochastic Mirror Descent Algorithm

An adaptive method for the convex optimization problem (3) in the stochastic setup described above was considered in [5] (see Algorithm 1). In this setting, the output of the algorithm is random, in the sense of (7). The method is adaptive in terms of both the step-size and the stopping rule, which means that we do not need to know the constants $M_{f}$ and $M_{g}$ in advance. We assume that, on each iteration of the algorithm, independent realizations of the random variables $\xi$ and $\zeta$ are generated. In this section, we present this algorithm and the main estimate of its convergence rate.

As can be seen from the steps of Algorithm 1, the output point (Ensure) is selected among the points $x^{k}$ for which $g(x^{k})\leq\varepsilon$. Therefore, we call step $k$ productive if $g(x^{k})\leq\varepsilon$. If the reverse inequality $g(x^{k})>\varepsilon$ holds, then step $k$ is called non-productive.

Let $I$ and $J$ denote the sets of indices of productive and non-productive steps produced by Algorithm 1, respectively, and let $N_{I}$ and $N_{J}$ denote the numbers of productive and non-productive steps, respectively.

For the complexity estimate of Algorithm 1, the next result was obtained in [4, 5].

Theorem 3.1

Let equalities (4) and inequalities (5) hold. Assume that a known constant $\Theta_{0}>0$ is such that inequality (6) holds. Then Algorithm 1 stops after no more than

$$N=\left\lceil\frac{4\max\{M_{f}^{2},M_{g}^{2}\}\Theta_{0}^{2}}{\varepsilon^{2}}\right\rceil$$

iterations and $\bar{x}^{N}$ is an expected $\varepsilon$ -solution to problem (3) in the sense of (7).
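The productive and non-productive steps described in this section can be sketched as follows, in the Euclidean setup on the unit ball. This is a hedged sketch, not the listing from [5]: the step-size $h_{k}=\Theta_{0}\big(\sum_{i=0}^{k}M_{i}^{2}\big)^{-1/2}$ (with $M_{k}$ the norm of the stochastic (sub)gradient used at step $k$) and the stopping rule $\varepsilon k\geq 2\Theta_{0}\big(\sum_{i<k}M_{i}^{2}\big)^{1/2}$ are assumptions inferred from the convergence analysis in Section 4. The test problem, the noise model and all names are hypothetical.

```python
import math
import random


def project_onto_unit_ball(v):
    norm = math.sqrt(sum(c * c for c in v))
    return list(v) if norm <= 1.0 else [c / norm for c in v]


def adaptive_smd(f_grad, g_value, g_grad, x0, eps, theta0, rng):
    """Adaptive stochastic mirror descent (sketch, Euclidean setup)."""
    x = list(x0)
    sum_sq = 0.0                      # running sum of M_i^2
    productive_sum = [0.0] * len(x)   # sum of productive points x^k
    n_productive = 0
    k = 0
    while True:
        if g_value(x) <= eps:         # productive step: use the objective
            productive_sum = [s + xi for s, xi in zip(productive_sum, x)]
            n_productive += 1
            direction = f_grad(x, rng)
        else:                         # non-productive step: use the constraint
            direction = g_grad(x, rng)
        sum_sq += sum(c * c for c in direction)   # M_k^2, assumed nonzero
        h = theta0 / math.sqrt(sum_sq)            # assumed adaptive step-size
        x = project_onto_unit_ball([xi - h * di for xi, di in zip(x, direction)])
        k += 1
        if eps * k >= 2.0 * theta0 * math.sqrt(sum_sq):   # assumed stopping rule
            break
    # Output: average of the productive points (assumes at least one exists).
    return [s / n_productive for s in productive_sum], k


# Hypothetical test problem: minimize x1 + x2 subject to -x_i - 0.5 <= 0.
SIGMA = 0.1   # bound on the uniform gradient noise (assumption)


def f_grad(x, rng):
    return [1.0 + rng.uniform(-SIGMA, SIGMA) for _ in x]


def constraint_values(x):
    return [-xi - 0.5 for xi in x]


def g_value(x):
    return max(constraint_values(x))


def g_grad(x, rng):
    values = constraint_values(x)
    j = values.index(max(values))     # exact gradient of an active constraint
    return [-1.0 if i == j else 0.0 for i in range(len(x))]


eps, theta0 = 0.05, math.sqrt(2.0)
x_bar, iterations = adaptive_smd(
    f_grad, g_value, g_grad, [0.0, 0.0], eps, theta0, random.Random(0)
)
# Iteration bound of Theorem 3.1 with max{M_f^2, M_g^2} = 2 * (1 + SIGMA)^2.
bound = math.ceil(4.0 * 2.0 * (1.0 + SIGMA) ** 2 * theta0 ** 2 / eps ** 2)
```

With these assumed rules the loop cannot run longer than `bound` iterations, because $\sum_{i<k}M_{i}^{2}\leq k\max\{M_{f}^{2},M_{g}^{2}\}$; the returned average satisfies $g(\bar{x})\leq\varepsilon$ by convexity of $g$.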

4 The Modification of an Adaptive Stochastic Mirror Descent Algorithm

In this section, we consider a modification of Algorithm 1. The idea of this modification was considered in [31] for some adaptive mirror descent algorithms that solve the deterministic setup of convex optimization problems with Lipschitz-continuous functional constraints. The idea can be summarized as follows: when step $k$ is non-productive, i.e. $g(x^{k})>\varepsilon$, then instead of calculating the subgradient of the max-type functional constraint $g(x)=\max\limits_{i=\overline{1,m}}\{g_{i}(x)\}$, we calculate the (sub)gradient of a single functional $g_{j}$ for which $g_{j}(x^{k})>\varepsilon$. Because not all functional constraints are considered on non-productive steps, the proposed modification reduces the running time of the algorithm.
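The difference between the two constraint-handling strategies can be sketched as follows. The helper names, the scan order (first violated constraint) and the three affine constraints are illustrative assumptions; the text only requires some $g_{j}$ with $g_{j}(x^{k})>\varepsilon$.

```python
def max_type_constraint(constraints, x):
    """Evaluate all m constraints; return (argmax index, max value, evaluations)."""
    values = [g(x) for g in constraints]
    j = max(range(len(values)), key=values.__getitem__)
    return j, values[j], len(values)


def first_violated(constraints, x, eps):
    """Scan constraints in order and stop at the first one with g_j(x) > eps.

    Returns (index or None, evaluations); None means the step is productive.
    """
    for j, g in enumerate(constraints):
        if g(x) > eps:
            return j, j + 1
    return None, len(constraints)


def make_affine(alpha, beta):
    """Build g(x) = <alpha, x> + beta."""
    return lambda x: sum(a * xi for a, xi in zip(alpha, x)) + beta


# Hypothetical constraints in R^2.
constraints = [
    make_affine([1.0, 0.0], -0.5),   # x1 <= 0.5
    make_affine([0.0, 1.0], -0.5),   # x2 <= 0.5
    make_affine([1.0, 1.0], -0.8),   # x1 + x2 <= 0.8
]
eps = 0.05
x = [0.6, 0.45]

j_max, g_max, evals_max = max_type_constraint(constraints, x)
j_first, evals_first = first_violated(constraints, x, eps)
```

At this point the max-type rule evaluates all three constraints and selects the third one, while the first-violated rule stops after a single evaluation. On a productive step both rules still have to check every constraint.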

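Denote

$$\delta_{k}=\begin{cases}\langle\nabla f(x^{k},\xi^{k})-\nabla f(x^{k}),x^{k}-x_{*}\rangle & \text{if } k\in I,\\ \langle\nabla g(x^{k},\zeta^{k})-\nabla g(x^{k}),x^{k}-x_{*}\rangle & \text{if } k\in J.\end{cases}$$

By Lemma 1, with $y=x^{k},z=x^{k+1}$ and $x=x_{*}$, we have for all $k\in I$

$$f(x^{k})-f(x_{*})\leq\frac{h_{k}}{2}\|\nabla f(x^{k},\xi^{k})\|_{*}^{2}+\frac{V_{x^{k}}(x_{*})-V_{x^{k+1}}(x_{*})}{h_{k}}+\langle\nabla f(x^{k},\xi^{k})-\nabla f(x^{k}),x^{k}-x_{*}\rangle.\tag{9}$$

Similarly, for all $k\in J$ we have (recall that $g_{j(k)}(\cdot)$ denotes any constraint such that $g_{j(k)}(x^{k})>\varepsilon$)

$$g_{j(k)}(x^{k})-g_{j(k)}(x_{*})\leq\frac{h_{k}}{2}\|\nabla g_{j(k)}(x^{k},\zeta^{k})\|_{*}^{2}+\frac{V_{x^{k}}(x_{*})-V_{x^{k+1}}(x_{*})}{h_{k}}+\langle\nabla g_{j(k)}(x^{k},\zeta^{k})-\nabla g_{j(k)}(x^{k}),x^{k}-x_{*}\rangle.\tag{10}$$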

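Summing both sides of (9) and (10) over the productive and non-productive steps, respectively, we get

$$\begin{aligned}\sum\limits_{k\in I}\big(f(x^{k})-f(x_{*})\big)+\sum\limits_{k\in J}\big(g_{j(k)}(x^{k})-g_{j(k)}(x_{*})\big)&\leq\sum\limits_{k\in I}\frac{h_{k}}{2}\|\nabla f(x^{k},\xi^{k})\|_{*}^{2}+\sum\limits_{k\in J}\frac{h_{k}}{2}\|\nabla g_{j(k)}(x^{k},\zeta^{k})\|_{*}^{2}\\ &\quad+\sum\limits_{k=0}^{N-1}\frac{1}{h_{k}}\big(V_{x^{k}}(x_{*})-V_{x^{k+1}}(x_{*})\big)+\sum\limits_{k=0}^{N-1}\delta_{k}.\end{aligned}$$

Using (6), we have

$$\begin{aligned}\sum\limits_{k=0}^{N-1}\frac{1}{h_{k}}\big(V_{x^{k}}(x_{*})-V_{x^{k+1}}(x_{*})\big)&=\frac{1}{h_{0}}V_{x^{0}}(x_{*})+\sum\limits_{k=0}^{N-2}\Big(\frac{1}{h_{k+1}}-\frac{1}{h_{k}}\Big)V_{x^{k+1}}(x_{*})-\frac{1}{h_{N-1}}V_{x^{N}}(x_{*})\\ &\leq\frac{\Theta_{0}^{2}}{h_{0}}+\Theta_{0}^{2}\sum\limits_{k=0}^{N-2}\Big(\frac{1}{h_{k+1}}-\frac{1}{h_{k}}\Big)=\frac{\Theta_{0}^{2}}{h_{N-1}}.\end{aligned}$$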

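Whence, by the definition of the step-sizes $h_{k}$,

$$\begin{aligned}\sum\limits_{k\in I}\big(f(x^{k})-f(x_{*})\big)+\sum\limits_{k\in J}\big(g_{j(k)}(x^{k})-g_{j(k)}(x_{*})\big)&\leq\frac{\Theta_{0}}{2}\sum\limits_{k=0}^{N-1}\frac{M_{k}^{2}}{\big(\sum_{i=0}^{k}M_{i}^{2}\big)^{1/2}}+\Theta_{0}\Big(\sum\limits_{k=0}^{N-1}M_{k}^{2}\Big)^{1/2}+\sum\limits_{k=0}^{N-1}\delta_{k}\\ &\leq 2\Theta_{0}\Big(\sum\limits_{k=0}^{N-1}M_{k}^{2}\Big)^{1/2}+\sum\limits_{k=0}^{N-1}\delta_{k},\end{aligned}\tag{12}$$

where we used the inequality

$$\sum\limits_{k=0}^{N-1}\frac{M_{k}^{2}}{\big(\sum_{i=0}^{k}M_{i}^{2}\big)^{1/2}}\leq 2\Big(\sum\limits_{k=0}^{N-1}M_{k}^{2}\Big)^{1/2},$$

which can be proved by induction. Since, for $k\in J$, $g_{j(k)}(x^{k})-g_{j(k)}(x_{*})\geq g_{j(k)}(x^{k})>\varepsilon$, we get

$$\sum\limits_{k\in J}\big(g_{j(k)}(x^{k})-g_{j(k)}(x_{*})\big)>\sum\limits_{k\in J}\varepsilon=\varepsilon N_{J}.$$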
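The auxiliary inequality $\sum_{k=0}^{N-1}M_{k}^{2}\big(\sum_{i=0}^{k}M_{i}^{2}\big)^{-1/2}\leq 2\big(\sum_{k=0}^{N-1}M_{k}^{2}\big)^{1/2}$ used in this derivation can also be checked by a telescoping argument, given here as a sketch in place of the induction mentioned in the text. The notation $S_{k}$ is introduced only for this sketch, and $M_{0}>0$ is assumed so that all denominators are positive.

```latex
% Notation for this sketch only: S_k = \sum_{i=0}^{k} M_i^2, with S_{-1} = 0.
\begin{align*}
M_k^2 &= S_k - S_{k-1}
       = \big(\sqrt{S_k}-\sqrt{S_{k-1}}\big)\big(\sqrt{S_k}+\sqrt{S_{k-1}}\big)
       \le 2\sqrt{S_k}\,\big(\sqrt{S_k}-\sqrt{S_{k-1}}\big), \\
\frac{M_k^2}{\sqrt{S_k}} &\le 2\big(\sqrt{S_k}-\sqrt{S_{k-1}}\big), \\
\sum_{k=0}^{N-1}\frac{M_k^2}{\sqrt{S_k}}
      &\le 2\sum_{k=0}^{N-1}\big(\sqrt{S_k}-\sqrt{S_{k-1}}\big)
       = 2\sqrt{S_{N-1}}
       = 2\Big(\sum_{k=0}^{N-1}M_k^2\Big)^{1/2}.
\end{align*}
```

The first line uses $\sqrt{S_{k-1}}\leq\sqrt{S_{k}}$, and the last line is a telescoping sum.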

Thus from (12) and the stopping criterion of Algorithm 2, we have

[TABLE]

We can rewrite (13) as follows

[TABLE]

By the convexity of $f$ , we get

[TABLE]

where $f^{*}=f(x_{*})$ . By the definition of $\bar{x}^{N}$ (see the Ensure of Algorithm 2), we get the following inequality

[TABLE]

Since inequality (16) is strict, the case $I=\emptyset$ is impossible (i.e. $N_{I}\neq 0$). Now, taking the expectation in (16), we obtain

[TABLE]

but $\sum\limits_{k=0}^{N-1}\mathbb{E}\left[\frac{\delta_{k}}{N_{I}}\right]=0$ (see [4]). Thus

[TABLE]

At the same time, for $k\in I$ it holds that $g(x^{k})\leq\varepsilon$ . Then, by the definition of $\bar{x}^{N}$ and the convexity of $g$ we get

[TABLE]

Thus we arrive at the following result.

Theorem 4.1

Let equalities (4) and inequalities (5) hold. Assume that a known constant $\Theta_{0}>0$ is such that inequality (6) holds. Then Algorithm 2 stops after no more than

[TABLE]

iterations and $\bar{x}^{N}$ is an expected $\varepsilon$ -solution to problem (3) in the sense of (7).

Remark 1

From the estimate (18) we can see that Algorithm 2 achieves complexity of order $O\left(\varepsilon^{-2}\right)$, which is optimal for the studied class of non-smooth functions from the point of view of the theory of lower bounds, according to Nemirovski and Yudin (see [20]).

5 Numerical Experiments

In order to compare Algorithms 1 and 2 and to show the advantages of the proposed modified algorithm, some numerical tests were carried out. We consider different examples of the following non-smooth finite-sum problem

[TABLE]

where each summand $f_{i}$ is a Lipschitz-continuous function. This problem is ubiquitous in many areas and applications. In particular, in machine learning applications $f$ is the total loss function, whereas each $f_{i}$ represents the loss due to the $i$-th training sample [11, 27].

In our experiments, we consider the following two examples of problem (19).

Example 1

[TABLE]

where the coefficients $a_{i}\in\mathbb{R}^{n}$ and $b_{i}\in\mathbb{R}$ for each $i=1,\ldots,N$ .

Example 2

[TABLE]

where $C_{i}\in\mathbb{R}^{n\times n}$ , for each $i=1,\ldots,N$ , are positive definite matrices, i.e. $C_{i}\succ 0$ .

The coefficients $a_{i}\in\mathbb{R}^{n}$ and constants $b_{i}\in\mathbb{R}$ $(i=1,\ldots,N)$ in Example 1, for different values of $N$, are generated as follows. Let $A\in\mathbb{R}^{N\times(n+1)}$ be a matrix with entries drawn from different random distributions. Then $a_{i}^{T}$ are the rows of the matrix $A^{\prime}\in\mathbb{R}^{N\times n}$, which is obtained from $A$ by eliminating the last column, and $b_{i}$ are the entries of the last column of $A$. The positive definite matrices $C_{i}\succ 0$ $(i=1,\ldots,N)$ in Example 2, for different values of $N$, are likewise drawn from different random distributions. In more detail, the entries of $A$ and $C_{i}$ $(i=1,\ldots,N)$ are drawn as follows:

1. When $N=75$, from the Gumbel distribution with the location of the mode equal to $1$ and the scale parameter equal to $2$.

2. When $N=100$, from the standard exponential distribution with a scale parameter of $1$.

3. When $N=150$, from the uniform distribution over $[0,1)$.

For the functional constraint $g(x)=\max\limits_{i=\overline{1,m}}\{g_{i}(x)\}$, we take $m=50$, $n=1500$ and linear functionals $g_{i}(x)=\langle\alpha_{i},x\rangle+\beta_{i}$, where the coefficients $\alpha_{i}\in\mathbb{R}^{n}$ and $\beta_{i}\in\mathbb{R}$ for $i=1,\ldots,m$ are taken as follows. Let $B\in\mathbb{R}^{m\times(n+1)}$ be a Toeplitz matrix with the first row $(1,1,\ldots,1)\in\mathbb{R}^{n+1}$ and the first column $(1,2,\ldots,m)^{T}$. Then $\alpha_{i}^{T}$ are the rows of the matrix $B^{\prime}\in\mathbb{R}^{m\times n}$, which is obtained from $B$ by eliminating the last column, and $\beta_{i}$ are the entries of the last column of $B$, i.e. the eliminated column.

For clarification, when $m=10$ and $n=14$, the Toeplitz matrix $B$ with the first row $(1,1,\ldots,1)\in\mathbb{R}^{15}$ and the first column $(1,2,\ldots,10)^{T}$ has the form

$$B=\left(\begin{array}{ccccccccccccccc}
1&1&1&1&1&1&1&1&1&1&1&1&1&1&1\\
2&1&1&1&1&1&1&1&1&1&1&1&1&1&1\\
3&2&1&1&1&1&1&1&1&1&1&1&1&1&1\\
4&3&2&1&1&1&1&1&1&1&1&1&1&1&1\\
5&4&3&2&1&1&1&1&1&1&1&1&1&1&1\\
6&5&4&3&2&1&1&1&1&1&1&1&1&1&1\\
7&6&5&4&3&2&1&1&1&1&1&1&1&1&1\\
8&7&6&5&4&3&2&1&1&1&1&1&1&1&1\\
9&8&7&6&5&4&3&2&1&1&1&1&1&1&1\\
10&9&8&7&6&5&4&3&2&1&1&1&1&1&1
\end{array}\right).$$
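The construction of the constraint coefficients from the Toeplitz matrix can be sketched in pure Python as follows, for the small case $m=10$, $n=14$. The helper `toeplitz` and the other names are hypothetical; they are not taken from the paper's implementation.

```python
def toeplitz(first_column, first_row):
    """Toeplitz matrix: B[i][j] = first_column[i-j] if i >= j else first_row[j-i]."""
    return [
        [first_column[i - j] if i >= j else first_row[j - i]
         for j in range(len(first_row))]
        for i in range(len(first_column))
    ]


m, n = 10, 14
B = toeplitz(list(range(1, m + 1)), [1] * (n + 1))

alphas = [row[:-1] for row in B]   # rows of B' (B without its last column)
betas = [row[-1] for row in B]     # entries of the last column of B


def g(x):
    """Max-type constraint g(x) = max_i (<alpha_i, x> + beta_i)."""
    return max(
        sum(a * xi for a, xi in zip(alpha, x)) + beta
        for alpha, beta in zip(alphas, betas)
    )
```

Since the first row of $B$ consists of ones and $m\leq n$, every $\beta_{i}$ equals $1$ in this construction, so `g` evaluates to $1$ at the origin.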

We use the Euclidean proximal setup: the Euclidean norm with the squared Euclidean norm as the prox-function. We choose the starting point $x^{0}=\left(\frac{1}{\sqrt{n}},\frac{1}{\sqrt{n}},\ldots,\frac{1}{\sqrt{n}}\right)\in\mathbb{R}^{n}$, $\varepsilon=0.05$, and $Q=\{x=(x_{1},x_{2},\ldots,x_{n})\in\mathbb{R}^{n}\,|\,x_{1}^{2}+x_{2}^{2}+\ldots+x_{n}^{2}\leq 1\}.$

For any $x=(x_{1},\ldots,x_{n})$ and $y=(y_{1},\ldots,y_{n})$ in $Q$ , the following inequality holds

[TABLE]

Therefore, we can choose $\Theta_{0}=\sqrt{2}.$
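One way to see why $\Theta_{0}=\sqrt{2}$ is admissible in the Euclidean setup on the unit ball is the following short derivation. It is a sketch based on the triangle inequality and is not necessarily the exact chain of inequalities used in the paper.

```latex
% For x, y in Q = { x : ||x||_2 <= 1 } and V_x(y) = (1/2) ||x - y||_2^2:
\begin{align*}
V_x(y) = \frac{1}{2}\|x-y\|_2^2
       \le \frac{1}{2}\big(\|x\|_2+\|y\|_2\big)^2
       \le \frac{1}{2}(1+1)^2
       = 2 = \Theta_0^2 .
\end{align*}
```

Hence $\sup_{x,y\in Q}V_{x}(y)\leq\Theta_{0}^{2}$ holds with $\Theta_{0}=\sqrt{2}$, which is the requirement on $\Theta_{0}$ in inequality (6).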

Our experiments are motivated by the need to solve problem (3) when either the dimension $n$ is large or the objective function $f$ has a finite-sum structure, as in Examples 1 and 2, with a large number of components $N$.

We run Algorithms 1 and 2 on both Examples 1 and 2, with $m=50,n=1500$. The results of Algorithms 1 and 2 are presented in Table 1. These results compare the number of iterations and the running time (in seconds) of each algorithm.

All experiments were implemented in Python 3.4 on a computer with an Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, 1992 MHz, 4 Core(s), 8 Logical Processor(s), and 8 GB of RAM.

From Table 1, for both Examples 1 and 2, we can see that the modified Algorithm 2 outperforms Algorithm 1 in every experiment, both in the number of iterations and, especially, in the running time. The running time of Algorithm 2 is very small compared to the running time of Algorithm 1 (on average, it is about 25 times smaller). Such a reduction in running time is important in applications of mathematical optimization.
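The average speed-up quoted above can be recomputed directly from the running times in Table 1. The snippet below is plain arithmetic on those values; the dictionary layout and variable names are illustrative choices.

```python
# (Algorithm 1 time, Algorithm 2 time) in seconds, copied from Table 1.
times = {
    ("Example 1", 75): (618.79, 22.47),
    ("Example 1", 100): (254.34, 10.04),
    ("Example 1", 150): (139.99, 4.62),
    ("Example 2", 75): (2008.12, 82.38),
    ("Example 2", 100): (358.02, 15.3),
    ("Example 2", 150): (115.47, 5.45),
}

# Ratio of running times for each experiment.
speedups = {key: t1 / t2 for key, (t1, t2) in times.items()}
average_speedup = sum(speedups.values()) / len(speedups)

for key, ratio in sorted(speedups.items()):
    print(key, round(ratio, 1))
print("average", round(average_speedup, 1))
```

The individual ratios range from roughly 21 to roughly 30, and their mean is close to 25, in line with the figure stated in the text.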

Remark 2

As before, to compare Algorithms 1 and 2 with $m=50,n=100$ and different values of $N$, some additional numerical tests were carried out. The coefficients $\alpha_{i}\in\mathbb{R}^{n}$ and $\beta_{i}\in\mathbb{R}$, for each $i=1,\ldots,m$, are the entries of the Toeplitz matrix described above. The entries of the matrices $C_{i}\,(i=1,\ldots,N)$ are drawn from the uniform distribution over $[0,1)$. We run Algorithms 1 and 2 with the same parameters as before, $\varepsilon=0.05,\Theta_{0}=\sqrt{2}$, and the same set $Q$. The results of Algorithms 1 and 2 for Examples 1 and 2 are presented in Table 2. These results compare the number of iterations and the running time (in seconds) of each algorithm for different values of $N$.

From Table 2, we can see that Algorithm 2 works better than Algorithm 1 in terms of the number of iterations and, especially, the running time.

5.1 Additional Experiments: Fermat-Torricelli-Steiner problem

In this subsection, we report additional numerical experiments on an analogue of the well-known Fermat-Torricelli-Steiner problem with some non-smooth functional constraints.

For a given set $\{A_{k}=(a_{1k},a_{2k},\ldots,a_{nk});\,k=\overline{1,N}\}$ of $N$ points, in $n$ -dimensional Euclidean space $\mathbb{R}^{n}$ , we need to solve the problem (3), where the objective function $f$ is given by

[TABLE]

The functional constraint is given by $g(x)=\max\limits_{i\in\overline{1,m}}\{g_{i}(x)=\langle\alpha_{i},x\rangle+\beta_{i}\}$ , where the coefficients $\alpha_{i}\in\mathbb{R}^{n}$ and $\beta_{i}\in\mathbb{R}$ are taken as in the previous experiments (the entries of the Toeplitz matrix $B$ ).

We take the points $A_{k}(k\in\overline{1,N})$ in the unit ball $Q$ . The coordinates of these points are drawn from the uniform distribution over $[0,1)$ .

We choose the standard Euclidean proximal setup, starting point $x^{0}=\textbf{0}\in\mathbb{R}^{n}$ and $\Theta_{0}=\sqrt{2}$ . We run Algorithms 1 and 2 with $n=1000,m=250,N=100$ and different values of accuracy $\varepsilon\in\{1/2^{i}:i=1,2,3,4,5,6\}$ .
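To make this setup concrete, here is a minimal sketch of the data generation and of a stochastic subgradient for a Fermat-Torricelli-Steiner type objective. It rests on explicit assumptions: the objective is taken to be the classical mean of Euclidean distances $\frac{1}{N}\sum_{k}\|x-A_{k}\|_{2}$ (the exact formula and normalization used in the paper are not reproduced here), points whose norm exceeds one are rescaled onto the unit sphere so that they lie in $Q$, and all function names are hypothetical.

```python
import numpy as np

def make_points(N, n, rng):
    """Draw N points with coordinates uniform on [0, 1).

    Assumption: points with norm above 1 are rescaled to norm 1 so that
    every point lies in the unit ball Q.
    """
    A = rng.uniform(0.0, 1.0, size=(N, n))
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return A / np.maximum(norms, 1.0)

def objective(A, x):
    """Assumed objective: mean Euclidean distance from x to the points A_k."""
    return float(np.mean(np.linalg.norm(x - A, axis=1)))

def subgradient_term(A, x, k):
    """A subgradient of x -> ||x - A_k||_2 (the zero vector if x == A_k)."""
    d = x - A[k]
    norm = np.linalg.norm(d)
    return d / norm if norm > 0 else np.zeros_like(d)

def stochastic_subgradient(A, x, rng):
    """Unbiased estimate of a subgradient of the assumed objective:
    pick one index k uniformly at random and use that single term."""
    k = rng.integers(A.shape[0])
    return subgradient_term(A, x, k)

def full_subgradient(A, x):
    """Average of all N single-term subgradients (the expectation of the estimate)."""
    terms = [subgradient_term(A, x, k) for k in range(A.shape[0])]
    return np.mean(terms, axis=0)

rng = np.random.default_rng(0)           # fixed seed, for reproducibility only
A = make_points(N=100, n=1000, rng=rng)
x0 = np.zeros(1000)                      # starting point x^0 = 0
print(objective(A, x0))
print(np.linalg.norm(stochastic_subgradient(A, x0, rng)))
```

Each single-term subgradient has Euclidean norm at most one, so the estimate costs one distance computation per call instead of $N$, at the price of variance.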

The results of Algorithms 1 and 2 are presented in Fig. 1(a) (the number of iterations needed by the studied algorithms to reach an $\varepsilon$-solution of the proposed problem, as a function of accuracy) and Fig. 1(b) (the running time of the studied algorithms, in seconds, as a function of accuracy).

From Fig. 1(a) and Fig. 1(b), we see that both Algorithms 1 and 2 achieve the complexity of order $O\left(\varepsilon^{-2}\right)$, which is the optimal estimate for the studied class of non-smooth functions. However, Algorithm 2 is more efficient and works better than Algorithm 1 with respect to both the number of iterations and the running time. We note that the running time of Algorithm 1 is very long compared with that of Algorithm 2: Algorithm 2 needs a few seconds to reach its stopping criterion and produce a solution, whereas Algorithm 1 needs many minutes. The efficiency of Algorithm 2 is therefore reflected in its much higher execution speed compared with Algorithm 1.
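As a quick check of what the order $O\left(\varepsilon^{-2}\right)$ implies on the accuracy grid $\varepsilon_{i}=1/2^{i}$ used here, the following worked calculation may help. It assumes that the iteration count behaves like $K(\varepsilon)\approx c/\varepsilon^{2}$, where the symbol $K$ and the constant $c>0$ are introduced only for this illustration and are not taken from the paper.

```latex
K(\varepsilon) \approx \frac{c}{\varepsilon^{2}}, \qquad
\varepsilon_{i} = 2^{-i} \;\Longrightarrow\; K(\varepsilon_{i}) \approx c\,4^{i}, \qquad
\frac{K(\varepsilon_{i+1})}{K(\varepsilon_{i})} \approx 4, \qquad
\frac{K(\varepsilon_{6})}{K(\varepsilon_{1})} \approx 4^{5} = 1024 .
```

Under this assumption, each halving of $\varepsilon$ roughly quadruples the number of iterations, so the cost per iteration has a growing effect on the total running time as the accuracy is tightened.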

6 Conclusions

In this work, a new modification of an adaptive stochastic mirror descent algorithm was proposed to solve the stochastic setting of the convex minimization problem in the case of a Lipschitz-continuous objective function and several convex functional constraints. In each iteration of the proposed modified algorithm, we compute a stochastic (sub)gradient of either the objective function or a functional constraint, an approach that is prevalent and effective in machine learning, large-scale optimization problems, and their applications. The proposed modification reduces the running time of the algorithm because not all functional constraints need to be considered on non-productive steps. Furthermore, a theorem estimating the rate of convergence of the proposed modified algorithm was proved. Numerical experiments for a geometrical problem, the Fermat-Torricelli-Steiner problem with convex constraints, are presented. The results of these experiments illustrate the advantages of the modified Algorithm 2 and show that its running time is very small compared to the running time of the standard Algorithm 1.

Acknowledgments: The author is very grateful to Alexander V. Gasnikov, Fedor S. Stonyakin and Alexander G. Biryukov for fruitful discussions.

Bibliography

[1] Alkousa M. S.: On Some Stochastic Mirror Descent Methods for Constrained Online Optimization Problems. Computer Research and Modeling 11 (2), 205–217 (2019).
[2] Basu K., Nandy P.: Optimal Convergence for Stochastic Optimization with Multiple Expectation Constraints. (2019). https://arxiv.org/pdf/1906.03401.pdf
[3] Bauschke H. H., Borwein J. M., Combettes P. L.: Bregman monotone optimization algorithms. SIAM Journal on Control and Optimization 42 (2), 596–636 (2003).
[4] Bayandina A.: Adaptive Stochastic Mirror Descent for Constrained Optimization. (2017). https://arxiv.org/pdf/1705.02031.pdf
[5] Bayandina A., Dvurechensky P., Gasnikov A., Stonyakin F., Titov A.: Mirror descent and convex optimization problems with non-smooth inequality constraints. In: Large-Scale and Distributed Optimization, 181–213. Springer, Cham (2018).
[6] Beck A., Ben-Tal A., Guttmann-Beck N., Tetruashvili L.: The comirror algorithm for solving nonsmooth constrained convex problems. Operations Research Letters 38 (6), 493–498 (2010).
[7] Beck A., Teboulle M.: Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters 31 (3), 167–175 (2003).
[8] Ben-Tal A., Nemirovski A.: Lectures on Modern Convex Optimization. Society for Industrial and Applied Mathematics, Philadelphia (2001).