Random minibatch subgradient algorithms for convex problems with functional constraints

Angelia Nedich; Ion Necoara

arXiv:1903.02117·math.OC·January 11, 2024

TL;DR

This paper introduces randomized minibatch subgradient algorithms for non-smooth convex optimization problems with complex functional constraints, providing convergence analysis and demonstrating minibatch benefits.

Contribution

It proposes novel subgradient algorithms with random minibatch feasibility updates for convex problems with level set constraints, analyzing their convergence behavior.

Findings

01. Convergence rates are sublinear and optimal for the class of problems.

02. Rates explicitly depend on minibatch size, showing when minibatching improves performance.

03. Algorithms handle constraints given as convex level sets, not just simple sets.

Equations (301)

$f(x)$

$x\in X,\quad X\triangleq Y\cap\left(\cap_{\omega\in\mathscr{A}}X_{\omega}\right),$

$X_{\omega}=\{x\in\mathbb{R}^{n}\mid g_{\omega}(x)\leq 0\}\ \text{for every}\ \omega\in\mathscr{A},$

$f^{*}=\inf_{x\in X}f(x),\quad X^{*}=\{x\in X\mid f(x)=f^{*}\}.$

$f(y)\geq f(x)+\langle s_{f}(x),y-x\rangle+\frac{\mu}{2}\|y-x\|^{2}\quad\forall x,y\in Y,\ s_{f}(x)\in\partial f(x).$

$\|s_{f}(x)\|\leq M_{f}\quad\forall s_{f}(x)\in\partial f(x)\ \text{and}\ x\in Y.$

$\|d\|\leq M_{g}\quad\forall d\in\partial g_{\omega}(x),\ x\in Y\ \text{and}\ \omega\in\mathscr{A}.$

$f(x)-f^{*}\geq\frac{\mu}{2}\|x-x^{*}\|^{2}\quad\forall x\in X\subseteq Y.$

$\frac{\mu}{2}\|x-x^{*}\|^{2}\leq f(x)-f^{*}\leq\langle s_{f}(x),x-x^{*}\rangle\leq M_{f}\|x-x^{*}\|\quad\forall x\in X,\ x^{*}\in X^{*},$

$v_{k}=\Pi_{Y}[x_{k-1}-\alpha_{k-1}s_{f}(x_{k-1})],$

$z_{k}^{i}=v_{k}-\beta\,\frac{g^{+}_{\omega_{k}^{i}}(v_{k})}{\|d_{k}^{i}\|^{2}}\,d_{k}^{i}\quad\text{for}\ i=1:N,$

$x_{k}=\Pi_{Y}[\bar{z}_{k}],\quad\text{with}\ \bar{z}_{k}=\frac{1}{N}\sum_{i=1}^{N}z_{k}^{i}.$

$z_{k}^{i}=v_{k}-\beta\left(v_{k}-\Pi_{X_{\omega_{k}^{i}}}[v_{k}]\right).$

$\mathscr{F}_{k}=\{x_{0}\}\cup\{\omega_{t}^{j}\mid 1\leq t\leq k,\ 1\leq j\leq N\},$

${\rm dist}^{2}(y,X)\leq c\cdot\mathsf{E}\!\left[(g^{+}_{\omega_{k}^{i}}(y))^{2}\mid\mathscr{F}_{k-1}\right]\quad\forall y\in Y,\ k\geq 1\ \text{and}\ i=1,\ldots,N.$

$\|\Pi_{Y}[v]-y\|^{2}\leq\|v-y\|^{2}-\|\Pi_{Y}[v]-v\|^{2}\quad\text{for any}\ v\in\mathbb{R}^{n}\ \text{and}\ y\in Y.$

$cM_{g}^{2}\geq 1.$

$0=g^{+}_{\bar{\omega}}(\Pi_{X}[y])\geq g^{+}_{\bar{\omega}}(y)+\langle s_{g}(y),\Pi_{X}[y]-y\rangle\geq g^{+}_{\bar{\omega}}(y)-M_{g}\|\Pi_{X}[y]-y\|,$

$g^{+}_{\bar{\omega}}(y)\leq M_{g}\|\Pi_{X}[y]-y\|.$

$0=g^{+}_{\omega}(y)\leq M_{g}\|\Pi_{X}[y]-y\|.$

$g^{+}_{\omega}(y)\leq M_{g}\|\Pi_{X}[y]-y\|.$

${\rm dist}^{2}(y,X)\leq c\,\mathsf{E}\!\left[M_{g}^{2}\|\Pi_{X}[y]-y\|^{2}\mid\mathscr{F}_{k-1}\right]=cM_{g}^{2}\,{\rm dist}^{2}(y,X),$

$\|v_{k+1}-x^{*}\|^{2}+2\alpha_{k}(1-\rho)(f(\Pi_{X}[x_{k}])-f^{*})\leq(1-\alpha_{k}\rho\mu)\|x_{k}-x^{*}\|^{2}+2\alpha_{k}M_{f}\|\Pi_{X}[x_{k}]-x_{k}\|+\alpha_{k}^{2}M_{f}^{2}.$

$\|v_{k+1}-x^{*}\|^{2}\leq\|x_{k}-x^{*}\|^{2}-2\alpha_{k}(f(x_{k})-f(x^{*}))+\alpha_{k}^{2}M_{f}^{2}.$

$\begin{aligned}f(x_{k})-f(x^{*})&\geq\langle s_{f}(x^{*}),x_{k}-x^{*}\rangle+\frac{\mu}{2}\|x_{k}-x^{*}\|^{2}\\&=\langle s_{f}(x^{*}),\Pi_{X}[x_{k}]-x^{*}\rangle+\langle s_{f}(x^{*}),x_{k}-\Pi_{X}[x_{k}]\rangle+\frac{\mu}{2}\|x_{k}-x^{*}\|^{2}\\&\geq\langle s_{f}(x^{*}),x_{k}-\Pi_{X}[x_{k}]\rangle+\frac{\mu}{2}\|x_{k}-x^{*}\|^{2}\\&\geq-M_{f}\|\Pi_{X}[x_{k}]-x_{k}\|+\frac{\mu}{2}\|x_{k}-x^{*}\|^{2},\end{aligned}$

$f(x_{k})-f(x^{*})\geq-\|s_{f}(\Pi_{X}[x_{k}])\|\,\|\Pi_{X}[x_{k}]-x_{k}\|+f(\Pi_{X}[x_{k}])-f(x^{*}),$

$f(x_{k})-f(x^{*})\geq f(\Pi_{X}[x_{k}])-f(x^{*})-M_{f}\|\Pi_{X}[x_{k}]-x_{k}\|.$

$f(x_{k})-f(x^{*})$

$\|v_{k+1}-x^{*}\|^{2}$

Full text

A. Nedić, School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, USA. Email: [email protected]

I. Necoara (corresponding author), Department of Automatic Control and Systems Engineering, University Politehnica Bucharest, 060042 Bucharest, Romania. Email: [email protected]

Random minibatch subgradient algorithms for convex problems with functional constraints

Angelia Nedić

Ion Necoara

(Received: 28 February 2019 / Accepted: date)

Abstract

In this paper we consider non-smooth convex optimization problems with (possibly) infinite intersection of constraints. In contrast to the classical approach, where the constraints are usually represented as intersection of simple sets, which are easy to project onto, in this paper we consider that each constraint set is given as the level set of a convex but not necessarily differentiable function. For these settings we propose subgradient iterative algorithms with random minibatch feasibility updates. At each iteration, our algorithms take a subgradient step aimed at only minimizing the objective function and then a subsequent subgradient step minimizing the feasibility violation of the observed minibatch of constraints. The feasibility updates are performed based on either parallel or sequential random observations of several constraint components. We analyze the convergence behavior of the proposed algorithms for the case when the objective function is strongly convex and with bounded subgradients, while the functional constraints are endowed with a bounded first-order black-box oracle. For a diminishing stepsize, we prove sublinear convergence rates for the expected distances of the weighted averages of the iterates from the constraint set, as well as for the expected suboptimality of the function values along the weighted averages. Our convergence rates are known to be optimal for subgradient methods on this class of problems. Moreover, the rates depend explicitly on the minibatch size and show when minibatching helps a subgradient scheme with random feasibility updates.

Keywords: Convex minimization, functional constraints, subgradient algorithms, random minibatch projection algorithms, convergence rates.

1 Introduction

The large sum of functions in the objective function and/or the large number of constraints in most practical optimization applications have made stochastic optimization an essential tool for many areas of applied mathematics, such as machine learning and statistics MouBac:11; Vap:98, constrained control PatNec:17, sensor networks BlaHer:06, computer science KunBac:18, inverse problems BotHei:12, operations research and finance RocUry:00. For example, in machine learning applications the optimization algorithms involve numerical computation of parameters for a system designed to make decisions based on yet unseen data MouBac:11; Vap:98. In particular, in support vector machines one maps the data into a higher dimensional input space and constructs an optimal separating hyperplane in this space by learning, possibly online, the hyperplanes corresponding to each data point in the training set Vap:98. This leads to a convex optimization problem with a large number of functional constraints.

Contributions. To deal with such optimization problems having a (possibly) infinite number of functional constraints, we propose subgradient methods with random feasibility updates. At each iteration, the algorithms take a subgradient step aimed only at minimizing the objective function, followed by a feasibility step that reduces the feasibility violation of the observed minibatch of convex constraints through Polyak’s subgradient iteration Pol:67; Pol:01. The feasibility updates in the first algorithm are performed using parallel random observations of several constraint components, while in the second algorithm we consider sequential random observations of constraints. Both algorithms are reminiscent of a learning process where we try to learn the constraint set while simultaneously minimizing an objective function. The proposed algorithms are applicable to the situation where the whole constraint set of the problem is not known in advance, but is rather learned over time through observations. Also, these algorithms are of interest for (non-smooth) constrained optimization problems where the constraints are known but their number is either large or not finite.

We study the convergence properties of the proposed random minibatch subgradient algorithms for the case when the objective function need not be differentiable but is strongly convex, while the functional constraints are accessed through a bounded first-order black-box oracle. In doing so, we avoid the need for projections onto the set of constraints, which may be computationally expensive. For a diminishing stepsize, we prove sublinear convergence rates of order ${\cal O}(1/t)$, where $t$ is the iteration counter, for the expected distances of the weighted averages of the iterates from the constraint set, as well as for the expected suboptimality of the function values along the weighted averages. Our convergence rates are known to be optimal for this class of subgradient schemes for solving non-smooth convex problems with functional constraints. Moreover, our rates depend explicitly on the minibatch size and show when minibatching works for a subgradient method with random feasibility updates. To the best of our knowledge, this is the first work proving that subgradient methods with random minibatch feasibility steps are better than their non-minibatch variants. More explicitly, the convergence estimate for the parallel algorithm depends on a key parameter $L_{N}$ (see eq. (15) below), which determines whether minibatching helps ($L_{N}<1$) or not ($L_{N}=1$) and by how much (the smaller $L_{N}$, the better the complexity), see Theorem 2. For the sequential variant, we show that minibatching always helps and the complexity depends exponentially on the minibatch size (see Theorem 3).

Related works. In spite of its wide applicability, the study of efficient solution methods for optimization problems with many constraints is still limited. The most prominent approach is stochastic gradient descent (SGD) MouBac:11; NemJud:09; PolJud:92. Even though SGD is a well-developed methodology, it only applies to optimization problems with simple constraints, requiring the whole feasible set to be projectable. A line of work known as alternating projections focuses on applying random projections for solving problems that involve the intersection of a (possibly infinite) number of sets. The case when the objective function is not present in the formulation, which corresponds to the convex feasibility problem, is studied e.g. in BauBor:96; KeyZho:16; Ned:10; NecRic:18. For this particular setting, Ned:10; NecRic:18 combine the smoothing technique with (minibatch) SGD, leading to stochastic alternating projection algorithms having linear convergence rates. In PatNec:17 stochastic proximal point type steps are combined with alternating projections for solving stochastic optimization problems with infinite intersection of sets. Stochastic forward-backward algorithms have also been applied to solve optimization problems with many constraints. However, the papers introducing those general algorithms focus on proving only asymptotic convergence results and do not derive convergence rates, or they assume the number of constraints is finite, which is more restrictive than our setting BiaHac:17; SheTeb:14; WanChe:15. In the case where the number of constraints is finite and the objective function is deterministic, Nesterov’s smoothing framework is studied in BotHen:13; OuyGra:12; TraFer:18 in the setting of accelerated proximal gradient methods. Incremental subgradient methods or primal-dual approaches were also proposed for solving convex optimization problems with finite intersection of simple sets through an exact penalty reformulation in Ber:11; KunBac:18.

The paper most related to our work is Ned:11, see also Pol:67; Pol:69; Pol:01, where iterative subgradient methods with random feasibility steps are proposed for solving convex problems with functional constraints. Our algorithms are minibatch extensions of the algorithm proposed in Ned:11. Moreover, in Ned:11 only sublinear convergence rates of order ${\cal O}(1/\sqrt{t})$ are established for convex objective functions, while in this paper we show that ${\cal O}(1/t)$ rates are valid under a relaxed strong convexity condition. Finally, since we deal with minibatching and a relaxed strong convexity assumption, our convergence analysis requires additional insights that differ from those of Ned:11. Similarly, in PatNec:17 a stochastic optimization problem with infinite intersection of sets is considered and stochastic proximal point steps are combined with alternating projections for solving it. However, in order to prove sublinear convergence rates ${\cal O}(1/t)$, PatNec:17 requires strongly convex and smooth objective functions, while our results are valid for a more relaxed strong convexity condition and possibly non-smooth functions. Lastly, PatNec:17 assumes the projectability of individual sets, whereas in our case the constraints might not be projectable.

Notation. The inner product of two vectors $x$ and $y$ in $\mathbb{R}^{n}$ is denoted by $\langle x,y\rangle$, while $\|x\|$ denotes the standard Euclidean norm. We write ${\rm dist}(\bar{x},X)$ for the distance of a vector $\bar{x}$ from a closed convex set $X$, i.e., ${\rm dist}(\bar{x},X)=\min_{x\in X}\|x-\bar{x}\|$, while $\Pi_{X}[\bar{x}]$ denotes the projection of $\bar{x}$ onto $X$, i.e., $\Pi_{X}[\bar{x}]=\mathop{\rm argmin}_{x\in X}\|x-\bar{x}\|^{2}$. For a scalar $a$, we write $a^{+}=\max\{a,0\}$. For a convex function $h$, we denote by $s_{h}(x)$ a subgradient of $h$ at $x$ and by $\partial h(x)$ the set of all subgradients of $h$ at $x$. If $h$ is differentiable at $x$, then its gradient is denoted by $\nabla h(x)$. We write $\mathsf{Pr}\left\{\omega\right\}$ and $\mathsf{E}\!\left[\omega\right]$ to denote respectively the probability distribution and the expectation of a random variable $\omega$. Finally, the big $\mathcal{O}$ notation, i.e. $f(t)\leq\mathcal{O}(g(t))$, means that there exist $C>0$ and $t_{0}$ such that $f(t)\leq C\cdot g(t)$ for all $t\geq t_{0}$.

Outline. The content of the paper is as follows. In Section 1.1 we introduce our problem of interest and the main assumptions. In Section 2 we propose a parallel random minibatch subgradient algorithm and derive its convergence rate, while in Section 3 the sequential variant is analyzed. Finally, in Section 4 we discuss some extensions, while in Section 5 we report some preliminary numerical results.

1.1 Problem formulation

In this paper we are interested in solving the following convex constrained minimization problem:

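$\displaystyle \min_{x}\ f(x)\quad\text{subject to}\quad x\in X,\qquad X\triangleq Y\cap\left(\cap_{\omega\in\mathscr{A}}X_{\omega}\right),\qquad X_{\omega}=\{x\in\mathbb{R}^{n}\mid g_{\omega}(x)\leq 0\}\ \text{ for every }\omega\in\mathscr{A},$

(1)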

where $\mathscr{A}$ is an arbitrary collection of indices and $Y$ is a closed convex set. The objective function $f$ and all constraint functions $g_{\omega}$ are assumed convex. We also assume that the optimization problem (1) has a finite optimum and we let $f^{*}$ and $X^{*}$ denote the optimal value and the optimal set, respectively,

$\displaystyle f^{*}=\inf_{x\in X}f(x),\qquad X^{*}=\{x\in X\mid f(x)=f^{*}\}.$

We work under the premise that the collection $\mathscr{A}$ is large, possibly infinite (even uncountable). Such problems have many applications in engineering, machine learning, computer science, operations research and finance MouBac:11 ; Vap:98 ; PatNec:17 ; BotHei:12 ; RocUry:00 . Let us now formally state the assumptions on the functions $f$ and $g_{\omega}$ , with $\omega\in\mathscr{A}$ , of problem (1).

Assumption 1

Let the following hold:

(a)

The set $Y$ is closed, convex and simple (i.e., easy for projection). The constraint set $X$ and the optimal set $X^{*}$ are non-empty.

(b)

The objective function $f:\mathbb{R}^{n}\to\bar{\mathbb{R}}$ is strongly convex on the set $Y$ with a constant $\mu>0$ , i.e.:

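$\displaystyle f(y)\geq f(x)+\langle s_{f}(x),y-x\rangle+\frac{\mu}{2}\|y-x\|^{2}\quad\forall x,y\in Y,\ s_{f}(x)\in\partial f(x).$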

The subgradients of the function $f$ are uniformly bounded on the set $Y$ , i.e., there is $M_{f}>0$ such that

$\displaystyle \|s_{f}(x)\|\leq M_{f}\quad\forall s_{f}(x)\in\partial f(x)\ \text{and}\ x\in Y.$

(c)

The functional constraints $g_{\omega}:\mathbb{R}^{n}\to\bar{\mathbb{R}}$ are convex, not necessarily differentiable, and have bounded subgradients on the set $Y$ , i.e., there is $M_{g}>0$ such that

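$\displaystyle \|d\|\leq M_{g}\quad\forall d\in\partial g_{\omega}(x),\ x\in Y\ \text{and}\ \omega\in\mathscr{A}.$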

We assume that the domains of definition of the functions $f$ and $g_{\omega}$ contain $Y$. It follows immediately from Assumption 1(b) that (see e.g., NecNes:15):

$\displaystyle f(x)-f^{*}\geq\frac{\mu}{2}\|x-x^{*}\|^{2}\quad\forall x\in X\subseteq Y.$

Note that the conditions of Assumption 1(b) may look contradictory since the following relations need to hold:

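$\displaystyle \frac{\mu}{2}\|x-x^{*}\|^{2}\leq f(x)-f^{*}\leq\langle s_{f}(x),x-x^{*}\rangle\leq M_{f}\|x-x^{*}\|\quad\forall x\in X,\ x^{*}\in X^{*},$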

where the second inequality follows from the convexity of $f$ and the third one from the Cauchy-Schwarz inequality. This implies that $\|x-x^{*}\|\leq 2M_{f}/\mu$ for any $x\in X\subseteq Y$. Note that this inequality is always valid provided that the set $Y$ is compact, and our optimization model (1) allows us to impose such an assumption on the set $Y$. Moreover, when the sets $X_{\omega}$ are simple for the projection operation, one may choose an alternative equivalent description of the constraint sets by letting $g_{\omega}(x)={\rm dist}(x,X_{\omega})$ for all $x\in\mathbb{R}^{n}$. Note that in this case $d(x)=\frac{x-\Pi_{X_{\omega}}[x]}{{\rm dist}(x,X_{\omega})}\in\partial g_{\omega}(x)$ for all $x\not\in X_{\omega}$. Furthermore, $\|d(x)\|=1$, thus the subgradients are bounded with $M_{g}=1$ in this case. Therefore, our approach is more general than those from most of the existing works, which usually assume projectability of each $X_{\omega}$ (see also the Related works paragraph in Section 1).
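The unit-norm property of the distance-function subgradient can be checked numerically. The sketch below is only an illustration and assumes a hypothetical constraint set, a Euclidean ball, which is not taken from the paper; the helper names `project_ball` and `dist_and_subgradient` are likewise made up for this example.

```python
import numpy as np

def project_ball(x, center, radius):
    """Euclidean projection of x onto the ball {y : ||y - center|| <= radius}."""
    diff = x - center
    norm = np.linalg.norm(diff)
    if norm <= radius:
        return x.copy()
    return center + radius * diff / norm

def dist_and_subgradient(x, center, radius):
    """Return g(x) = dist(x, X) and a subgradient d(x) of g at x (X is the ball)."""
    p = project_ball(x, center, radius)
    dist = np.linalg.norm(x - p)
    if dist == 0.0:
        # Inside X the distance function attains its minimum, so 0 is a subgradient.
        return 0.0, np.zeros_like(x)
    return dist, (x - p) / dist

# Hypothetical data: unit ball centered at the origin, query point outside it.
center = np.zeros(3)
x = np.array([3.0, 4.0, 0.0])
g_val, d = dist_and_subgradient(x, center, radius=1.0)
print(g_val, np.linalg.norm(d))
```

Outside the set, the returned subgradient is the normalized residual of the projection, so its norm equals one regardless of how far the point lies from the set, which is the reason $M_{g}=1$ works in this reformulation.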

2 Parallel random minibatch subgradient algorithm

To solve the convex problem with functional constraints (1), we first propose a subgradient method with parallel random minibatch feasibility updates. More precisely, our first algorithm is a parallel minibatch extension of the algorithm proposed in Ned:11 , leading to the following iterative process:

**Algorithm (parallel case)**

Choose $x_{0}\in Y$, minibatch size $N\geq 1$, and stepsizes $\alpha_{k}>0$ and $\beta>0$. For $k\geq 1$ repeat:

Draw $N$ samples $J_{k}=\{\omega_{k}^{1},\ldots,\omega_{k}^{N}\}\sim\textbf{P}$.

Compute the following updates:

$\displaystyle v_{k}=\Pi_{Y}[x_{k-1}-\alpha_{k-1}s_{f}(x_{k-1})],$

(3a)

$\displaystyle z_{k}^{i}=v_{k}-\beta\,\frac{g^{+}_{\omega_{k}^{i}}(v_{k})}{\|d_{k}^{i}\|^{2}}\,d_{k}^{i}\quad\hbox{for }i=1:N,$

(3b)

$\displaystyle x_{k}=\Pi_{Y}[\bar{z}_{k}],\quad\hbox{with }\bar{z}_{k}=\frac{1}{N}\sum_{i=1}^{N}z_{k}^{i}.$

(3c)

Here, $\alpha_{k}>0$ and $\beta>0$ are deterministic stepsizes; recall that $s_{f}(x)$ denotes a subgradient of $f$ at $x$ and $g_{\omega}^{+}(x)=\max\{g_{\omega}(x),0\}$. The method takes one subgradient step for the objective function, followed by $N$ feasibility updates in parallel, which are then averaged and projected onto the set $Y$. In a parallel implementation, we assume that $N+1$ cores are available on the same machine, of which one is designated as a central core; the central core sends $v_{k}$ to all other cores, which perform the update (3b) and send their updates to the central core; finally the central core performs the averaging step (3c) and the optimality step (3a). We note that at each feasibility update step $N$ random constraints are selected from the collection of the constraint sets according to the probability distribution P, i.e., the index variable $\omega_{k}^{i}$ is random with values in the set $\mathscr{A}$. The vector $d_{k}^{i}$ is chosen as $d_{k}^{i}\in\partial g^{+}_{\omega_{k}^{i}}(v_{k})$ if $g^{+}_{\omega_{k}^{i}}(v_{k})>0$ and $d_{k}^{i}=d$ for some $d\neq 0$ if $g^{+}_{\omega_{k}^{i}}(v_{k})=0$. When $g^{+}_{\omega_{k}^{i}}(v_{k})=0$, we have $z_{k}^{i}=v_{k}$ for any choice of $d\neq 0$. Note that the feasibility step (3b) has the special form of Polyak’s subgradient iteration, see e.g., Pol:67; Pol:01. Moreover, when the sets $X_{\omega}$ are projectable, one can choose $g_{\omega}(x)=g_{\omega}^{+}(x)={\rm dist}(x,X_{\omega})$ for all $x\in\mathbb{R}^{n}$ and the update (3b) becomes a usual projection step:

$\displaystyle z_{k}^{i}=v_{k}-\beta\left(v_{k}-\Pi_{X_{\omega_{k}^{i}}}[v_{k}]\right).$
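The updates (3a)-(3c) can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions that are not from the paper: a hypothetical objective $f(x)=\frac{1}{2}\|x-p\|^{2}$ (so $\mu=1$), a box for the simple set $Y$, finitely many affine constraints $a_{i}^{T}x+b_{i}\leq 0$ sampled uniformly with replacement, and the stepsize choice $\alpha_{k-1}=2/(\mu(k+1))$. The function returns the last iterate, whereas the convergence rates in the text concern weighted averages of the iterates.

```python
import numpy as np

def parallel_minibatch_subgradient(p, A, b, N, iters, radius=10.0,
                                   beta=1.0, mu=1.0, seed=0):
    """Sketch of steps (3a)-(3c) for f(x) = 0.5*||x - p||^2 and A x + b <= 0.

    Assumes every row of A is nonzero, so d_k^i is never the zero vector.
    """
    rng = np.random.default_rng(seed)

    def proj_Y(u):
        # Y is taken to be the box [-radius, radius]^n (an assumption).
        return np.clip(u, -radius, radius)

    x = proj_Y(np.zeros_like(p))                 # x_0 in Y
    for k in range(1, iters + 1):
        alpha = 2.0 / (mu * (k + 1))             # hypothetical diminishing stepsize
        # (3a): projected (sub)gradient step on the objective only.
        v = proj_Y(x - alpha * (x - p))
        # Draw a minibatch of N constraint indices, uniformly with replacement.
        idx = rng.integers(0, A.shape[0], size=N)
        # (3b): N independent Polyak feasibility steps, all starting from v.
        g_plus = np.maximum(A[idx] @ v + b[idx], 0.0)
        d = A[idx]                               # gradient of each affine constraint
        z = v - beta * (g_plus / np.sum(d * d, axis=1))[:, None] * d
        # (3c): average the N points and project back onto Y.
        x = proj_Y(z.mean(axis=0))
    return x

# Hypothetical instance: minimize 0.5*||x - (2, 2)||^2 subject to x_1 <= 1, x_2 <= 1.
p = np.array([2.0, 2.0])
A = np.eye(2)
b = -np.ones(2)
x_hat = parallel_minibatch_subgradient(p, A, b, N=2, iters=2000)
print(x_hat)  # should lie close to the constrained minimizer (1, 1)
```

For affine constraints with $\beta=1$, each step (3b) coincides with the exact projection onto the sampled halfspace, which matches the projection form of the update discussed above. The rows of `z` are independent of one another, so they could be computed on separate cores.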

The initial point $x_{0}\in Y$ is selected randomly with an arbitrary distribution. The projection on the set $Y$ in the updates (3a) and (3c) is used to ensure that each $v_{k}$ and $x_{k}$ remain in the set $Y$ , over which the functions $f$ and $g_{\omega}$ are assumed to have bounded subgradients. Our next assumption deals with the random variables $\omega_{k}^{i}$ for $i=1:N$ chosen according to the probability distribution P. For this, we introduce the sigma-field $\mathscr{F}_{k}$ induced by the history of the method, i.e., by the realizations of the initial point $x_{0}$ and the variables $\omega_{t}^{i}$ up to main iteration $k$ :

$\displaystyle \mathscr{F}_{k}=\{x_{0}\}\cup\{\omega_{t}^{j}\mid 1\leq t\leq k,\ 1\leq j\leq N\},$

which contains the same information as the set $\{x_{0}\}\cup\{\{v_{t},x_{t}\}\mid 1\leq t\leq k\}$ . For notational convenience, we will allow $k=0$ by letting $\mathscr{F}_{0}=\{x_{0}\}$ . We impose the following assumption.

Assumption 2

There exists a constant $c\in(0,\infty)$ such that

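$\displaystyle {\rm dist}^{2}(y,X)\leq c\cdot\mathsf{E}\!\left[(g^{+}_{\omega_{k}^{i}}(y))^{2}\mid\mathscr{F}_{k-1}\right]\quad\forall y\in Y,\ k\geq 1\ \text{and}\ i=1,\ldots,N.$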

Assumption 2 does not require that $J_{k}=\{\omega_{k}^{1},\ldots,\omega_{k}^{N}\}$ are conditionally independent, given $\mathscr{F}_{k-1}$. For example, when the collection $\mathscr{A}$ is finite, the indices can be selected randomly without replacement, i.e., given the realizations $\omega_{k}^{1}=j_{1},\ldots,\omega_{k}^{i-1}=j_{i-1}$, the index $\omega_{k}^{i}$ can be random with realizations in $\mathscr{A}\setminus\{j_{1},\ldots,j_{i-1}\}$. As another example, the index set $\mathscr{A}$ can be partitioned into $N$ disjoint sets, $\cup_{i=1}^{N}\mathscr{A}_{i}=\mathscr{A}$, and each $\omega_{k}^{i}$ can be uniformly distributed over the index set $\mathscr{A}_{i}$. Such a sampling allows for a parallel computation of all $z_{k}^{i}$ in the algorithm (3). One can also combine the preceding two possibilities, by using a smaller partition of the set $\mathscr{A}$ and, in each part, choosing the corresponding $\omega_{k}^{i}$ sequentially, without replacement. Assumption 2 is crucial in our convergence analysis of method (3). It summarizes all the information we need regarding the distributions of the random variables $\omega_{k}^{i}$ and the initial point $x_{0}$. A discussion on the equivalence between Assumption 2 and the linear regularity condition for the sets $(X_{\omega})_{\omega\in\mathscr{A}}$ can be found in Ned:10; Ned:11; NecRic:18. When each set $X_{\omega}$ is given by a linear inequality $a_{\omega}^{T}x+b_{\omega}\leq 0$, one can verify that the intersection of these halfspaces over any arbitrary index set $\mathscr{A}$ is linearly regular provided that the sequence $(a_{\omega})_{\omega\in\mathscr{A}}$ is bounded, see BurFer:93; FerNec:19. Hence, Assumption 2 is also satisfied in this case. Moreover, Assumption 2 holds provided that the intersection over the arbitrary index set $\mathscr{A}$ has an interior point Pol:01. However, Assumption 2 holds for more general sets, e.g., when a strengthened Slater condition holds for a collection of convex functional constraints $(X_{\omega})_{\omega\in\mathscr{A}}$, such as the generalized Robinson condition, as detailed in Corollary 2 of LewPan:98.
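The two sampling schemes mentioned above, drawing without replacement and drawing one index per block of a partition, might look as follows for a finite index set. The sizes `m` and `N` are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 12, 3  # hypothetical: 12 constraints, minibatch of size 3

# (i) Without replacement: the N indices of one minibatch are all distinct,
#     hence not conditionally independent of each other.
batch_no_repl = rng.choice(m, size=N, replace=False)

# (ii) Partitioned sampling: split the index set into N disjoint blocks and
#      draw the i-th index uniformly from the i-th block.
blocks = np.array_split(np.arange(m), N)
batch_partition = np.array([rng.choice(block) for block in blocks])

print(batch_no_repl, batch_partition)
```

In scheme (ii) each worker only needs access to its own block of constraints, which is convenient when the feasibility updates are computed in parallel.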

2.1 Preliminary results

In this section, we derive some preliminary results for later use in the convergence analysis of method (3). We start by recalling a basic property of the projection operation on a closed convex set $Y\subseteq\mathbb{R}^{n}$ Ned:10 :

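$\displaystyle \|\Pi_{Y}[v]-y\|^{2}\leq\|v-y\|^{2}-\|\Pi_{Y}[v]-v\|^{2}\quad\text{for any }v\in\mathbb{R}^{n}\text{ and }y\in Y.$

(4)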

We now show that the parameter $c$ in Assumption 2 satisfies the following inequality:

Lemma 1

Let Assumption 1(c) and Assumption 2 hold. Then, we have:

$\displaystyle cM_{g}^{2}\geq 1.$

Proof

Let $y\in Y$ be such that $y\not\in X$ . Then, there exists $\bar{\omega}\in\mathscr{A}$ such that the convex function $g_{\bar{\omega}}$ satisfies $g_{\bar{\omega}}(y)>0$ . Consequently, for any $s_{g}(y)\in\partial g_{\bar{\omega}}(y)$ we also have $s_{g}(y)\in\partial g_{\bar{\omega}}^{+}(y)$ , and using convexity of $g_{\bar{\omega}}^{+}$ , we obtain:

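$\displaystyle 0=g^{+}_{\bar{\omega}}(\Pi_{X}[y])\geq g^{+}_{\bar{\omega}}(y)+\langle s_{g}(y),\Pi_{X}[y]-y\rangle\geq g^{+}_{\bar{\omega}}(y)-M_{g}\|\Pi_{X}[y]-y\|,$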

or equivalently

$\displaystyle g^{+}_{\bar{\omega}}(y)\leq M_{g}\|\Pi_{X}[y]-y\|.$

On the other hand, for those $\omega\in\mathscr{A}$ for which $g_{\omega}(y)\leq 0$ we automatically have

$\displaystyle 0=g^{+}_{\omega}(y)\leq M_{g}\|\Pi_{X}[y]-y\|.$

In conclusion, for any $\omega\in\mathscr{A}$ there holds:

$\displaystyle g^{+}_{\omega}(y)\leq M_{g}\|\Pi_{X}[y]-y\|.$

Combining the preceding inequality and Assumption 2, we obtain:

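$\displaystyle {\rm dist}^{2}(y,X)\leq c\,\mathsf{E}\!\left[(g^{+}_{\omega_{k}^{i}}(y))^{2}\mid\mathscr{F}_{k-1}\right]\leq c\,\mathsf{E}\!\left[M_{g}^{2}\|\Pi_{X}[y]-y\|^{2}\mid\mathscr{F}_{k-1}\right]=cM_{g}^{2}\,{\rm dist}^{2}(y,X),$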

which proves our relation $cM_{g}^{2}\geq 1$ . $\square$
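As a sanity check on Lemma 1, consider a hypothetical one-dimensional instance with a single linear constraint (an illustration only, not an example from the paper). In this case the smallest admissible constant in Assumption 2 can be computed by hand, and the bound of the lemma holds with equality:

```latex
% Hypothetical instance: n = 1, Y = [-R, R] with R > 0, one constraint g(x) = a x, a > 0.
X = \{x \in Y \mid a x \le 0\} = [-R, 0], \qquad M_g = a,
\\
\mathrm{dist}^2(y, X) = (y^{+})^2, \qquad (g^{+}(y))^2 = a^2 (y^{+})^2 \qquad \forall y \in Y,
\\
\mathrm{dist}^2(y, X) \le c\,(g^{+}(y))^2 \quad \forall y \in Y
\iff c \ge \frac{1}{a^2}
\iff c\,M_g^2 \ge 1 .
```

With only one constraint the conditional expectation in Assumption 2 is trivial, so every valid $c$ satisfies $c\geq 1/a^{2}$ and the inequality $cM_{g}^{2}\geq 1$ cannot be improved in general.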

We now derive a relation between the iterates $v_{k+1}$ and $x_{k}$ .

Lemma 2

Let Assumptions 1(a) and 1(b) hold. Let $v_{k+1}$ be obtained via equation (3a) for a given $x_{k}\in Y$ . Then, for the unique optimal solution $x^{*}$ of the problem (1) and $\rho\in(0,\;1)$ , we have:

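$\displaystyle \|v_{k+1}-x^{*}\|^{2}+2\alpha_{k}(1-\rho)(f(\Pi_{X}[x_{k}])-f^{*})\leq(1-\alpha_{k}\rho\mu)\|x_{k}-x^{*}\|^{2}+2\alpha_{k}M_{f}\|\Pi_{X}[x_{k}]-x_{k}\|+\alpha_{k}^{2}M_{f}^{2}.$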

Proof

Using the standard analysis of the projected subgradient method and the fact that the subgradients of $f$ are uniformly bounded on $Y$ , we have for the optimal solution $x^{*}$ of (1) the following inequality, see e.g., Pol:67 ; Pol:69 :

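$\displaystyle \|v_{k+1}-x^{*}\|^{2}\leq\|x_{k}-x^{*}\|^{2}-2\alpha_{k}(f(x_{k})-f(x^{*}))+\alpha_{k}^{2}M_{f}^{2}.$

(5)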

We provide a lower bound on $f(x_{k})-f(x^{*})$. We consider two choices: one is based on the strong convexity of $f$ and the other on introducing an intermediate point. By the strong convexity of $f$, we have

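$\displaystyle\begin{aligned}f(x_{k})-f(x^{*})&\geq\langle s_{f}(x^{*}),x_{k}-x^{*}\rangle+\frac{\mu}{2}\|x_{k}-x^{*}\|^{2}\\&=\langle s_{f}(x^{*}),\Pi_{X}[x_{k}]-x^{*}\rangle+\langle s_{f}(x^{*}),x_{k}-\Pi_{X}[x_{k}]\rangle+\frac{\mu}{2}\|x_{k}-x^{*}\|^{2}\\&\geq\langle s_{f}(x^{*}),x_{k}-\Pi_{X}[x_{k}]\rangle+\frac{\mu}{2}\|x_{k}-x^{*}\|^{2}\\&\geq-M_{f}\|\Pi_{X}[x_{k}]-x_{k}\|+\frac{\mu}{2}\|x_{k}-x^{*}\|^{2},\end{aligned}$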

where the second inequality follows from the optimality conditions for $x^{*}$ and the last inequality follows from the Cauchy-Schwarz inequality and the boundedness of the subgradients of $f$ on $Y$. The other choice consists of adding and subtracting $f(\Pi_{X}[x_{k}])$, which yields

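$\displaystyle\begin{aligned}f(x_{k})-f(x^{*})&=f(x_{k})-f(\Pi_{X}[x_{k}])+f(\Pi_{X}[x_{k}])-f(x^{*})\\&\geq-\|s_{f}(\Pi_{X}[x_{k}])\|\,\|\Pi_{X}[x_{k}]-x_{k}\|+f(\Pi_{X}[x_{k}])-f(x^{*}),\end{aligned}$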

where the last inequality follows by the convexity of $f$ and the Cauchy-Schwarz inequality. By Assumption 1(b), the subgradients of $f$ are uniformly bounded on $Y$ and hence, also on $X$ , implying that

$\displaystyle f(x_{k})-f(x^{*})\geq f(\Pi_{X}[x_{k}])-f(x^{*})-M_{f}\|\Pi_{X}[x_{k}]-x_{k}\|.$

(7)

We now let $\rho\in(0,1)$ be arbitrary. By multiplying the lower bound obtained above from strong convexity with $\rho$ and relation (7) with $(1-\rho)$, and by adding the resulting relations, we obtain

[TABLE]

By using the preceding estimate in relation (5), we obtain

[TABLE]

and after re-arranging some of the terms we get the relation of the lemma. $\square$
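The last three steps of the proof can be written out explicitly. The following is a sketch reconstructed from the relations quoted in the text (the strong-convexity lower bound, relation (7), and relation (5)), with $f(x^{*})=f^{*}$; it is meant as a reading aid rather than the authors' exact display.

```latex
% rho * (strong-convexity lower bound) + (1 - rho) * (relation (7)):
f(x_k) - f(x^*) \;\ge\; (1-\rho)\bigl(f(\Pi_X[x_k]) - f^*\bigr)
   - M_f \|\Pi_X[x_k] - x_k\| + \frac{\rho\mu}{2}\|x_k - x^*\|^2 .
\\
% Substituting this lower bound into relation (5):
\|v_{k+1} - x^*\|^2 \;\le\; \|x_k - x^*\|^2
   - 2\alpha_k\Bigl[(1-\rho)\bigl(f(\Pi_X[x_k]) - f^*\bigr)
   - M_f\|\Pi_X[x_k]-x_k\| + \frac{\rho\mu}{2}\|x_k-x^*\|^2\Bigr]
   + \alpha_k^2 M_f^2 .
\\
% Moving the function-value term to the left-hand side:
\|v_{k+1} - x^*\|^2 + 2\alpha_k(1-\rho)\bigl(f(\Pi_X[x_k]) - f^*\bigr)
   \;\le\; (1-\alpha_k\rho\mu)\|x_k - x^*\|^2
   + 2\alpha_k M_f\|\Pi_X[x_k]-x_k\| + \alpha_k^2 M_f^2 .
```

The coefficient $(1-\alpha_{k}\rho\mu)$ arises from combining $\|x_{k}-x^{*}\|^{2}$ with $-2\alpha_{k}\cdot\frac{\rho\mu}{2}\|x_{k}-x^{*}\|^{2}$, and the final line is the statement of Lemma 2.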

Remark 1

The best choice for the parameter $\rho$ is not apparent at this point. Keeping $\rho<1$ is important so that the function value appears in the expression, but the simple choice $\rho=\frac{1}{2}$ may well suffice.

We next state a result that will be used to provide a basic relation between the iterates $v_{k}$ and $x_{k-1}$ . The relation is stated in a generic form, and its proof can be found in Pol:69 ; Pol:67 .

Lemma 3

(Pol:69; Pol:67)

Let $g$ be a convex function over a closed convex set $Z$, and let $y$ be given by

[TABLE]

where $d\neq 0$ . Then, for any $\bar{z}\in Z$ such that $g^{+}(\bar{z})=0$ , we have

[TABLE]

In the analysis, we will also make use of the relation for averages, stating that for given vectors $u_{1},\ldots,u_{N}\in\mathbb{R}^{n}$ and their average $\bar{u}=\frac{1}{N}\sum_{i=1}^{N}u_{i}$ , the following relation is valid for any vector $w\in\mathbb{R}^{n}$ :

[TABLE]
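A relation for averages of this kind can be checked numerically. The sketch below assumes that the relation meant here is the standard variance decomposition $\|\bar{u}-w\|^{2}=\frac{1}{N}\sum_{i=1}^{N}\|u_{i}-w\|^{2}-\frac{1}{N}\sum_{i=1}^{N}\|u_{i}-\bar{u}\|^{2}$; this form is an assumption based on how the relation is used later, and the random test vectors are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.normal(size=(5, 3))   # rows are the vectors u_1, ..., u_N (N = 5, n = 3)
w = rng.normal(size=3)        # arbitrary reference vector
u_bar = U.mean(axis=0)        # average of the u_i

lhs = np.sum((u_bar - w) ** 2)
rhs = (np.mean(np.sum((U - w) ** 2, axis=1))
       - np.mean(np.sum((U - u_bar) ** 2, axis=1)))
print(lhs, rhs)  # the two values should agree up to rounding
```

The subtracted term measures the spread of the $u_{i}$ around their average, which is why averaging several feasibility updates can yield a larger decrease than a single update.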

Now we provide a basic relation for the iterate $x_{k}$ upon completion of the $N$ randomly sampled feasibility updates.

Lemma 4

Let Assumption 1(a) hold. Let $x_{k}$ be obtained via updates (3b) and (3c) for a given $v_{k}\in Y$ and $\beta>0$. Then, the following relation holds:

[TABLE]

where $V_{N}(v_{k})$ is the total variation of the minibatch subgradients, i.e.,

[TABLE]

Proof

By the projection property (4) and the definition of $x_{k}$ , we have for any $y\in X$ that:

[TABLE]

By the definition we have $\bar{z}_{k}=\frac{1}{N}\sum_{i=1}^{N}z_{k}^{i}$ . Thus, by using relation (11) for the collection $z_{k}^{1},\ldots,z_{k}^{N}$ , we have for any $w\in\mathbb{R}^{n}$ ,

[TABLE]

Letting $w=y$ in the preceding relation and combining the resulting relation with (12), we obtain

[TABLE]

Now, we use the definition of the iterates $z_{k}^{i}$ in algorithm (3) and Lemma 3, with $Z=\mathbb{R}^{n}$ . Thus, we obtain for any $y\in X$ (for which we would have $g_{\omega_{k}^{i}}^{+}(y)=0$ for any realization of $\omega_{k}^{i}$ ) and for any $i=1,\ldots,N$ ,

[TABLE]

Hence, it follows that for any $y\in X$ ,

[TABLE]

From the definition of the iterates $z_{k}^{i}$ in algorithm (3), we see that

[TABLE]

By defining

[TABLE]

we have

[TABLE]

Therefore, we obtain for any $y\in X$ ,

[TABLE]

The statement of the lemma follows by letting $y=\Pi_{X}[v_{k}]$ in the preceding relation and using the fact that $\|x_{k}-\Pi_{X}[x_{k}]\|\leq\|x_{k}-\Pi_{X}[v_{k}]\|$. $\square$

Let us define the following parameters:

[TABLE]

From Jensen’s inequality it follows that $L_{N}^{k}\leq 1$ . However, there are also convex functions $g_{\omega}$ such that $L_{N}^{k}<1$ . We postpone the derivation of such examples of functional constraints satisfying condition $L_{N}^{k}<1$ until Section 2.3. The parameter $L_{N}\leq 1$ will play a key role in our derivations below. In particular, we obtain the following simplification for Lemma 4.

Lemma 5

Let Assumptions 1(a) and 1(c) hold. Let $L_{N}\leq 1$ as defined in (15) and $x_{k}$ be obtained via updates (3b) and (3c) for a given $v_{k}\in Y$ and extrapolated stepsize $\beta\in(0,\,2/L_{N})$ . Then, the following relation holds:

[TABLE]

Proof

Note that the total variation of the minibatch subgradients $V_{N}(v_{k})$ can be written equivalently as:

[TABLE]

Using the previous expression of $V_{N}$ and the definitions of $L_{N}^{k}$ and $L_{N}$ from (15) in Lemma 4, we get:

[TABLE]

By Assumption 1(c), each function $g_{i}$ has bounded subgradients uniformly on $Y$. Hence, we have $\|d_{k}^{i}\|\leq M_{g}$, which, used in the previous inequality, implies the statement of the lemma. $\square$

Note that the previous result shows that we can use an extrapolated stepsize $\beta\in(0,2/L_{N})$ in the minibatch setting instead of the typical $\beta\in(0,2)$ used, e.g., in Ned:11. Clearly, when $L_{N}<1$ we have $2/L_{N}>2$ and, consequently, such extrapolation will accelerate the convergence of the parallel algorithm. This can also be observed in simulations (see, e.g., Fig. 3 below). Moreover, the largest decrease in Lemma 5 is obtained by maximizing $\beta(2-\beta L_{N})$, that is, the optimal stepsize is $\beta=1/L_{N}$. We now combine Lemma 2 and Lemma 5 to provide a basic relation for the subsequent analysis.
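The claim that $\beta=1/L_{N}$ is optimal can be checked directly. The short derivation below is a sketch that uses only $L_{N}>0$ and the concavity of the quadratic $\phi(\beta)=\beta(2-\beta L_{N})$; the symbol $\phi$ is introduced here for illustration and does not appear in the original text.

```latex
\begin{align*}
\phi(\beta) &= \beta(2-\beta L_{N}) = 2\beta - L_{N}\beta^{2}, \\
\phi'(\beta) &= 2 - 2L_{N}\beta = 0 \iff \beta = \frac{1}{L_{N}}, \\
\phi\!\left(\frac{1}{L_{N}}\right) &= \frac{1}{L_{N}}, \qquad
\phi(\beta) > 0 \iff \beta\in\left(0,\frac{2}{L_{N}}\right), \\
\frac{1}{L_{N}} - \phi(1) &= \frac{(1-L_{N})^{2}}{L_{N}} \geq 0 .
\end{align*}
```

The last line compares the extrapolated choice $\beta=1/L_{N}$ with the non-extrapolated choice $\beta=1$: the two coincide only when $L_{N}=1$.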

Lemma 6

Consider the method in (3), and let Assumption 1 hold. Let the stepsize $\alpha_{k}$ be such that $1-\frac{\alpha_{k}\mu}{2}>0$ for all $k\geq 0$ and stepsize $\beta\in(0,\;2/L_{N})$ , with $L_{N}\leq 1$ defined in (15). Then, the iterates of the method (3) satisfy the following recurrence for the optimal solution $x^{*}$ and for all $k\geq 0$ :

[TABLE]

where $\eta>0$ is arbitrary.

Proof

Let $x^{*}\in X$ be the unique optimal solution of problem (1). Then, we use Lemma 2 for $\rho=\frac{1}{2}$ so that for all $k\geq 0$ , we have

[TABLE]

Using the same reasoning as in the proof of Lemma 5 for the inequality (14) with $y=x^{*}$ gives:

[TABLE]

Combining the preceding two relations yields

[TABLE]

We next approximate the term that is linear in $\alpha_{k}$ , i.e. $2\alpha_{k}M_{f}\|\Pi_{X}[x_{k}]-x_{k}\|$ , with a sum of two quadratic terms, one of which is in the order of $\alpha_{k}^{2}$ , as:

[TABLE]

for any arbitrary $\eta>0$ . Substituting the preceding estimate in (16), we obtain the stated relation. $\square$
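The estimate used in this last step is presumably the standard inequality $2ab\leq\eta a^{2}+b^{2}/\eta$. A sketch of how it applies, under the assumption that $a=\|\Pi_{X}[x_{k}]-x_{k}\|$ and $b=\alpha_{k}M_{f}$ (the arrangement of constants in the paper's own display may differ):

```latex
\begin{align*}
0 \leq \left(\sqrt{\eta}\,a - \frac{b}{\sqrt{\eta}}\right)^{2}
  &= \eta a^{2} - 2ab + \frac{b^{2}}{\eta}
  \quad\Longrightarrow\quad 2ab \leq \eta a^{2} + \frac{b^{2}}{\eta}, \\
2\alpha_{k}M_{f}\,\|\Pi_{X}[x_{k}]-x_{k}\|
  &\leq \eta\,\|\Pi_{X}[x_{k}]-x_{k}\|^{2} + \frac{\alpha_{k}^{2}M_{f}^{2}}{\eta}
  \qquad\text{for any } \eta>0 .
\end{align*}
```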

2.2 Convergence rates

In this section we derive the convergence rates of Algorithm (3). For this, we first provide a recurrence relation for the iterates in expectation, which is the key relation for our convergence rate results. Note that $cM_{g}^{2}\geq 1$ according to Lemma 1 and $L_{N}\in(0,1]$ . In the sequel we provide a detailed convergence analysis for the non-trivial case $cM_{g}^{2}L_{N}>1$ . The other case, i.e. $cM_{g}^{2}L_{N}\leq 1$ , implies almost sure feasibility for any $x_{t}$ generated by the parallel algorithm, with $t\geq 1$ , and it will be discussed in Remark 3.

Theorem 2.1

Consider the iterative process (3), and let Assumption 1 and Assumption 2 hold. Let the stepsizes $\alpha_{k}$ be such that $1-\frac{\alpha_{k}\mu}{2}>0$ for all $k\geq 0$ and $\beta\in(0,\;2/L_{N})$ , with $L_{N}\leq 1$ defined in (15), and assume $cM_{g}^{2}L_{N}>1$ . Then, for the algorithm (3), by defining $q_{N}=\frac{\beta(2-\beta L_{N})}{cM_{g}^{2}}<1$ , we have almost surely for all $k\geq 0$ ,

[TABLE]

Proof

From Lemma 6, by taking the conditional expectation on the past $\mathscr{F}_{k-1}$ , we have almost surely for all $k\geq 0$ ,

[TABLE]

where $\eta>0$ is arbitrary. By Assumption 2, it follows that

[TABLE]

Hence

[TABLE]

Taking the conditional expectation on the past $\mathscr{F}_{k-1}$ in the relation of Lemma 5, and using relation (20), we obtain almost surely

[TABLE]

where we denote

[TABLE]

Recall that we assume $cM_{g}^{2}L_{N}>1$; hence $q_{N}<1$ (since $\max_{\beta}\beta(2-\beta L_{N})=1/L_{N}$) and $1-q_{N}>0$. Dividing by $1-q_{N}$, we further obtain

[TABLE]

Substituting the preceding estimate in relation (20) yields

[TABLE]

We now use estimate (23) in relation (17), and thus obtain

[TABLE]

By the definition of $q_{N}$ (see (22)), we have

[TABLE]

Hence,

[TABLE]

and by letting $\eta=\frac{1}{2}\left(1-\frac{\alpha_{k}\mu}{2}\right)\frac{q_{N}}{1-q_{N}}>0$ , the desired relation follows. $\square$

We now turn our attention to the stepsize $\alpha_{k}$ . We consider $\alpha_{k}$ of the form:

$\displaystyle \alpha_{k}=\frac{2}{\mu}\,\gamma_{k}$

for some diminishing sequence $\gamma_{k}$ as detailed below. Indeed, for this choice, the recurrence from Theorem 2.1 becomes:

[TABLE]

where we recall that $q_{N}=\frac{\beta(2-\beta L_{N})}{cM_{g}^{2}}$. Let $\gamma_{k}$ be given by

$\displaystyle \gamma_{k}=\frac{2}{k+1}.$

Since the sequence $\gamma_{k}$ is decreasing, we have

[TABLE]

implying that

[TABLE]

Using this estimate in (24), we obtain

[TABLE]

Next, we note that

[TABLE]

Dividing the recurrence obtained above from (24) by $\gamma_{k}^{2}$ and using the preceding inequality, we have for all $k\geq 1$, after taking total expectations and rearranging terms:

[TABLE]

Summing these over $k=1,\ldots,t$ , for some $t>0$ , we obtain

[TABLE]

Using the definition of $\gamma_{k}$ , (28) implies

[TABLE]

We finally obtain by the linearity of the expectation operation:

[TABLE]

Define for $t\geq 1$ the sum

$\displaystyle S_{t}=\sum_{k=1}^{t}(k+1)^{2}.$

Define also the following weighted averages (convex combinations)

[TABLE]

with $a_{k}=\frac{(k+1)^{2}}{S_{t}}$ , hence satisfying $\sum_{k=1}^{t}a_{k}=1$ . Using convexity of the function $f$ and of the norm-squared, we have

[TABLE]

If we define $b_{N}^{p}=q_{N}(1-q_{N})^{-1}=(1-q_{N})^{-1}-1$ , then (33) becomes:

[TABLE]

The next theorem summarizes the convergence rates that follow from the previous discussion. For simplicity of exposition, we omit the constants and express the rates only in terms of the dominant powers of $t$:

Theorem 2.2

Let Assumption 1 and Assumption 2 hold, and let the stepsizes be $\alpha_{k}=\frac{4}{\mu(k+1)}$ and $\beta\in(0,\;2/L_{N})$, with $L_{N}\leq 1$ defined in (15). Assume also that $cM_{g}^{2}L_{N}>1$. Then, $q_{N}=\frac{\beta(2-\beta L_{N})}{cM_{g}^{2}}<1$ and $b_{N}^{p}=(1-q_{N})^{-1}-1$. Moreover, the following sublinear rates for suboptimality and feasibility violation hold for the average sequence $\hat{x}_{t}$ generated by the parallel algorithm (3):

[TABLE]

Proof

From the recurrence (35), omitting the constants but keeping the terms depending on $b_{N}^{p}=(1-q_{N})^{-1}-1$ , we get the following convergence rates in terms of these weighted averages $\hat{w}_{t}$ and $\hat{x}_{t}$ :

[TABLE]

Since $\hat{w}_{t}\in X$, using Jensen’s inequality we get the following convergence rate for the feasibility violation of the constraints:

[TABLE]

Since $\hat{x}_{t}\in Y$ and $\hat{w}_{t}\in X\subset Y$ , by the subgradient boundedness of $f$ on $Y$ , it follows that

[TABLE]

which, combined with $\mathsf{E}\!\left[f(\hat{w}_{t})-f^{*}\right]\leq{\cal O}\left(\frac{1}{t}+\frac{1}{b_{N}^{p}t}\right)$, also yields the following convergence rate for suboptimality

[TABLE]

which proves our theorem. $\square$
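The rates above are stated for weighted averages with weights $a_{k}=(k+1)^{2}/S_{t}$. These averages need not be formed after the fact; the following minimal sketch accumulates them incrementally. It assumes that iterates are stored as plain Python lists, and the function name `weighted_average` is hypothetical, not taken from the paper.

```python
def weighted_average(points):
    """Return sum_k a_k * x_k with a_k = (k+1)^2 / S_t for k = 1..t.

    `points` is the list [x_1, ..., x_t]; each x_k is a list of floats.
    (Illustrative helper, not part of the paper.)
    """
    if not points:
        raise ValueError("need at least one iterate")
    dim = len(points[0])
    total_weight = 0.0      # running value of S_k = sum_{j<=k} (j+1)^2
    avg = [0.0] * dim       # running weighted average
    for k, x in enumerate(points, start=1):
        w = float((k + 1) ** 2)
        total_weight += w
        step = w / total_weight
        # Incremental update: avg <- avg + (w / S_k) * (x - avg)
        avg = [a + step * (xi - a) for a, xi in zip(avg, x)]
    return avg
```

Because later iterates receive quadratically larger weights, the average is dominated by the most recent points; the same routine applies to the projected points $\Pi_{X}[x_{k}]$ when they are available.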

We observe that the convergence estimate for the feasibility violation depends explicitly on the minibatch size $N$ via the key parameter $L_{N}$. For the optimal stepsize $\beta=1/L_{N}$ we get $q_{N}=1/(cM_{g}^{2}L_{N})$ and $b_{N}^{p}=1/(cM_{g}^{2}L_{N}-1)$. Hence, $b_{N}^{p}$ is large provided that $L_{N}$ is small ($L_{N}\ll 1$). Note that if $L_{N}=1$, then $b_{N}^{p}$ does not depend on $N$ and hence the complexity does not improve with the minibatch size $N$. However, when $L_{N}<1$ (and it can even be the case that $L_{N}\sim 0$), $b_{N}^{p}$ becomes large, which shows that minibatching improves the complexity. To the best of our knowledge, this is the first time that a subgradient method with random minibatch feasibility updates is shown to be better than its non-minibatch variant. We have identified $L_{N}$ as the key quantity determining whether minibatching helps ($L_{N}<1$) or not ($L_{N}=1$), and by how much (the smaller $L_{N}$, the more it helps). Note also that the suboptimality estimate contains a term which does not depend on the minibatch size $N$, unlike the feasibility violation estimate. This is natural, since the minibatch feasibility steps have no effect on the minimization step of the objective function.
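To make the dependence on $L_{N}$ concrete, the sketch below evaluates $q_{N}$ and $b_{N}^{p}=q_{N}/(1-q_{N})$ from the formulas in Theorem 2.2. The function name and the argument `c_mg_sq` (standing for the product $cM_{g}^{2}$) are illustrative assumptions, and the numerical values in the usage note are made up for the example.

```python
def parallel_constants(beta, l_n, c_mg_sq):
    """Return (q_N, b_N^p) for the parallel scheme.

    beta    : feasibility stepsize, required to lie in (0, 2 / L_N)
    l_n     : minibatch parameter L_N in (0, 1]
    c_mg_sq : the product c * M_g^2, required to satisfy c_mg_sq * L_N > 1
    (Illustrative helper, not part of the paper.)
    """
    if not 0.0 < l_n <= 1.0:
        raise ValueError("L_N must lie in (0, 1]")
    if not 0.0 < beta < 2.0 / l_n:
        raise ValueError("beta must lie in (0, 2 / L_N)")
    if c_mg_sq * l_n <= 1.0:
        raise ValueError("the non-trivial case requires c * M_g^2 * L_N > 1")
    q = beta * (2.0 - beta * l_n) / c_mg_sq
    b = q / (1.0 - q)       # equals (1 - q)^{-1} - 1
    return q, b
```

For instance, with $cM_{g}^{2}=4$ and the optimal stepsize $\beta=1/L_{N}$, the call `parallel_constants(1.0, 1.0, 4.0)` gives $b_{N}^{p}=1/3$, while `parallel_constants(2.0, 0.5, 4.0)` gives $b_{N}^{p}=1$, in agreement with $b_{N}^{p}=1/(cM_{g}^{2}L_{N}-1)$.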

Remark 2

Note that the convergence rates ${\cal O}\left(\frac{1}{t}\right)$ for feasibility and suboptimality are known to be optimal for the stochastic subgradient method for solving the optimization problem (1) under Assumption 1, see NemYud:83 ; Nes:04 . Moreover, the iterative process (3) does not require knowledge of the subgradient norm bounds $M_{f}$ and $M_{g}$ from Assumption 1, nor of the constant $c$ from Assumption 2. These values affect only the constants in the convergence rates; they are not needed for the stepsize selection. The stepsize $\alpha_{k}$ requires only an estimate of the strong convexity constant $\mu$. Moreover, since $L_{N}\leq 1$, we can use, e.g., a stepsize $\beta\in(0,\;2)\subseteq(0,\;2/L_{N})$. Of course, a larger stepsize $\beta$ leads to faster convergence. Hence, if $L_{N}<1$ and it can be computed, then we should choose an extrapolated steplength $\beta=(2-\delta)/L_{N}$ for some small $\delta\in(0,2)$. When $L_{N}$ cannot be computed explicitly, we propose to approximate it online with $L_{N}^{k}$, and to use at each iteration an adaptive extrapolated stepsize $\beta_{k}$ of the form $\beta_{k}=(2-\delta)/L_{N}^{k}$ for some $\delta\in(0,\;2)$ (see also the discussion in Section 4, equation (48)).

Remark 3

The convergence rates from Theorem 2.2 hold for the non-trivial case $q_{N}=\frac{\beta(2-\beta L_{N})}{cM_{g}^{2}}<1$ . Note that the inequality $q_{N}<1$ is always satisfied, provided that $cM_{g}^{2}L_{N}>1$ . On the other hand, the case $q_{N}\geq 1$ (e.g., $cM_{g}^{2}L_{N}\leq 1$ and $\beta=1/L_{N}$ ) turns out to be the ideal case, since then we have from (21) that $\mathsf{E}\!\left[{\rm dist}^{2}(x_{k},X)\mid\mathscr{F}_{k-1}\right]\leq 0\quad\forall k\geq 1$ . Therefore, in this ideal case we achieve almost sure feasibility for the sequence $x_{t}$ generated by the parallel algorithm (see (3)) after one step:

[TABLE]

Using this feasibility relation in the same derivations from Section 2.2 we also get a suboptimality estimate for the average sequence $\hat{x}_{t}$ as in Theorem 2.2:

[TABLE]

Clearly, from Jensen’s inequality we also have almost sure feasibility for the average sequence $\hat{x}_{t}$ :

[TABLE]

We skip these details since the proof is the same as for the non-trivial case.

2.3 Example of functional constraints having $L_{N}<1$

Let us recall the definition of the parameters $L_{N}^{k}$ and $L_{N}$ from (15):

[TABLE]

From Jensen’s inequality we have $L_{N}^{k}\leq 1$ and consequently $L_{N}\leq 1$ . On the other hand, Theorem 2.2 shows that $L_{N}\ll 1$ is beneficial for a subgradient scheme with minibatch feasibility updates. In this section we provide an example of functional constraints $g_{\omega}$ for which $L_{N}<1$ . Let us consider $m$ linear inequality constraints for the convex problem (1):

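$\displaystyle g_{\omega}(x)=a_{\omega}^{T}x+b_{\omega}\leq 0\quad\forall\omega\in\mathscr{A}=\{1,\ldots,m\}.$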

Without loss of generality we assume $\|a_{\omega}\|=1$ for all $\omega$ . Let us define the matrix $A=[a_{1}\cdots a_{m}]^{T}$ and the subset of indexes selected at the current iteration $J_{k}=\{\omega^{1}_{k}\cdots\omega^{N}_{k}\}\subset\mathscr{A}$ . We also denote $J_{k}^{+}=\{\omega\in J_{k}:a_{\omega}^{T}v_{k}+b_{\omega}>0\}$ and denote $A_{J_{k}^{+}}$ the submatrix of $A$ having the rows indexed in the set $J_{k}^{+}$ . With these notations and using that $\|a_{\omega}\|=1$ for all $\omega$ , then $L_{N}^{k}$ can be written explicitly as (assuming that $|J_{k}^{+}|\geq 1$ ):

[TABLE]

where the first inequality follows from the definition of the maximal eigenvalue $\lambda_{\max}$ of a matrix, the second inequality follows from the fact that $J_{k}^{+}\subseteq J_{k}$, and the third inequality holds strictly provided that the submatrix $A_{J_{k}}$ has rank at least two. In conclusion, if, e.g., the matrix $A$ has full row rank and we consider a sampling of $J_{k}$ based on a given probability $\textbf{P}$, then $L_{N}$ satisfies:

[TABLE]

Note that for particular sampling rules we can compute $L_{N}$ efficiently, such as when we consider a uniform distribution over a fixed partition of $\mathscr{A}=\cup_{i=1}^{\ell}J_{i}$ of equal size. The reader may find other examples of functional constraints satisfying $L_{N}<1$ and we believe that this paper opens a window of opportunities for algorithmic research in this direction.
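For linear constraints the quantity $L_{N}^{k}$ is cheap to evaluate. The sketch below assumes that (15) defines $L_{N}^{k}$ as the ratio between the squared norm of the averaged step $\frac{1}{N}\sum_{i}\frac{g^{+}_{\omega_{k}^{i}}(v_{k})}{\|d_{k}^{i}\|^{2}}d_{k}^{i}$ and the average of the squared step lengths, which is the form consistent with the Jensen bound $L_{N}^{k}\leq 1$ quoted above. The function name and the pure-Python data layout are illustrative choices, not taken from the paper.

```python
def minibatch_ratio(rows, offsets, v):
    """L_N^k for linear constraints a_i^T x + b_i <= 0 with unit-norm rows a_i.

    rows    : list of N unit-norm vectors a_i (lists of floats)
    offsets : list of N scalars b_i
    v       : current point v_k
    Returns None when no sampled constraint is violated at v.
    (Assumed form of definition (15); illustrative helper.)
    """
    n_batch = len(rows)
    # Positive parts r_i = max(a_i^T v + b_i, 0).
    residuals = [max(sum(a_j * v_j for a_j, v_j in zip(a, v)) + b, 0.0)
                 for a, b in zip(rows, offsets)]
    denom = sum(r * r for r in residuals) / n_batch
    if denom == 0.0:
        return None
    # Averaged direction (1/N) * sum_i r_i * a_i.
    mean_dir = [sum(r * a[j] for r, a in zip(residuals, rows)) / n_batch
                for j in range(len(v))]
    numer = sum(d * d for d in mean_dir)
    return numer / denom
```

Two extreme cases illustrate the range: for two orthonormal rows that are both violated the ratio is $1/N=0.5$, whereas for two identical rows with equal violation it is $1$, so minibatching brings no gain in the latter case.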

3 Sequential random minibatch subgradient algorithm

In this section we consider a sequential variant of the algorithm (3) defined in terms of the following iterative process:

**Algorithm (sequential case)**

Choose $x_{0}\in Y$ , minibatch size $N\geq 1$ , and stepsizes $\alpha_{k}>0$ and $\beta>0$ . For $k\geq 1$ repeat:

Draw $N$ samples $J_{k}=\{\omega_{k}^{1},\ldots,\omega_{k}^{N}\}\sim\textbf{P}$ .

Compute the following updates:

$\displaystyle v_{k}=\Pi_{Y}[x_{k-1}-\alpha_{k-1}s_{f}(x_{k-1})],$ (38a)

$\displaystyle z_{k}^{0}=v_{k},\;z_{k}^{i}=\Pi_{Y}\!\left[z_{k}^{i-1}-\beta\,\frac{g^{+}_{\omega_{k}^{i}}(z_{k}^{i-1})}{\|d_{k}^{i}\|^{2}}\,d_{k}^{i}\right]\;\text{for }i=1,\ldots,N,$ (38b)

$\displaystyle x_{k}=z_{k}^{N}.$ (38c)

This method takes, as in the parallel variant, one subgradient step for the objective function, followed by $N$ sequential feasibility updates. As before, the vector $d_{k}^{i}$ is chosen as $d_{k}^{i}\in\partial g^{+}_{\omega_{k}^{i}}(z_{k}^{i-1})$ if $g^{+}_{\omega_{k}^{i}}(z_{k}^{i-1})>0$ , and $d_{k}^{i}=d$ for some $d\neq 0$ if $g^{+}_{\omega_{k}^{i}}(z_{k}^{i-1})=0$ . Note that in this variant, the feasibility updates use the projection on $Y$ in order to confine the intermediate iterates $z_{k}^{i}$ and $x_{k}$ to the set $Y$ , where the functions $g_{\omega}$ and $f$ (for the last step) are assumed to have uniformly bounded subgradients.
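The sequential scheme (38) can be sketched in a few lines. The code below is a minimal illustration rather than the authors' implementation; it assumes linear constraints $a_{\omega}^{T}x+b_{\omega}\leq 0$, a box $Y=[\text{lower},\text{upper}]^{n}$, uniform sampling of the constraint indices (the paper allows a general distribution $\textbf{P}$), and the stepsize $\alpha_{k}=4/(\mu(k+1))$ from Theorem 3.1. All function and argument names are hypothetical.

```python
import random


def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))


def clip(x, lower, upper):
    """Projection onto the box Y = [lower, upper]^n."""
    return [min(max(xi, lower), upper) for xi in x]


def sequential_minibatch_subgradient(subgrad_f, rows, offsets, lower, upper,
                                     x0, mu, n_batch, beta, iters, seed=0):
    """Sketch of (38a)-(38c) for constraints rows[i]^T x + offsets[i] <= 0."""
    rng = random.Random(seed)
    x = clip(x0, lower, upper)
    for k in range(1, iters + 1):
        alpha = 4.0 / (mu * k)          # alpha_{k-1} = 4 / (mu * ((k-1) + 1))
        s = subgrad_f(x)
        # (38a): projected subgradient step on the objective.
        z = clip([xi - alpha * si for xi, si in zip(x, s)], lower, upper)
        # (38b): N sequential feasibility updates, each started from the last one.
        for _ in range(n_batch):
            i = rng.randrange(len(rows))    # assumption: uniform sampling
            a, b = rows[i], offsets[i]
            g_plus = max(dot(a, z) + b, 0.0)
            if g_plus > 0.0:                # if g_plus == 0 the step is zero anyway
                coeff = beta * g_plus / dot(a, a)
                z = clip([zj - coeff * aj for zj, aj in zip(z, a)], lower, upper)
        x = z                               # (38c)
    return x
```

As a small usage example, take $f(x)=\frac{1}{2}\|x-(1,1)\|^{2}$ (so $\mu=1$), the single constraint $x_{1}+x_{2}\leq 0$ with a normalized row, and $Y=[-2,2]^{2}$; the constrained minimizer is the origin, and with $\beta=1$ the iterates reach it after two iterations.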

In this section we analyze the convergence properties of this new algorithm (38). Given $x_{k-1}$ , the update of $v_{k}$ is the same as in the parallel method (3), thus Lemma 2 still applies here. We need an analog of Lemma 5.

Lemma 7

Let Assumptions 1(a) and 1(c) hold. Let $x_{k}$ be generated by algorithm (38) with $\beta\in(0,\;2)$ . Then, the following relations are valid:

[TABLE]

Proof

We start with the definition of $z_{k}^{i}$ in (38b) and Lemma 3, with $Z=Y$ . Thus, we obtain for all $y\in X$ (which satisfies $g_{\omega_{k}^{i}}^{+}(y)=0$ for any realization of $\omega_{k}^{i}$ ) and for all $i=1,\ldots,N$ ,

[TABLE]

By using $\|d_{k}^{i}\|^{2}\leq M_{g}^{2}$ , we have for all $i=1,\ldots,N$ ,

[TABLE]

The distance relation for $z$ -iterates follows by taking the minimum over $y\in X$ on both sides of inequality (39). By summing relations (39) over $i=1,\ldots,N$ , and by using $z^{0}_{k}=v_{k}$ and $z^{N}_{k}=x_{k}$ , we obtain for any $y\in X$ ,

[TABLE]

The distance relation follows by taking the minimum over $y\in X$ on both sides of the preceding inequality. $\square$

Taking $\rho=1/2$ in Lemma 2 we get:

[TABLE]

and using the inequality for $\|x_{k}-y\|^{2}$ from Lemma 7 with $y=x^{*}$ yields:

[TABLE]

Taking the conditional expectation on $\mathscr{F}_{k-1}$ and $z_{k}^{i-1}$ , and using Assumption 2, gives

[TABLE]

Using the iterated expectation rule, we obtain

[TABLE]

which, when combined with the distance relation of Lemma 7 gives for all $i=1,\ldots,N$

[TABLE]

Recall that $cM_{g}^{2}\geq 1$ according to Lemma 1. In the subsequent analysis we consider the non-trivial case $cM_{g}^{2}>1$ . The ideal case $cM_{g}^{2}=1$ allows one to obtain feasibility in expectation in one step and yields a convergence rate result similar to that in Remark 3. Hence, using the definition of $x_{k}$ , i.e., $x_{k}=z_{k}^{N}$ , and letting $q=\frac{\beta(2-\beta)}{cM_{g}^{2}}\in(0,1)$ (since we assume $cM_{g}^{2}>1$ and $\beta\in(0,2)$ ), we have for all $i=1,\ldots,N$ ,

[TABLE]

implying that for all $i=1,\ldots,N$ ,

[TABLE]

From (41) and (42) for all $i=1,\ldots,N$ ,

[TABLE]

By summing over $i$

[TABLE]

However,

[TABLE]

Finally, we get

[TABLE]

and consequently

[TABLE]

Let us denote $b_{N}^{s}=(1-q)^{-N}-1$ . It is clear that $b_{N}^{s}\to\infty$ as $N\to\infty$ . Taking expectation in (40) and using the previous inequality we get an analog of Lemma 6:

[TABLE]

for any $\eta>0$ . Let us consider the same stepsize as for the parallel scheme, i.e. $\alpha_{k}=\frac{2}{\mu}\gamma_{k}$ , choose $\eta=\frac{1}{2}\left(1-\frac{\alpha_{k}\mu}{2}\right)b_{N}^{s}>0$ , and take the full expectation, to get the following recurrence (analog to Theorem 2.1):

[TABLE]

Using now $\gamma_{k}=\frac{2}{k+1}$ , then $1-\gamma_{k}\geq\frac{1}{3}$ and we get:

[TABLE]

Since $\frac{1-\gamma_{k}}{\gamma_{k}^{2}}\leq\frac{1}{\gamma_{k-1}^{2}}$ for all $k\geq 1$ , dividing (43) by $\gamma_{k}^{2}$ and using the preceding inequality we have for all $k\geq 1$ :

[TABLE]

Summing these over $k=1,\ldots,t$ , for some $t>0$ , we obtain the following recurrence relation for the algorithm (38):

[TABLE]

Using the same definition for the weighted averages $\hat{w}_{t}$ and $\hat{x}_{t}$ from (32) and $\gamma_{k}=\frac{2}{k+1}$ in (45), we get the main recurrence for the sequential variant (38):

[TABLE]

The next theorem summarizes the convergence rates that follow from the recurrence relation (46) of the sequential algorithm (38).

Theorem 3.1

Let Assumption 1 and Assumption 2 hold, and let the stepsizes be $\beta\in(0,2)$ and $\alpha_{k}=\frac{4}{\mu(k+1)}$ . Let also $q=\frac{\beta(2-\beta)}{cM_{g}^{2}}<1$ and $b_{N}^{s}=(1-q)^{-N}-1$ . Then, the following sublinear rates for suboptimality and feasibility violation hold for the average sequence $\hat{x}_{t}$ from (32) generated by the sequential algorithm (38):

[TABLE]

Proof

Defining the same average sequences $\hat{w}_{t}$ and $\hat{x}_{t}$ as in (32), we get the following convergence rates (omitting the constants but keeping the terms depending on $b_{N}^{s}$ ):

[TABLE]

Hence, we get the following convergence rate for the feasibility violation of the constraints that depends explicitly on the minibatch size $N$ via the term $b_{N}^{s}$ :

[TABLE]

Using the same reasoning as in the proof of Theorem 2.2, we also get the following convergence rate for suboptimality:

[TABLE]

which proves the statements of the theorem. $\square$

We observe that for the sequential algorithm (38), too, the convergence estimate for the feasibility violation depends explicitly on the minibatch size $N$ via the term $b_{N}^{s}$ (recall that $b_{N}^{s}\to\infty$ as $N\to\infty$ ). Since $b_{N}^{s}$ is increasing in $N$ , the larger the minibatch size $N$ , the better the complexity of the sequential algorithm (38) in terms of constraint feasibility. In conclusion, for the sequential variant our rates prove that minibatching always helps and the feasibility estimate depends exponentially on the minibatch size $N$ . On the other hand, the suboptimality estimate contains a term which does not depend on the minibatch size $N$ , unlike the feasibility violation estimate. Recall that for the parallel algorithm we proved that minibatching works only for $L_{N}<1$ and the estimates depend linearly on $L_{N}$ .
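The exponential dependence on $N$ is easy to see numerically. The sketch below evaluates $b_{N}^{s}=(1-q)^{-N}-1$ with $q=\beta(2-\beta)/(cM_{g}^{2})$ as in Theorem 3.1; the function name and the argument `c_mg_sq` (standing for $cM_{g}^{2}$) are illustrative, and the values in the usage note are made up for the example.

```python
def sequential_constant(beta, c_mg_sq, n_batch):
    """b_N^s = (1 - q)^(-N) - 1 with q = beta * (2 - beta) / (c * M_g^2).

    Requires beta in (0, 2) and c * M_g^2 > 1 (the non-trivial case).
    (Illustrative helper, not part of the paper.)
    """
    if not 0.0 < beta < 2.0:
        raise ValueError("beta must lie in (0, 2)")
    if c_mg_sq <= 1.0:
        raise ValueError("the non-trivial case requires c * M_g^2 > 1")
    q = beta * (2.0 - beta) / c_mg_sq
    return (1.0 - q) ** (-n_batch) - 1.0
```

With $\beta=1$ and $cM_{g}^{2}=2$ we get $q=1/2$, hence $b_{N}^{s}=2^{N}-1$: the values for $N=1,3,10$ are $1$, $7$ and $1023$.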

4 Extensions

In this section we discuss some possible extensions of the framework presented in this paper related to the objective function, algorithms and stepsizes. Some of these extensions will be considered in our future work.

First, from our convergence analysis it is easy to note that the derivations remain valid for a larger class of objective functions in the model (1). More precisely, we can replace the boundedness of the subgradients of $f$ (Assumption 1(b)), i.e. $\|s_{f}(x)\|\leq M_{f}$ , with a more general assumption, namely that there exist two constants $M_{f,1},M_{f,2}\geq 0$ such that the (sub)gradients of $f$ satisfy the following inequality:

[TABLE]

Clearly, this condition covers the class of functions with bounded subgradients, e.g. take $M_{f,2}=0$ , and also the class of functions with Lipschitz continuous gradients Nes:04 . Indeed, if there is $L_{f}>0$ such that the gradients $\nabla f$ satisfy:

[TABLE]

then we immediately get

[TABLE]

which proves our inequality for $M_{f,1}=\max_{x^{*}\in X^{*}}\|\nabla f(x^{*})\|$ and $M_{f,2}=L_{f}$ . Our convergence analysis can be easily adapted to this more general assumption; however, the recurrence relations become more cumbersome. For example, the recurrence from Lemma 2 now becomes:

[TABLE]
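The passage from Lipschitz continuity of the gradient to the growth condition on the gradient norm discussed above is a one-line triangle inequality. A sketch, assuming $x^{*}$ denotes a fixed optimal point and that the Lipschitz condition holds with constant $L_{f}$ on the region containing $x$ and $x^{*}$:

```latex
\begin{align*}
\|\nabla f(x)\|
  &\leq \|\nabla f(x^{*})\| + \|\nabla f(x) - \nabla f(x^{*})\| \\
  &\leq \|\nabla f(x^{*})\| + L_{f}\,\|x - x^{*}\| .
\end{align*}
```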

Second, when the objective function $f$ has an easy proximal operator we can replace the subgradient steps (3a) and (38a) by a proximal point step:

[TABLE]

An algorithm combining the proximal point step with a single feasibility step (i.e., $N=1$ ) has been considered in PatNec:17 and convergence rates of order $\mathcal{O}(1/t)$ have been proved provided that the objective function is smooth (i.e., it has Lipschitz continuous gradient) and strongly convex. Note that it is easy to extend that convergence analysis to the minibatch settings following the framework developed in this paper.

The third extension is also related to the objective function: we consider $f$ in the composite form, i.e.:

[TABLE]

where $f_{1}$ is smooth and $f_{2}$ can be non-smooth but admits an easy proximal operator. Note that if the set $Y$ is present in the optimization model (1), then it can be included in $f_{2}$ as the indicator function. For this composite objective function, steps (3a) and (38a) can be replaced by:

[TABLE]

Note that for $f_{2}(x)=1_{Y}(x)$ , the indicator function of the convex set $Y$ , we recover the updates (3a) and (38a). Hence, it will be interesting to extend our convergence analysis to this general composite objective function $f$ .
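A composite step of this kind can be sketched as a forward (gradient) step on $f_{1}$ followed by the proximal operator of $f_{2}$. The code below assumes the standard proximal gradient form $v=\mathrm{prox}_{\alpha f_{2}}(x-\alpha\nabla f_{1}(x))$; the paper's exact display is not reproduced here, and all function names are hypothetical. It shows two choices of $f_{2}$: the indicator of a box, whose proximal operator is the projection, and a scaled $\ell_{1}$ norm, whose proximal operator is soft thresholding.

```python
def prox_gradient_step(x, grad_f1, prox_f2, alpha):
    """One step v = prox_{alpha * f2}(x - alpha * grad_f1(x)) (assumed form)."""
    forward = [xi - alpha * gi for xi, gi in zip(x, grad_f1(x))]
    return prox_f2(forward, alpha)


def box_projection(lower, upper):
    """Prox of the indicator of the box [lower, upper]^n: the projection."""
    def prox(point, _alpha):
        return [min(max(p, lower), upper) for p in point]
    return prox


def soft_threshold(lam):
    """Prox of lam * ||.||_1: componentwise soft thresholding."""
    def prox(point, alpha):
        t = alpha * lam
        return [max(abs(p) - t, 0.0) * (1.0 if p >= 0.0 else -1.0)
                for p in point]
    return prox
```

With `box_projection`, the step reduces to a projected gradient step of the form (38a) for a smooth objective, which mirrors the remark that the indicator function of $Y$ recovers the original updates.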

Finally, in the parallel algorithm (see (3)) the feasibility steps depend on an extrapolated stepsize $\beta\in(0,\,2/L_{N})$ . When $L_{N}$ cannot be computed explicitly, we propose to approximate it online with $L_{N}^{k}$ , and use at each iteration an adaptive extrapolated stepsize $\beta_{k}$ of the form:

$\displaystyle \beta_{k}=\frac{2-\delta}{L_{N}^{k}}$ (48)

for some $\delta\in(0,\;2)$ sufficiently small. The convergence rate of the parallel algorithm for this adaptive choice (48) of the stepsize $\beta_{k}$ will be analyzed in our future work (see e.g., NecNed:19 for some preliminary results related to the convex feasibility problem).
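For linear constraints, one parallel feasibility step with the adaptive stepsize $\beta_{k}=(2-\delta)/L_{N}^{k}$ can be sketched as follows. The code rests on several assumptions that are not confirmed by the text shown here: that the parallel updates (3b) and (3c) average the individual steps $z_{k}^{i}=v_{k}-\beta\,\frac{g^{+}_{\omega_{k}^{i}}(v_{k})}{\|d_{k}^{i}\|^{2}}d_{k}^{i}$ and then project onto $Y$, that $L_{N}^{k}$ is the ratio described after (15), and that $Y$ is a box. All names are hypothetical.

```python
def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))


def parallel_feasibility_step(v, rows, offsets, lower, upper, delta=0.1):
    """Averaged feasibility step with beta_k = (2 - delta) / L_N^k (sketch).

    Constraints are rows[i]^T x + offsets[i] <= 0 and Y = [lower, upper]^n.
    """
    n_batch = len(rows)
    steps = []          # u_i = (g_i^+(v) / ||a_i||^2) * a_i
    sq_sum = 0.0        # sum_i (g_i^+(v))^2 / ||a_i||^2
    for a, b in zip(rows, offsets):
        g_plus = max(dot(a, v) + b, 0.0)
        norm_sq = dot(a, a)
        steps.append([g_plus / norm_sq * a_j for a_j in a])
        sq_sum += g_plus * g_plus / norm_sq
    if sq_sum == 0.0:   # no sampled constraint is violated at v
        return [min(max(v_j, lower), upper) for v_j in v]
    mean_step = [sum(u[j] for u in steps) / n_batch for j in range(len(v))]
    ratio = dot(mean_step, mean_step) / (sq_sum / n_batch)   # assumed L_N^k
    if ratio == 0.0:
        raise ValueError("averaged step vanished; sampled constraints look inconsistent")
    beta_k = (2.0 - delta) / ratio
    z_bar = [v_j - beta_k * m_j for v_j, m_j in zip(v, mean_step)]
    return [min(max(z_j, lower), upper) for z_j in z_bar]
```

As a sanity check, for two orthonormal constraint rows the ratio equals $1/2$, so the choice $\delta=1$ gives $\beta_{k}=2$ and the step lands exactly on the intersection of the two half-spaces, whereas the non-extrapolated choice $\beta=1$ would only move halfway.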

5 Preliminary numerical results

Many data-driven optimization applications can be formulated as convex optimization problems with the objective function composed of a quadratic term and a regularizer and constraints (so-called constrained Lasso) of the form:

[TABLE]

where the problem is parametrised by the data (measurements) $y$ , $H$ is an appropriate linear operator (e.g., the forward operator or the circular convolution) and $D$ is another linear operator (e.g., the identity, the finite difference or the wavelet transform). Additionally, we impose constraints of the form $x\in X=[l,\;u]\cap\{x:\;Ax+b\leq 0\}$ , where $A\in\mathbb{R}^{m\times n}$ . The constrained Lasso arises, e.g., in image deblurring or denoising, computerised tomography or some inverse problems, see, e.g., BotHei:12 . Note that for this formulation the strong convexity Assumption 1(b) holds for matrices $H$ with full column rank (see, e.g., NecNes:15 ) and the linear regularity Assumption 2 also holds (see, e.g., NecRic:18 ). Moreover, the set $Y=[l,\,u]$ is compact, so that the objective function has bounded subgradients, and the functional constraints $g_{\omega}(x)=a_{\omega}^{T}x+b_{\omega}$ are linear; consequently, Assumption 1 also holds.

In our experiments we use synthetic data, where $H$ is a Toeplitz-like matrix and $D$ is the finite difference operator (as in image deblurring BotHei:12 ). We also generate $A$ randomly, with $m=3n$ constraints. We consider a partition of $\mathscr{A}=\{1,2,\cdots,m\}$ into subsets of equal size $N$ , i.e., $\mathscr{A}=\cup_{i=1}^{\ell}J_{i}$ . Hence, $m=N\cdot\ell$ . We compute $L_{N}$ as in (37) for this partition. We consider full iterations, i.e. we plot the behavior of the algorithms over epochs $tN/m$ (number of passes over all the rows of matrix $A$ ).

In the first set of experiments we compare the parallel (see (3)) and sequential (see (38)) algorithms for different minibatch sizes $N=1,50$ and $100$ on a constrained Lasso problem with $n=10^{3}$ . The plots in Fig. 1 present the convergence behavior of these algorithms in terms of feasibility violation of the average point over full iterations $tN/m$ : parallel algorithm (left) and sequential algorithm (right). As we can see from Fig. 1, increasing the minibatch size $N$ usually leads to better convergence for both algorithms.

Then, we compare the parallel algorithm with the extrapolated stepsize $\beta=1.9/L_{N}$ and the sequential algorithm with $\beta=1.9$ . The results on a problem of dimension $n=10^{3}$ and minibatch size $N=10$ are displayed in Fig. 2: suboptimality (left) and feasibility violation (right) in the average point over full iterations. We observe a faster convergence for the sequential algorithm, as our theory also predicted.

Finally, we compare the parallel algorithm (3) based on our extrapolated stepsize $\beta=1.9/L_{N}$ and a variant with fixed stepsize $\beta=1.9$ . The results on a constrained Lasso problem of dimension $n=10^{3}$ and minibatch size $N=10$ are displayed in Fig. 3: suboptimality (left) and feasibility violation (right). We observe that the extrapolation $\beta=1.9/L_{N}>2$ substantially accelerates the parallel algorithm in terms of the feasibility criterion. Note also that all the plots show an $\mathcal{O}(1/t)$ rate for the average sequence in the feasibility criterion, thus supporting our theoretical findings.

6 Conclusions

In this paper we have considered (non-smooth) convex optimization problems with (possibly) infinite intersection of constraints. For solving this general class of convex problems we have proposed subgradient algorithms with random minibatch feasibility steps. At each iteration, our algorithms first take a step for minimizing the objective function and then a subsequent step minimizing the feasibility violation of the observed minibatch of constraints. The feasibility updates were performed based on either parallel or sequential random observations of several constraint components. For a diminishing stepsize and for strongly convex objective functions, we have proved sublinear convergence rates for the expected distances of the weighted averages of the iterates from the constraint set, as well as for the expected suboptimality of the function values along the weighted averages. Our convergence rates are optimal for subgradient methods with random feasibility steps for solving this class of non-smooth convex problems. Moreover, the rates depend explicitly on the minibatch size. To the best of our knowledge, this work is the first to derive conditions under which minibatching works for subgradient methods with random minibatch feasibility updates and to quantify how much better their complexity is compared to the non-minibatch variants. Finally, our convergence analysis shows that for the sequential algorithm minibatching always helps and the feasibility estimate depends exponentially on the minibatch size, while for the parallel algorithm we proved that minibatching works only when some parameter of the optimization problem is strictly less than 1. The numerical results also support the convergence results.

Acknowledgements.

This research was supported by the National Science Foundation under CAREER grant CMMI 07-42538 and by the Executive Agency for Higher Education, Research and Innovation Funding (UEFISCDI), Romania, PNIII-P4-PCE-2016-0731, project ScaleFreeNet, no. 39/2017.

Bibliography


(1) D. Blatt and A.O. Hero, Energy based sensor network source localization via projection onto convex sets, IEEE Transactions on Signal Processing, 54(9): 3614–3619, 2006.
(2) H. Bauschke and J. Borwein, On projection algorithms for solving convex feasibility problems, SIAM Review, 38(3): 367–376, 1996.
(3) R.I. Bot and C. Hendrich, A double smoothing technique for solving unconstrained nondifferentiable convex optimization problems, Computational Optimization and Applications, 54(2): 239–262, 2013.
(4) R.I. Bot and T. Hein, Iterative regularization with general penalty term - theory and application to $L_{1}$ and TV regularization, Inverse Problems, 28(10): 1–19, 2012.
(5) J. Burke and M. Ferris, Weak sharp minima in mathematical programming, SIAM Journal of Control and Optimization, 31(6): 1340–1359, 1993.
(6) P. Bianchi, W. Hachem, and A. Salim, A constant step forward-backward algorithm involving random maximal monotone operators, arXiv preprint (arXiv:1702.04144), 2017.
(7) D.P. Bertsekas, Incremental proximal methods for large scale convex optimization, Mathematical Programming, 129(2): 163–195, 2011.
(8) O. Fercoq, A. Alacaoglu, I. Necoara and V. Cevher, Almost surely constrained convex optimization, International Conference on Machine Learning (ICML), 2019.
