A Fully Stochastic Primal-Dual Algorithm

Pascal Bianchi; Walid Hachem; Adil Salim

arXiv:1901.08170·math.OC·June 23, 2020·Optim. Lett.

TL;DR

This paper introduces a novel stochastic primal-dual algorithm designed for composite optimization problems where functions are given as unknown statistical expectations, with proven convergence to a saddle point.

Contribution

It presents a fully stochastic primal-dual method with convergence guarantees, extending the stochastic Forward Backward algorithm to new composite optimization settings.

Findings

01. Proven convergence to saddle points under stochastic conditions

02. Applicable to convex optimization with stochastic linear constraints

03. Utilizes recent advances in stochastic monotone operator theory

Abstract

A new stochastic primal-dual algorithm for solving a composite optimization problem is proposed. It is assumed that all the functions/operators that enter the optimization problem are given as statistical expectations. These expectations are unknown but revealed across time through i.i.d. realizations. The proposed algorithm is proven to converge to a saddle point of the Lagrangian function. In the framework of monotone operator theory, the convergence proof relies on recent results on the stochastic Forward-Backward algorithm involving random monotone operators. An example of convex optimization under stochastic linear constraints is considered.

Equations

$$\min_{x\in\mathcal{X}} \mathsf{F}(x)+\mathsf{G}(x)+\mathsf{H}(\mathsf{L}x)$$

$$\mathfrak{S}^{r}_{\partial h(\cdot,x)} := \{\varphi\in\mathscr{L}^{r}(\mu) \,:\, \varphi(s)\in\partial h(s,x)\ \ \mu\text{-almost everywhere (a.e.)}\}$$

$$\mathbb{E}_{\mu}\partial h(\cdot,x) := \operatorname{cl}\left\{\int_{\Xi}\varphi\,d\mu \,:\, \varphi\in\mathfrak{S}^{1}_{\partial h(\cdot,x)}\right\}$$

$$\int_{\{s\,:\,g(s,x)\in[0,\infty)\}} g(s,x)\,\mu(ds) + \int_{\{s\,:\,g(s,x)\in]-\infty,0[\}} g(s,x)\,\mu(ds) + I(x),$$

$$I(x)=\begin{cases}+\infty, & \text{if } \mu(\{s\,:\,g(s,x)=\infty\})>0,\\ 0, & \text{otherwise},\end{cases}$$

$$\begin{cases} 0\in\partial\mathsf{F}(x)+\partial\mathsf{G}(x)+\mathsf{L}^{T}\lambda,\\ 0\in-\mathsf{L}x+\partial\mathsf{H}^{\star}(\lambda).\end{cases}$$

$$\begin{aligned} x_{n+1} &= \operatorname{prox}_{\gamma_{n+1}g(\xi_{n+1},\cdot)}\big(x_{n}-\gamma_{n+1}(\widetilde{\nabla}f(\xi_{n+1},x_{n})+L(\xi_{n+1})^{T}\lambda_{n})\big),\\ \lambda_{n+1} &= \operatorname{prox}_{\gamma_{n+1}p(\xi_{n+1},\cdot)}\big(\lambda_{n}+\gamma_{n+1}L(\xi_{n+1})x_{n}\big).\end{aligned}$$

$$\bar{x}_{n}=\frac{\sum_{k=1}^{n}\gamma_{k}x_{k}}{\sum_{k=1}^{n}\gamma_{k}},\quad\text{and}\quad \bar{\lambda}_{n}=\frac{\sum_{k=1}^{n}\gamma_{k}\lambda_{k}}{\sum_{k=1}^{n}\gamma_{k}}.$$

$$\mathbb{E}_{\mu}\varphi_{f}+\mathbb{E}_{\mu}\varphi_{g}+\mathsf{L}^{T}\lambda_{\star}=0,\quad\text{and}\quad -\mathsf{L}x_{\star}+\mathbb{E}_{\mu}\varphi_{p}=0.$$

$$\sup_{x\in K}\mathbb{E}_{\mu}\|\partial_{0}g(\cdot,x)\|^{1+\varepsilon}<+\infty,\quad\text{and}\quad \mathbb{E}_{\mu}\|\partial_{0}g(\cdot,x_{0})\|^{1+1/\varepsilon}<+\infty.$$

$$\|\widetilde{\nabla}f(s,x)\|\leq\beta(s)(1+\|x\|).$$

$$\mathbb{E}_{\mu}\operatorname{dist}(x,D_{\partial g}(\cdot))^{2}\geq C\operatorname{dist}(x,\operatorname{dom}\partial\mathsf{G})^{2}.$$

$$\int\|\operatorname{prox}_{\gamma g(s,\cdot)}(x)-\Pi_{g}(s,x)\|^{4}\,\mu(ds)\leq C\gamma^{4}(1+\|x\|^{2m}),$$

$$g(s,x)=\iota_{\mathcal{C}_{s}}(x)\quad\text{for } (s,x)\in\Xi\times\mathcal{X}.$$

$$g(s,x)=\begin{cases}\alpha(0)^{-1}h(u,x) & \text{if } k=0,\\ \iota_{\mathcal{C}_{k}}(x) & \text{otherwise},\end{cases}$$

$$\mathsf{G}(x)=\frac{1}{\alpha(0)}\int_{\Sigma}h(u,x)\,\nu(du)+\iota_{\mathcal{C}}(x),$$

$$\partial\mathsf{G}(x)=\mathbb{E}_{\mu}\partial g(\cdot,x)=\frac{1}{\alpha(0)}\mathbb{E}_{\nu}\partial h(\cdot,x)+\sum_{k=1}^{m}N_{\mathcal{C}_{k}}(x).$$

$$\max_{i=1,\dots,m}\operatorname{dist}(x,\mathcal{C}_{i})\geq C\operatorname{dist}(x,\mathcal{C}).$$

$$\|\partial_{0}g(s,x)\|\leq\beta(s)(1+\|x\|^{m/2}),$$

$$\min_{x\in\mathcal{X}}\mathsf{F}(x)+\mathsf{G}(x)\quad\text{s.t.}\quad \mathsf{L}x=\mathsf{c}.$$

$$x_{n+1}=(I+\gamma\mathsf{A})^{-1}(x_{n}-\gamma\mathsf{B}(x_{n}))$$

$$A(s,(x,\lambda))=\begin{bmatrix}\partial g(s,x)\\ \partial p(s,\lambda)\end{bmatrix},$$

$$B(s,(x,\lambda))=\begin{bmatrix}\partial f(s,x)+L(s)^{T}\lambda\\ -L(s)x\end{bmatrix}.$$

$$B_{1}(s,(x,\lambda))=\begin{bmatrix}\partial f(s,x)\\ 0\end{bmatrix},\quad\text{and}\quad B_{2}(s)=\begin{bmatrix}0 & L(s)^{T}\\ -L(s) & 0\end{bmatrix}$$

$$\mathsf{A}((x,\lambda))=\begin{bmatrix}\partial\mathsf{G}(x)\\ \partial\mathsf{H}^{\star}(\lambda)\end{bmatrix},\quad\text{and}\quad \mathsf{B}((x,\lambda))=\begin{bmatrix}\partial\mathsf{F}(x)+\mathsf{L}^{T}\lambda\\ -\mathsf{L}x\end{bmatrix},$$

$$b(s,(x,\lambda))=\begin{bmatrix}\widetilde{\nabla}f(s,x)+L(s)^{T}\lambda\\ -L(s)x\end{bmatrix}$$

$$(x_{n+1},\lambda_{n+1})=(I+\gamma_{n+1}A(\xi_{n+1},\cdot))^{-1}\big((x_{n},\lambda_{n})-\gamma_{n+1}b(\xi_{n+1},(x_{n},\lambda_{n}))\big).$$

Full text

Pascal Bianchi

LTCI, Télécom Paris, IP Paris, 75013, Paris, France.

Walid Hachem

LIGM, CNRS, Univ. Gustave Eiffel, F-77454 Marne-la-Vallée, France

Adil Salim

Visual Computing Center, KAUST, Saudi Arabia.

(27 January 2020)

1 Introduction

Many applications in machine learning, statistics or signal processing require the solution of the following optimization problem. Given two Euclidean spaces ${\mathcal{X}}$ and ${\mathcal{V}}$ , solve

$$\min_{x\in\mathcal{X}} \mathsf{F}(x)+\mathsf{G}(x)+\mathsf{H}(\mathsf{L}x) \tag{1}$$

where ${\mathsf{F}},{\mathsf{G}}$ and ${\mathsf{H}}$ are lower semicontinuous convex functions such that ${\mathsf{F}}(x)<\infty$ for every $x$ and ${\mathsf{L}}$ belongs to the set ${{\mathcal{L}}}({\mathcal{X}},{\mathcal{V}})$ of ${\mathcal{X}}\to{\mathcal{V}}$ linear operators.

Assuming that the qualification condition $0\in\operatorname{ri}(\operatorname{dom}{\mathsf{H}}-{\mathsf{L}}\operatorname{dom}{\mathsf{G}})$ holds, where $\operatorname{dom}$ denotes the domain of a function and $\operatorname{ri}$ the relative interior of a set, primal-dual methods generate a sequence of primal estimates $(x_{n})_{n\in{{\mathbb{N}}}}$ and a sequence of dual estimates $(\lambda_{n})_{n\in{{\mathbb{N}}}}$ that jointly converge to a saddle point of the Lagrangian function $(x,\lambda)\mapsto{\mathsf{F}}(x)+{\mathsf{G}}(x)-{\mathsf{H}}^{\star}(\lambda)+\langle{\mathsf{L}}x,\lambda\rangle$, where ${\mathsf{H}}^{\star}$ is the Fenchel conjugate of ${\mathsf{H}}$. The literature on such algorithms is rich and cannot be listed exhaustively; see, e.g., [10, 22, 14].

In this paper, it is assumed that the quantities that enter the minimization problem are unavailable or difficult to compute numerically, and have to be replaced with random quantities. Specifically, let $(\Xi,{\mathscr{G}},\mu)$ be a probability space, and let $f:\Xi\times{\mathcal{X}}\rightarrow{{\mathbb{R}}}$ and $g:\Xi\times{\mathcal{X}}\rightarrow(-\infty,+\infty]$ be two convex normal integrands (see below). Assume that ${\mathsf{F}}(x)={{\mathbb{E}}}_{\mu}(f(\cdot,x))$ and ${\mathsf{G}}(x)={{\mathbb{E}}}_{\mu}(g(\cdot,x))$. In addition, let $L$ be a measurable function from $(\Xi,{\mathscr{G}},\mu)$ to ${{\mathcal{L}}}({\mathcal{X}},{\mathcal{V}})$ (i.e., a random matrix), and assume that ${\mathsf{L}}={{\mathbb{E}}}_{\mu}L(\cdot)$. Finally, assume that ${\mathsf{H}}^{\star}$ takes the form ${\mathsf{H}}^{\star}(\lambda)={{\mathbb{E}}}_{\mu}(p(\cdot,\lambda))$, where $p$ is a convex normal integrand. None of the objects ${\mathsf{F}}$, ${\mathsf{G}}$, ${\mathsf{H}}$ and ${\mathsf{L}}$ needed to solve Problem (1) is available. Instead, the observer is given the functions $f$, $g$, $p$, and $L$, along with a sequence of independent and identically distributed (i.i.d.) random variables $(\xi_{n})$ with the probability distribution $\mu$. In this paper, a new stochastic primal-dual algorithm based on these data is proposed to solve this problem. The convergence proof for this algorithm relies on monotone operator theory. The algorithm is built around an instantiation of the stochastic Forward-Backward (FB) algorithm involving random monotone operators that was introduced in [6]. It is proven that the weighted means of the iterates of the algorithm, where the weights are given by the step sizes, converge almost surely to a saddle point of the Lagrangian function.

To our knowledge, the proposed algorithm is the first method that solves Problem (1) in a fully stochastic setting with weak assumptions on the noise. Existing methods typically handle subproblems of Problem (1) in which some of the quantities involved are assumed to be available or set to zero [16, 20, 21, 23]. In particular, the new algorithm generalizes the stochastic gradient algorithm, the stochastic proximal point algorithm [17, 21, 5], and the stochastic proximal gradient algorithm [1, 8]. A paper closely related to ours is [11], which deals with an FB algorithm with deterministic monotone operators and random additive errors. In this reference, the convergence of the iterates is established under stringent summability conditions on these errors. Random block coordinate iterations combined with the FB algorithm were also considered in [13, 7, 12].

The next section is devoted to rigorously stating the problem and the main result. An application example is also considered. Section 3 is devoted to the proof of our main theorem.

Some notations.

The notation ${\mathscr{B}}({\mathcal{X}})$ will refer to the Borel $\sigma$-field of ${\mathcal{X}}$. Both the operator norm and the Euclidean vector norm will be denoted as $\|\cdot\|$. The distance of a point $x$ to a set $S$ is denoted as $\operatorname{dist}(x,S)$. As mentioned above, we denote as ${{\mathcal{L}}}({\mathcal{X}},{\mathcal{V}})$ the set of linear operators, identified with matrices, from ${\mathcal{X}}$ to ${\mathcal{V}}$. The set of proper, lower semicontinuous convex functions on ${\mathcal{X}}$ is $\Gamma_{0}({\mathcal{X}})$. The set of real-valued $k$-summable sequences is $\ell^{k}$.

2 Problem description and main result

We start by recalling some mathematical definitions. Let $(\Xi,{\mathscr{G}},\mu)$ be a probability space where the $\sigma$-field ${\mathscr{G}}$ is $\mu$-complete, and let ${\mathcal{X}}$ be a Euclidean space. A function $h:\Xi\times{\mathcal{X}}\to(-\infty,\infty]$ is said to be a convex normal integrand [19] if $h(s,\cdot)$ is convex, and if the set-valued mapping $s\mapsto\operatorname{epi}h(s,\cdot)$ is closed-valued and measurable in the sense of [19, Chap. 14], where $\operatorname{epi}$ is the epigraph of a function. We shall always assume that $h(s,\cdot)\in\Gamma_{0}({\mathcal{X}})$ for $\mu$-almost all $s\in\Xi$. Given $x\in{\mathcal{X}}$, denote as $\partial h(s,x)$ the subdifferential of $h(s,\cdot)$ at $x$. For $r\in[1,\infty)$, let ${\mathscr{L}}^{r}(\mu)$ be the space of the ${{\mathscr{G}}}$-measurable functions $\varphi:\Xi\to{\mathcal{X}}$ such that $\int\|\varphi\|^{r}d\mu<\infty$. If $\mu(\{s\in\Xi\,:\,\partial h(s,x)\neq\emptyset\})<1$, set ${\mathfrak{S}}^{r}_{\partial h(\cdot,x)}:=\emptyset$; otherwise,

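$$\mathfrak{S}^{r}_{\partial h(\cdot,x)} := \{\varphi\in\mathscr{L}^{r}(\mu) \,:\, \varphi(s)\in\partial h(s,x)\ \ \mu\text{-almost everywhere (a.e.)}\}$$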

is the set of the so-called $r$ –integrable selections of the measurable set-valued function $s\mapsto\partial h(s,x)$ . Denoting as $\operatorname{cl}$ the closure of a set, the so-called selection integral of $\partial h(\cdot,x)$ is the set

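$$\mathbb{E}_{\mu}\partial h(\cdot,x) := \operatorname{cl}\left\{\int_{\Xi}\varphi\,d\mu \,:\, \varphi\in\mathfrak{S}^{1}_{\partial h(\cdot,x)}\right\} \tag{2}$$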

that might be empty. Note that we use the same notation ${{\mathbb{E}}}_{\mu}$ for these set-valued expectations and for the classical single-valued expectations.

We now state our problem. Let $f:\Xi\times{\mathcal{X}}\to(-\infty,\infty]$ be a convex normal integrand, assume that ${{\mathbb{E}}}_{\mu}|f(\cdot,x)|<\infty$ for all $x\in{\mathcal{X}}$, and consider the convex function ${\mathsf{F}}(x):={{\mathbb{E}}}_{\mu}f(\cdot,x)$, whose domain is ${\mathcal{X}}$. Let $g:\Xi\times{\mathcal{X}}\to(-\infty,\infty]$ be a convex normal integrand, and let ${\mathsf{G}}(x):={{\mathbb{E}}}_{\mu}g(\cdot,x)$, where the integral ${{\mathbb{E}}}_{\mu}$ is defined as the sum

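$$\int_{\{s\,:\,g(s,x)\in[0,\infty)\}} g(s,x)\,\mu(ds) + \int_{\{s\,:\,g(s,x)\in]-\infty,0[\}} g(s,x)\,\mu(ds) + I(x),$$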

and

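$$I(x)=\begin{cases}+\infty, & \text{if } \mu(\{s\,:\,g(s,x)=\infty\})>0,\\ 0, & \text{otherwise},\end{cases}$$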

and where the convention $(+\infty)+(-\infty)=+\infty$ is used. The function ${\mathsf{G}}$ is a lower semicontinuous convex function if ${\mathsf{G}}(x)>-\infty$ for all $x$, which we assume. We shall also assume that ${\mathsf{G}}$ is proper. In a similar manner, let $p:\Xi\times{\mathcal{V}}\to(-\infty,\infty]$ be a convex normal integrand, assume that ${\mathsf{P}}:\lambda\mapsto{{\mathbb{E}}}_{\mu}p(\cdot,\lambda)$ belongs to $\Gamma_{0}({\mathcal{V}})$, and let ${\mathsf{H}}$ be its Fenchel conjugate (thus, ${\mathsf{H}}^{\star}={\mathsf{P}}$). Finally, let $L:\Xi\to{{\mathcal{L}}}({\mathcal{X}},{\mathcal{V}})$ be an operator-valued measurable function, assume that $\|L\|$ is $\mu$-integrable, and let ${\mathsf{L}}:={{\mathbb{E}}}_{\mu}L$.

Having introduced these functions, our purpose is to find a solution $x\in{\mathcal{X}}$ of Problem (1), where the set of such points is assumed nonempty. To solve this problem, the observer is given the functions $f,g,p,L$, and a sequence of i.i.d. random variables $(\xi_{n})_{n\in{{\mathbb{N}}}}$ from a probability space $(\Omega,{\mathscr{F}},{{\mathbb{P}}})$ to $(\Xi,{\mathscr{G}})$ with the probability distribution $\mu$.

Denote as $\operatorname{prox}_{h}(x):=\arg\min_{y\in{\mathcal{X}}}h(y)+\|y-x\|^{2}/2$ Moreau's proximity operator of a function $h\in\Gamma_{0}({\mathcal{X}})$. We also denote as $\partial_{0}h(x)$ the least norm element of the set $\partial h(x)$, which is known to exist and to be unique whenever $\partial h(x)$ is nonempty [4]. Similarly, $\partial_{0}f(s,x)$ will refer to the least norm element of $\partial f(s,x)$, which was introduced above. We shall also denote as ${{\widetilde{\nabla}}}f(s,x)$ a measurable subgradient of $f(s,\cdot)$ at $x$. Specifically, ${{\widetilde{\nabla}}}f:(\Xi\times{\mathcal{X}},{\mathscr{G}}\otimes{\mathscr{B}}({\mathcal{X}}))\to({\mathcal{X}},{\mathscr{B}}({\mathcal{X}}))$ is a measurable function such that for each $x\in{\mathcal{X}}$, ${{\widetilde{\nabla}}}f(\cdot,x)\in{\mathfrak{S}}^{1}_{\partial f(\cdot,x)}$, a set which is known to be nonempty thanks to the integrability assumption ${{\mathbb{E}}}_{\mu}|f(\cdot,x)|<\infty$ [18]. A possible choice for ${{\widetilde{\nabla}}}f(s,x)$ is $\partial_{0}f(s,x)$ [6, §2.3 and §3.1]. Turning back to Problem (1), our purpose will be to find a saddle point of the Lagrangian $(x,\lambda)\mapsto{\mathsf{F}}(x)+{\mathsf{G}}(x)-{\mathsf{H}}^{\star}(\lambda)+\langle{\mathsf{L}}x,\lambda\rangle$. Denoting as ${{\mathcal{S}}}\subset{\mathcal{X}}\times{\mathcal{V}}$ the set of these saddle points, an element $(x,\lambda)$ of ${{\mathcal{S}}}$ is characterized by the inclusions

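$$\begin{cases} 0\in\partial\mathsf{F}(x)+\partial\mathsf{G}(x)+\mathsf{L}^{T}\lambda,\\ 0\in-\mathsf{L}x+\partial\mathsf{H}^{\star}(\lambda).\end{cases} \tag{3}$$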

Consider a sequence of positive weights $(\gamma_{n})_{n\in{{\mathbb{N}}}}$ . The algorithm proposed here consists in the following iterations applied to the random vector $(x_{n},\lambda_{n})\in{\mathcal{X}}\times{\mathcal{V}}$ .

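$$\begin{aligned} x_{n+1} &= \operatorname{prox}_{\gamma_{n+1}g(\xi_{n+1},\cdot)}\big(x_{n}-\gamma_{n+1}(\widetilde{\nabla}f(\xi_{n+1},x_{n})+L(\xi_{n+1})^{T}\lambda_{n})\big),\\ \lambda_{n+1} &= \operatorname{prox}_{\gamma_{n+1}p(\xi_{n+1},\cdot)}\big(\lambda_{n}+\gamma_{n+1}L(\xi_{n+1})x_{n}\big).\end{aligned} \tag{4}$$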

The convergence of Algorithm (4) is stated by the next theorem in terms of weighted averaged estimates

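$$\bar{x}_{n}=\frac{\sum_{k=1}^{n}\gamma_{k}x_{k}}{\sum_{k=1}^{n}\gamma_{k}},\quad\text{and}\quad \bar{\lambda}_{n}=\frac{\sum_{k=1}^{n}\gamma_{k}\lambda_{k}}{\sum_{k=1}^{n}\gamma_{k}}.$$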
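The iterations of Algorithm (4) and the weighted averages can be sketched in Python as follows. This is only an illustrative sketch: the oracle names (`sample`, `grad_f`, `prox_g`, `prox_p`, `lin_op`, `step`) are hypothetical placeholders to be supplied by the user, and the toy problem at the end (a quadratic $f$, ${\mathsf{G}}=0$, and the deterministic constraint $x_1+x_2=1$) is our own choice, not taken from the paper.

```python
import numpy as np

def stochastic_primal_dual(sample, grad_f, prox_g, prox_p, lin_op,
                           x0, lam0, step, n_iter):
    """Run the primal-dual iterations and return step-size-weighted averages.

    All callables are hypothetical user-supplied oracles:
      sample()              -> one draw xi of the random variable
      grad_f(xi, x)         -> a (sub)gradient of f(xi, .) at x
      prox_g(xi, gamma, v)  -> prox of gamma * g(xi, .) at v
      prox_p(xi, gamma, v)  -> prox of gamma * p(xi, .) at v
      lin_op(xi)            -> the matrix L(xi)
      step(n)               -> the step size gamma_n
    """
    x, lam = np.asarray(x0, float), np.asarray(lam0, float)
    x_bar, lam_bar, weight = np.zeros_like(x), np.zeros_like(lam), 0.0
    for n in range(1, n_iter + 1):
        gamma, xi = step(n), sample()
        L = lin_op(xi)
        # Both updates use the previous pair (x_n, lambda_n).
        x_new = prox_g(xi, gamma, x - gamma * (grad_f(xi, x) + L.T @ lam))
        lam_new = prox_p(xi, gamma, lam + gamma * (L @ x))
        x, lam = x_new, lam_new
        # Running averages with weights gamma_k.
        weight += gamma
        x_bar += (gamma / weight) * (x - x_bar)
        lam_bar += (gamma / weight) * (lam - lam_bar)
    return x_bar, lam_bar

# Toy illustration (our assumption): f(xi, x) = 0.5 * ||x - xi||^2 with
# xi ~ N(0, I), G = 0, and p(xi, lam) = lam * c with c = 1.
rng = np.random.default_rng(0)
x_bar, lam_bar = stochastic_primal_dual(
    sample=lambda: rng.standard_normal(2),
    grad_f=lambda xi, x: x - xi,
    prox_g=lambda xi, gamma, v: v,                # G = 0: identity
    prox_p=lambda xi, gamma, v: v - gamma * 1.0,  # prox of gamma * <., c>
    lin_op=lambda xi: np.ones((1, 2)),
    x0=np.zeros(2), lam0=np.zeros(1),
    step=lambda n: 0.5 * n ** -0.75, n_iter=2000,
)
print(x_bar, lam_bar)
```

For this toy problem the saddle point is $x_\star=(1/2,1/2)$, $\lambda_\star=-1/2$, so the printed averages can be compared against these values; how close they get after a given number of iterations depends on the step-size choice.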

Theorem 2.1

Consider Problem (1), and let the following assumptions hold.

1. The step size sequence satisfies $(\gamma_{n})\in\ell^{2}\setminus\ell^{1}$, and $\gamma_{n+1}/\gamma_{n}\rightarrow 1$ as $n\to\infty$.

2. The function ${\mathsf{G}}$ satisfies $\partial{\mathsf{G}}(x)={{\mathbb{E}}}_{\mu}\partial g(\cdot,x)$ for each $x\in{\mathcal{X}}$.

3. There exists an integer $m\geq 2$ that satisfies the following conditions:

• The function $L$ is in ${\mathscr{L}}^{2m}(\mu)$.

• There exists a point $(x_{\star},\lambda_{\star})\in{{\mathcal{S}}}$, and three functions $\varphi_{f}\in{\mathfrak{S}}^{2m}_{\partial f(\cdot,x_{\star})}$, $\varphi_{g}\in{\mathfrak{S}}^{2m}_{\partial g(\cdot,x_{\star})}$, and $\varphi_{p}\in{\mathfrak{S}}^{2m}_{\partial p(\cdot,\lambda_{\star})}$ such that

$$\mathbb{E}_{\mu}\varphi_{f}+\mathbb{E}_{\mu}\varphi_{g}+\mathsf{L}^{T}\lambda_{\star}=0,\quad\text{and}\quad -\mathsf{L}x_{\star}+\mathbb{E}_{\mu}\varphi_{p}=0. \tag{5}$$

Moreover, for every point $(x_{\star},\lambda_{\star})\in{{\mathcal{S}}}$, there exist three functions $\varphi_{f}\in{\mathfrak{S}}^{2}_{\partial f(\cdot,x_{\star})}$, $\varphi_{g}\in{\mathfrak{S}}^{2}_{\partial g(\cdot,x_{\star})}$, and $\varphi_{p}\in{\mathfrak{S}}^{2}_{\partial p(\cdot,\lambda_{\star})}$ such that (5) holds.

4. For any compact set $K\subset\operatorname{dom}\partial{\mathsf{G}}$, there exist $\varepsilon\in(0,1]$ and $x_{0}\in\operatorname{dom}\partial{\mathsf{G}}$ such that

$$\sup_{x\in K}\mathbb{E}_{\mu}\|\partial_{0}g(\cdot,x)\|^{1+\varepsilon}<+\infty,\quad\text{and}\quad \mathbb{E}_{\mu}\|\partial_{0}g(\cdot,x_{0})\|^{1+1/\varepsilon}<+\infty.$$

5. There exists a measurable $\Xi\to{{\mathbb{R}}}_{+}$ function $\beta$ such that $\beta^{2m}$ is $\mu$-integrable, where $m$ is the integer provided by Assumption 3, and such that for all $x\in{\mathcal{X}}$,

$$\|\widetilde{\nabla}f(s,x)\|\leq\beta(s)(1+\|x\|).$$

Moreover, there exists a constant $C>0$ such that ${{\mathbb{E}}}_{\mu}\|{{\widetilde{\nabla}}}f(\cdot,x)\|^{4}\leq C(1+\|x\|^{2m})$.

6. Writing $D_{\partial g}(s)=\operatorname{dom}\partial g(s,\cdot)$, there exists $C>0$ such that for all $x\in{\mathcal{X}}$,

$$\mathbb{E}_{\mu}\operatorname{dist}(x,D_{\partial g}(\cdot))^{2}\geq C\operatorname{dist}(x,\operatorname{dom}\partial\mathsf{G})^{2}.$$

7. There exists $C>0$ such that for any $x\in{\mathcal{X}}$ and any $\gamma>0$,

$$\int\|\operatorname{prox}_{\gamma g(s,\cdot)}(x)-\Pi_{g}(s,x)\|^{4}\,\mu(ds)\leq C\gamma^{4}(1+\|x\|^{2m}),$$

where $\Pi_{g}(s,\cdot)$ is the projection operator onto $\operatorname{cl}(\operatorname{dom}\partial g(s,\cdot))$, and where $m$ is the integer provided by Assumption 3.

8. Assumptions 2, 4, 6 and 7 hold true when the function $g$ is replaced with $p$ and the space ${\mathcal{X}}$ is replaced with ${\mathcal{V}}$.

Then, the sequence $(x_{n},\lambda_{n})$ is bounded in ${\mathscr{L}}^{2m}(\Omega)$ and the sequence $(\bar{x}_{n},\bar{\lambda}_{n})$ converges almost surely (a.s.) to a random variable $(X,\Lambda)$ supported by ${{\mathcal{S}}}$ .
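As an illustration of the step-size condition of the theorem, polynomially decaying steps provide a standard family of admissible choices. The exponent $a$ below is our own notation, not the paper's. For $\gamma_{n}=n^{-a}$ with $a\in(1/2,1]$,

$$\sum_{n\geq 1}\gamma_{n}=\sum_{n\geq 1}n^{-a}=\infty\ \ (a\leq 1),\qquad \sum_{n\geq 1}\gamma_{n}^{2}=\sum_{n\geq 1}n^{-2a}<\infty\ \ (2a>1),\qquad \frac{\gamma_{n+1}}{\gamma_{n}}=\Big(\frac{n}{n+1}\Big)^{a}\xrightarrow[n\to\infty]{}1,$$

so that $(\gamma_{n})\in\ell^{2}\setminus\ell^{1}$ and $\gamma_{n+1}/\gamma_{n}\to 1$. In contrast, a constant step or $\gamma_{n}=n^{-1/2}$ fails the square-summability requirement.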

Let us now discuss our assumptions. Assumption 1 is standard in the decreasing step case. Assumption 2 requires that the expectation ${{\mathbb{E}}}_{\mu}g(\cdot,x)$ and the subdifferentiation can be interchanged. Let us provide some sufficient conditions for this to be true. By [18], this will be the case if the following conditions hold: i) the set-valued mapping $s\mapsto\operatorname{cl}\operatorname{dom}g(s,\cdot)$ is constant $\mu$-a.e., where $\operatorname{dom}g(s,\cdot)$ is the domain of $g(s,\cdot)$, ii) ${\mathsf{G}}(x)<\infty$ whenever $x\in\operatorname{dom}g(s,\cdot)$ $\mu$-a.e., iii) there exists $x_{0}\in{\mathcal{X}}$ at which ${\mathsf{G}}$ is finite and continuous. Another case of practical importance where this interchange is permitted is the following. Let $m$ be a positive integer, and let ${\mathcal{C}}_{1},\ldots,{\mathcal{C}}_{m}$ be a collection of closed and convex subsets of ${\mathcal{X}}$. Let ${\mathcal{C}}=\cap_{k=1}^{m}{\mathcal{C}}_{k}$ be nonempty, and assume that the normal cone $N_{{\mathcal{C}}}(x)$ of ${\mathcal{C}}$ at $x$ satisfies the identity $N_{{\mathcal{C}}}(x)=\sum_{k=1}^{m}N_{{\mathcal{C}}_{k}}(x)$ for each $x\in{\mathcal{X}}$, where the summation is the usual set summation. As is well known, this identity holds true under a qualification condition of the type $\cap_{k=1}^{m}\operatorname{ri}{\mathcal{C}}_{k}\neq\emptyset$ (see also [3] for other conditions). Now, assume that $\Xi=\{1,\ldots,m\}$ and that $\mu$ is an arbitrary probability measure putting a positive weight on each $\{k\}\subset\Xi$. Let $g(s,x)$ be the indicator function

$$g(s,x)=\iota_{\mathcal{C}_{s}}(x)\quad\text{for } (s,x)\in\Xi\times\mathcal{X}. \tag{6}$$

Then it is obvious that $g$ is a convex normal integrand, ${\mathsf{G}}=\iota_{{\mathcal{C}}}$ , and $\partial{\mathsf{G}}(x)={{\mathbb{E}}}_{\mu}\partial g(\cdot,x)$ . We can also combine these two types of conditions: let $(\Sigma,{\mathscr{T}},\nu)$ be a probability space, where ${\mathscr{T}}$ is $\nu$ -complete, and let $h:\Sigma\times{\mathcal{X}}\to(-\infty,\infty]$ be a convex normal integrand satisfying the conditions i)–iii) above. Consider the closed and convex sets ${\mathcal{C}}_{1},\ldots,{\mathcal{C}}_{m}$ introduced above, and let $\alpha$ be a probability measure on the set $\{0,\ldots,m\}$ such that $\alpha(\{k\})>0$ for each $k\in\{0,\ldots,m\}$ . Now, set $\Xi=\Sigma\times\{0,\ldots,m\}$ , $\mu=\nu\otimes\alpha$ , and define $g:\Xi\times{\mathcal{X}}\to(-\infty,\infty]$ as

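$$g(s,x)=\begin{cases}\alpha(0)^{-1}h(u,x) & \text{if } k=0,\\ \iota_{\mathcal{C}_{k}}(x) & \text{otherwise},\end{cases}$$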

where $s=(u,k)\in\Sigma\times\{0,\ldots,m\}$ . Then it is clear that

$$\mathsf{G}(x)=\frac{1}{\alpha(0)}\int_{\Sigma}h(u,x)\,\nu(du)+\iota_{\mathcal{C}}(x),$$

and

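$$\partial\mathsf{G}(x)=\mathbb{E}_{\mu}\partial g(\cdot,x)=\frac{1}{\alpha(0)}\mathbb{E}_{\nu}\partial h(\cdot,x)+\sum_{k=1}^{m}N_{\mathcal{C}_{k}}(x).$$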

Assumption 3 is a moment assumption that is generally easy to check. Note that this assumption requires the set of saddle points ${{\mathcal{S}}}$ to be nonempty. Notice the relation between Equations (5) and the two inclusions in (3). Focusing on the first inclusion and using Assumption 2, there exist $a\in\partial{\mathsf{F}}(x_{\star})={{\mathbb{E}}}_{\mu}\partial f(\cdot,x_{\star})$ and $b\in\partial{\mathsf{G}}(x_{\star})={{\mathbb{E}}}_{\mu}\partial g(\cdot,x_{\star})$ such that $0=a+b+{\mathsf{L}}^{T}\lambda_{\star}$. Then, Assumption 3 states that $a$ and $b$ can be taken in such a way that there are two measurable selections $\varphi_{f}$ and $\varphi_{g}$ of $\partial f(\cdot,x_{\star})$ and $\partial g(\cdot,x_{\star})$ respectively which are both in ${\mathscr{L}}^{2m}(\mu)$ and which satisfy $a={{\mathbb{E}}}_{\mu}\varphi_{f}$ and $b={{\mathbb{E}}}_{\mu}\varphi_{g}$. A sufficient condition for the existence of the selections satisfying Assumption 3 is the following [8]: there exists an open neighborhood $\mathcal{N}_{x}$ of $x_{\star}$ and an open neighborhood $\mathcal{N}_{\lambda}$ of $\lambda_{\star}$ such that $\forall x\in\mathcal{N}_{x}$, $\int f(s,x)^{2m}\mu(ds)<\infty$ and $\int g(s,x)^{2m}\mu(ds)<\infty$, and $\forall\lambda\in\mathcal{N}_{\lambda}$, $\int p(s,\lambda)^{2m}\mu(ds)<\infty$. Note also that the larger $m$ is, the weaker Assumption 7 is.

Assumption 4 is relatively weak and easy to check. It is interesting to compare it with Assumption 5. It is indeed much weaker than the latter, which requires that the growth of ${{\widetilde{\nabla}}}f(s,\cdot)$ be at most linear. This is due to the fact that $g$ and $p$ enter Algorithm (4) through the proximity operator, while the function $f$ is used explicitly in this algorithm (through its (sub)gradient). This use of the function $f$ is reminiscent of the well-known Robbins-Monro algorithm, where a linear growth condition is needed to ensure the stability of the algorithm. Note that Assumption 5 is satisfied under the more restrictive assumption that $\nabla f(s,\cdot)$ is $L$-Lipschitz continuous, without any bounded gradient assumption.

Assumption 6 is quite weak, and is studied, e.g., in [15]. This assumption is easy to illustrate in the case where $g(s,x)=\iota_{{\mathcal{C}}_{s}}(x)$ as in (6). Following [3], we say that the subsets $({\mathcal{C}}_{1},\dots,{\mathcal{C}}_{m})$ are linearly regular if there exists $C>0$ such that for every $x$,

$$\max_{i=1,\dots,m}\operatorname{dist}(x,\mathcal{C}_{i})\geq C\operatorname{dist}(x,\mathcal{C}).$$

Sufficient conditions for a collection of sets to satisfy the above condition can be found in [3] and the references therein. Note that this condition implies that $N_{{\mathcal{C}}}(x)=\sum_{i=1}^{m}N_{{\mathcal{C}}_{i}}(x)$ . Let us finally discuss Assumption 7. As $\gamma\to 0$ , it is known that $\operatorname{prox}_{\gamma g(s,\cdot)}(x)$ converges to $\Pi_{g}(s,x)$ for every $(s,x)$ . Assumption 7 provides a control on the convergence rate. This assumption holds under the sufficient condition that for $\mu$ -almost every $s$ and for every $x\in\operatorname{dom}\partial g(s,\cdot)$ ,

$$\|\partial_{0}g(s,x)\|\leq\beta(s)(1+\|x\|^{m/2}),$$

where $\beta$ is a positive random variable with a finite fourth moment [5].
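A one-dimensional worked example may make Assumption 7 and this sufficient condition concrete. The integrand below is our own illustration, not taken from the paper. Take ${\mathcal{X}}={{\mathbb{R}}}$ and $g(s,x)=\beta(s)|x|$ with ${{\mathbb{E}}}_{\mu}\beta^{4}<\infty$. Then $\operatorname{dom}\partial g(s,\cdot)={{\mathbb{R}}}$, so $\Pi_{g}(s,x)=x$, and the proximity operator is the soft-thresholding map:

$$\operatorname{prox}_{\gamma g(s,\cdot)}(x)=\operatorname{sign}(x)\max(|x|-\gamma\beta(s),0),\qquad |\operatorname{prox}_{\gamma g(s,\cdot)}(x)-\Pi_{g}(s,x)|=\min(|x|,\gamma\beta(s))\leq\gamma\beta(s).$$

Raising to the fourth power and integrating gives

$$\int|\operatorname{prox}_{\gamma g(s,\cdot)}(x)-\Pi_{g}(s,x)|^{4}\,\mu(ds)\leq\gamma^{4}\,\mathbb{E}_{\mu}\beta^{4}\leq C\gamma^{4}(1+|x|^{2m})\quad\text{with } C=\mathbb{E}_{\mu}\beta^{4},$$

which is the bound required by Assumption 7. Consistently, $|\partial_{0}g(s,x)|\leq\beta(s)\leq\beta(s)(1+|x|^{m/2})$, so the sufficient condition above also holds.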

We now consider an application example of Theorem 2.1.

Example 1

Let ${\mathsf{c}}\in{\mathcal{V}}$ . Setting ${\mathsf{H}}=\iota_{\{{\mathsf{c}}\}}$ , where $\iota_{{\mathcal{C}}}$ is the indicator function of the set ${\mathcal{C}}$ , Problem (1) boils down to the linearly constrained problem

$$\min_{x\in\mathcal{X}}\mathsf{F}(x)+\mathsf{G}(x)\quad\text{s.t.}\quad \mathsf{L}x=\mathsf{c}. \tag{7}$$

If we assume that ${\mathsf{c}}={{\mathbb{E}}}_{\mu}(c(\cdot))$ where $c(\cdot):\Xi\rightarrow{\mathcal{V}}$ is a random vector, then our problem amounts to randomizing the constraints and handling these stochastic constraints online. Such a context is encountered in various fields of machine learning, such as Neyman-Pearson classification, or in online Markowitz portfolio optimization.

Since ${\mathsf{H}}^{\star}(\lambda)=\langle\lambda,{\mathsf{c}}\rangle$ , we simply need to put $p(\cdot,\lambda)=\langle\lambda,c(\cdot)\rangle$ , and Algorithm (4) becomes:

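$$\begin{aligned} x_{n+1} &= \operatorname{prox}_{\gamma_{n+1}g(\xi_{n+1},\cdot)}\big(x_{n}-\gamma_{n+1}(\widetilde{\nabla}f(\xi_{n+1},x_{n})+L(\xi_{n+1})^{T}\lambda_{n})\big),\\ \lambda_{n+1} &= \lambda_{n}+\gamma_{n+1}\big(L(\xi_{n+1})x_{n}-c(\xi_{n+1})\big).\end{aligned}$$

The dual update follows from the identity $\operatorname{prox}_{\gamma p(s,\cdot)}(\lambda)=\lambda-\gamma c(s)$, which holds because $p(s,\cdot)$ is linear.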

To go further, let us particularize Problem (7) to the case of Markowitz portfolio optimization, and check the assumptions of Theorem 2.1. In this case, $\xi$ is an ${\mathcal{X}}$-valued random variable with a finite second moment, ${\mathsf{F}}(x)={{\mathbb{E}}}_{\mu}\langle x,\xi\rangle^{2}$, ${\mathsf{G}}(x)=\iota_{\Delta}(x)$ where $\Delta$ is the probability simplex, ${\mathsf{L}}={{\mathbb{E}}}_{\mu}(\xi^{T})$, and ${\mathsf{c}}$ is some positive real number. Note that it is usually assumed that ${\mathsf{L}}={{\mathbb{E}}}_{\mu}(\xi^{T})$ is fully known or estimated, which we do not assume here. We of course assume that the qualification condition ${\mathsf{c}}\in\operatorname{ri}{\mathsf{L}}\Delta$ holds.

Assumptions 2 and 4 of Theorem 2.1 are immediate for both $g$ and $p$. One can check that Assumption 3 is satisfied for $m=2$ if we assume that ${{\mathbb{E}}}_{\mu}\|\xi\|^{4}<\infty$, which also ensures that Assumption 5 holds. Assumptions 6 and 7 are trivially satisfied for $g$ and $p$, since $\operatorname{prox}_{\gamma g(s,\cdot)}=\Pi_{g}(s,\cdot)$, and since $p(s,\cdot)$ has a full domain.
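The Markowitz instance of Example 1 can be sketched in Python as follows. Several choices here are our own assumptions rather than the paper's: the i.i.d. draws of $\xi$ are resampled from the rows of a hypothetical `returns` array, the target ${\mathsf{c}}$ is a known constant (so $c(\cdot)\equiv{\mathsf{c}}$), the steps are $\gamma_{n}=n^{-a}$, and the projection onto the simplex uses the classical sort-based procedure.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def markowitz_primal_dual(returns, target, n_iter, rng, a=0.75):
    """Weighted-average iterates for min E<x, xi>^2 s.t. x in simplex, E<xi, x> = target.

    `returns` (n_obs x d array) and `target` are hypothetical inputs.
    """
    n_obs, d = returns.shape
    x, lam = np.full(d, 1.0 / d), 0.0
    x_bar, lam_bar, weight = np.zeros(d), 0.0, 0.0
    for n in range(1, n_iter + 1):
        gamma = n ** (-a)
        xi = returns[rng.integers(n_obs)]   # i.i.d. draw of xi_{n+1}
        r = xi @ x                          # L(xi) x_n = <xi, x_n>
        # Gradient of f(xi, x) = <x, xi>^2 is 2 <x, xi> xi; L(xi)^T lam = lam * xi.
        x_new = project_simplex(x - gamma * (2.0 * r + lam) * xi)
        lam = lam + gamma * (r - target)    # dual step uses x_n, not x_{n+1}
        x = x_new
        weight += gamma
        x_bar += (gamma / weight) * (x - x_bar)
        lam_bar += (gamma / weight) * (lam - lam_bar)
    return x_bar, lam_bar

rng = np.random.default_rng(0)
returns = 0.01 + 0.05 * rng.standard_normal((500, 4))   # synthetic data
x_bar, lam_bar = markowitz_primal_dual(returns, target=0.01, n_iter=5000, rng=rng)
print(x_bar, lam_bar)
```

Since each $x_{n}$ lies in $\Delta$ and the averages are convex combinations, $\bar{x}_{n}$ remains a valid portfolio at every iteration, whereas the return constraint is only enforced asymptotically through the multiplier.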

3 Proof of Theorem 2.1

The proof of Theorem 2.1 makes use of the monotone operator theory. We begin by recalling some basic facts on monotone operators. All the results below can be found in [9, 4] without further mention.

A set-valued mapping ${\mathsf{A}}:{\mathcal{X}}\rightrightarrows{\mathcal{X}}$ on the Euclidean space ${\mathcal{X}}$ will be called herein an operator. An operator with singleton values is identified with a function. As above, the domain of ${\mathsf{A}}$ is $\operatorname{dom}({\mathsf{A}})=\{x\in{\mathcal{X}}\,:\,{\mathsf{A}}(x)\neq\emptyset\}$. The graph of ${\mathsf{A}}$ is $\operatorname{gr}({\mathsf{A}})=\{(x,y)\in{\mathcal{X}}\times{\mathcal{X}}\,:\,y\in{\mathsf{A}}(x)\}$. The operator ${\mathsf{A}}$ is said to be monotone if $\forall(x,y),(x^{\prime},y^{\prime})\in\operatorname{gr}({\mathsf{A}})$, $\langle y-y^{\prime},x-x^{\prime}\rangle\geq 0$. A monotone operator with nonempty domain is said to be maximal if $\operatorname{gr}({\mathsf{A}})$ is a maximal element for the inclusion ordering in the family of the monotone operator graphs. Let $I$ be the identity operator, and let ${\mathsf{A}}^{-1}$ be the inverse of ${\mathsf{A}}$, which is defined by the fact that $(x,y)\in\operatorname{gr}({\mathsf{A}}^{-1})\Leftrightarrow(y,x)\in\operatorname{gr}({\mathsf{A}})$. An operator ${\mathsf{A}}$ belongs to the set ${\mathscr{M}}({\mathcal{X}})$ of the maximal monotone operators on ${\mathcal{X}}$ if and only if for each $\gamma>0$, the so-called resolvent $(I+\gamma{\mathsf{A}})^{-1}$ is a contraction defined on the whole space ${\mathcal{X}}$. In particular, it is single-valued. A typical element of ${\mathscr{M}}({\mathcal{X}})$ is the subdifferential $\partial h$ of a function $h\in\Gamma_{0}({\mathcal{X}})$. In this case, the resolvent $(I+\gamma\partial h)^{-1}$ for $\gamma>0$ coincides with the proximity operator $\operatorname{prox}_{\gamma h}$. A skew-symmetric element of ${{\mathcal{L}}}({\mathcal{X}},{\mathcal{X}})$ can also be checked to be an element of ${\mathscr{M}}({\mathcal{X}})$.

The set of zeros of an operator ${\mathsf{A}}$ on ${\mathcal{X}}$ is the set $Z({\mathsf{A}})=\{x\in{\mathcal{X}}\,:\,0\in{\mathsf{A}}(x)\}$ . The sum of two operators ${\mathsf{A}}$ and ${\mathsf{B}}$ is the operator ${\mathsf{A}}+{\mathsf{B}}$ whose image at $x$ is the set sum of ${\mathsf{A}}(x)$ and ${\mathsf{B}}(x)$ . Given two operators ${\mathsf{A}},{\mathsf{B}}\in{\mathscr{M}}({\mathcal{X}})$ , where ${\mathsf{B}}$ is single-valued with domain ${\mathcal{X}}$ , the FB algorithm is an iterative algorithm for finding a point in $Z({\mathsf{A}}+{\mathsf{B}})$ . It reads

$$x_{n+1}=(I+\gamma\mathsf{A})^{-1}(x_{n}-\gamma\mathsf{B}(x_{n}))$$

where $\gamma$ is a positive step.
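As a concrete deterministic instance of the FB iteration, one may take ${\mathsf{A}}=\partial(\tau\|\cdot\|_{1})$, whose resolvent is the soft-thresholding map, and ${\mathsf{B}}=\nabla\big(\tfrac{1}{2}\|Mx-b\|^{2}\big)$. This lasso-type instance, the names `M`, `b`, `tau`, and the step $\gamma=1/\|M\|^{2}$ are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Resolvent of t * d||.||_1, i.e. prox of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def forward_backward_lasso(M, b, tau, n_iter=500):
    """Forward-backward iterations x <- (I + gamma A)^{-1}(x - gamma B(x))
    with A = d(tau * ||.||_1) and B(x) = M^T (M x - b)."""
    gamma = 1.0 / np.linalg.norm(M, 2) ** 2   # step below 2 / ||M||^2
    x = np.zeros(M.shape[1])
    for _ in range(n_iter):
        forward = x - gamma * M.T @ (M @ x - b)       # explicit step on B
        x = soft_threshold(forward, gamma * tau)      # implicit step on A
    return x

# With M = I the zero of A + B is the soft-thresholding of b at level tau.
print(forward_backward_lasso(np.eye(2), np.array([3.0, 0.5]), tau=1.0))
```

The stochastic algorithm studied here has the same two-step structure, with ${\mathsf{A}}$ and ${\mathsf{B}}$ replaced at each iteration by random realizations and with a decreasing step.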

In the sequel, we shall be interested in random elements of ${\mathscr{M}}({\mathcal{X}})$ as used in [5, 6, 8]. A random element of ${\mathscr{M}}({\mathcal{X}})$ is a measurable function $M:\Xi\to{\mathscr{M}}({\mathcal{X}})$ in the sense of [2], where $(\Xi,{\mathscr{G}},\mu)$ is the probability space introduced at the beginning of Section 2. In particular, when $h:\Xi\times{\mathcal{X}}\to(-\infty,\infty]$ is a convex normal integrand such that $h(s,\cdot)$ is proper $\mu$-a.e., $M(s)=\partial h(s,\cdot)$ is a random element of ${\mathscr{M}}({\mathcal{X}})$. Moreover, when $M(s)$ is a skew-symmetric element of ${{\mathcal{L}}}({\mathcal{X}},{\mathcal{X}})$ which is measurable in the usual sense (as a $\Xi\to{{\mathcal{L}}}({\mathcal{X}},{\mathcal{X}})$ function), then it is also a random element of ${\mathscr{M}}({\mathcal{X}})$. If we fix $x\in{\mathcal{X}}$ and we denote as $M(s,x)$ its image by $M(s)$, then the set-valued function $s\mapsto M(s,x)$ is measurable, and its (set-valued) expectation ${\mathsf{M}}(x)={{\mathbb{E}}}_{\mu}M(\cdot,x)$ is defined similarly to Equation (2) [2, 5, 6]. Note that ${\mathsf{M}}$ is monotone but not necessarily maximal.

We now turn to the proof of Theorem 2.1. Let us set ${\mathcal{Y}}={\mathcal{X}}\times{\mathcal{V}}$, and endow this Euclidean space with the standard scalar product. By writing $(x,\lambda)\in{\mathcal{Y}}$, it will be understood that $x\in{\mathcal{X}}$ and $\lambda\in{\mathcal{V}}$. For each $s\in\Xi$, define the set-valued operator $A(s)$ on ${\mathcal{Y}}$ as the operator that takes $(x,\lambda)$ to

$$A(s,(x,\lambda))=\begin{bmatrix}\partial g(s,x)\\ \partial p(s,\lambda)\end{bmatrix}.$$

Fixing $s\in\Xi$ , the operator $A(s,(x,\lambda))$ coincides with the subdifferential of the convex normal integrand $g(s,x)+p(s,\lambda)$ with respect to $(x,\lambda)$ . Thus, $A(s)$ is a random element of ${\mathscr{M}}({\mathcal{Y}})$ . Let us also define the operator $B(s)$ as

[TABLE]

We can write $B(s)=B_{1}(s)+B_{2}(s)$ , where

[TABLE]

($B_{2}(s)$ is a linear skew-symmetric operator written in matrix form in ${\mathcal{Y}}$). For each $s\in\Xi$, both of these operators belong to ${\mathscr{M}}({\mathcal{Y}})$, and $\operatorname{dom}B_{2}(s)={\mathcal{Y}}$. Thus, $B(s)\in{\mathscr{M}}({\mathcal{Y}})$ by [4, Cor. 24.4]. Moreover, since both $B_{1}$ and $B_{2}$ are measurable, $B$ is a random element of ${\mathscr{M}}({\mathcal{Y}})$.
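The monotonicity of a skew-symmetric linear operator $S$ follows from $\langle Sy,y\rangle=0$ for every $y$, which can be checked numerically. The sketch below assumes the block pattern $S=\begin{bmatrix}0&L^{T}\\-L&0\end{bmatrix}$, which is typical of primal-dual splittings; the matrix `L_mat` is a hypothetical example and is not claimed to be the paper's $L(s)$.

```python
def block_skew(L_mat):
    """Build S = [[0, L^T], [-L, 0]] for an m-by-n matrix given as a list of rows."""
    m, n = len(L_mat), len(L_mat[0])
    S = [[0.0] * (n + m) for _ in range(n + m)]
    for i in range(m):
        for j in range(n):
            S[j][n + i] = L_mat[i][j]    # top-right block: L^T
            S[n + i][j] = -L_mat[i][j]   # bottom-left block: -L
    return S


def matvec(S, y):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, y)) for row in S]


def inner(u, v):
    """Standard scalar product."""
    return sum(a * b for a, b in zip(u, v))


L_mat = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]   # hypothetical 3-by-2 matrix
S = block_skew(L_mat)

y1 = [1.0, -2.0, 0.5, 4.0, -1.0]   # (x, lambda) with x in R^2, lambda in R^3
y2 = [0.0, 3.0, -1.5, 2.0, 2.5]
diff = [a - b for a, b in zip(y1, y2)]

# Monotonicity: <S y1 - S y2, y1 - y2> >= 0; for skew-symmetric S it equals 0.
gap = inner(matvec(S, diff), diff)
print(gap)
```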

Since $f(\cdot,x)$ is Lebesgue-integrable for all $x\in{\mathcal{X}}$ by construction, it is known that $\partial{\mathsf{F}}(x)={{\mathbb{E}}}_{\mu}\partial f(\cdot,x)$ [18]. Moreover, $\partial{\mathsf{G}}(x)={{\mathbb{E}}}_{\mu}\partial g(\cdot,x)$ and $\partial{\mathsf{H}}^{\star}(\lambda)={{\mathbb{E}}}_{\mu}\partial p(\cdot,\lambda)$ by Assumptions 2 and 8. Thus, the operators ${\mathsf{A}}((x,\lambda))={{\mathbb{E}}}_{\mu}A(\cdot,(x,\lambda))$ and ${\mathsf{B}}((x,\lambda))={{\mathbb{E}}}_{\mu}B(\cdot,(x,\lambda))$ can be written as

[TABLE]

thus, these monotone operators are both maximal. By [4, Cor. 24.4], we also get that ${\mathsf{A}}+{\mathsf{B}}$ belongs to ${\mathscr{M}}({\mathcal{Y}})$. Moreover, recalling the system of inclusions (3), we also obtain that ${{\mathcal{S}}}=Z({\mathsf{A}}+{\mathsf{B}})$.
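The interchange $\partial{\mathsf{F}}(x)={{\mathbb{E}}}_{\mu}\partial f(\cdot,x)$ used above can be illustrated on a finite probability space, where the set-valued expectation is a weighted Minkowski sum of intervals. The sketch below uses the hypothetical integrand $f(s,x)=|x-s|$ with $s$ uniform on $\{-1,1\}$ and compares the expected subdifferential with the one-sided slopes of ${\mathsf{F}}(x)=\tfrac{1}{2}|x+1|+\tfrac{1}{2}|x-1|$. All names are illustrative.

```python
def subdiff_f(s: float, x: float):
    """Subdifferential of x -> |x - s| at x, returned as an interval (lo, hi)."""
    if x > s:
        return (1.0, 1.0)
    if x < s:
        return (-1.0, -1.0)
    return (-1.0, 1.0)


def expected_subdiff(atoms, x: float):
    """Set-valued expectation over a finite law: weighted sum of intervals."""
    lo = sum(w * subdiff_f(s, x)[0] for s, w in atoms)
    hi = sum(w * subdiff_f(s, x)[1] for s, w in atoms)
    return (lo, hi)


def subdiff_F(atoms, x: float, eps: float = 1e-6):
    """Subdifferential of F(x) = sum_s w_s |x - s| via one-sided slopes.

    F is piecewise linear, so the difference quotients are exact up to
    rounding as long as eps is smaller than the gap between kinks.
    """
    def F(t):
        return sum(w * abs(t - s) for s, w in atoms)
    left = (F(x) - F(x - eps)) / eps
    right = (F(x + eps) - F(x)) / eps
    return (left, right)


atoms = [(-1.0, 0.5), (1.0, 0.5)]   # (value of s, probability)

at_kink = (expected_subdiff(atoms, 1.0), subdiff_F(atoms, 1.0))
at_smooth = (expected_subdiff(atoms, 0.0), subdiff_F(atoms, 0.0))
print(at_kink, at_smooth)
```

At the kink $x=1$ both computations return an interval rather than a single slope, which is why the expectation has to be understood in the set-valued sense.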

Defining the function

[TABLE]

(obviously, $b(s,(x,\lambda))\in B(s,(x,\lambda))$ $\mu$ -a.e.), let us consider the following version of the FB algorithm

[TABLE]

On the one hand, one can easily check that this is exactly Algorithm (4). On the other hand, this algorithm is an instance of the random FB algorithm studied in [6]. By checking the assumptions of Theorem 2.1 one by one, one sees that the assumptions of [6, Th. 3.1 and Cor. 3.1] are verified. Theorem 2.1 follows.
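To make the structure of this random FB iteration concrete, here is a toy sketch under explicit assumptions; it is not a transcription of Algorithm (4). It assumes that the backward step is $(\operatorname{prox}_{\gamma g(s,\cdot)},\operatorname{prox}_{\gamma p(s,\cdot)})$, that the forward step uses $b(s,(x,\lambda))=(\nabla f(s,x)+L(s)^{T}\lambda,\,-L(s)x)$, that the steps are $\gamma_{n}=0.5/(n+1)^{0.6}$, and that the averaged sequence is weighted by the steps. The scalar instance is hypothetical: $f(s,x)=\tfrac{1}{2}(x-s)^{2}$, $g=0$, $L=1$ and $p(s,\lambda)=c(s)\lambda$, which corresponds to minimizing $\tfrac{1}{2}{\mathbb{E}}(x-\xi)^{2}$ under the constraint $x={\mathbb{E}}[c]$. With ${\mathbb{E}}[\xi]=1$ and ${\mathbb{E}}[c]=2$, the saddle point is $(x,\lambda)=(2,-1)$. The assumptions of Theorem 2.1 are not verified here.

```python
import random


def stochastic_primal_dual(n_iter=50000, seed=0, mean_s=1.0, mean_c=2.0):
    """Toy random FB iteration on (x, lambda); returns step-weighted averages."""
    rng = random.Random(seed)
    x, lam = 0.0, 0.0
    sum_w = sum_x = sum_lam = 0.0
    for n in range(n_iter):
        gamma = 0.5 / (n + 1) ** 0.6              # assumed step sequence
        s = mean_s + rng.uniform(-0.5, 0.5)        # i.i.d. sample entering f
        c = mean_c + rng.uniform(-0.5, 0.5)        # i.i.d. sample entering p
        # Forward step with b = (grad f(s, x) + lam, -x), since L = 1.
        x_fwd = x - gamma * ((x - s) + lam)
        lam_fwd = lam + gamma * x
        # Backward step: prox of g = 0 is the identity,
        # prox of gamma * c * lambda is a shift by -gamma * c.
        x, lam = x_fwd, lam_fwd - gamma * c
        # Step-weighted averaging (an assumption about the averaging used).
        sum_w += gamma
        sum_x += gamma * x
        sum_lam += gamma * lam
    return sum_x / sum_w, sum_lam / sum_w


x_bar, lam_bar = stochastic_primal_dual()
print(x_bar, lam_bar)
```

Both updates use the previous primal iterate, which is what makes the scheme a single FB step on the product space rather than an alternating method.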

Remark 1

The convergence stated by Theorem 2.1 concerns the averaged sequence $(\bar{x}_{n},\bar{\lambda}_{n})$ . One can ask whether the sequence $(x_{n},\lambda_{n})$ itself converges to ${{\mathcal{S}}}$ . This would happen if the operator ${\mathsf{A}}+{\mathsf{B}}$ were so-called demipositive [6]. This happens when, e.g., ${\mathsf{F}}+{\mathsf{G}}$ is strongly convex and ${\mathsf{H}}$ is smooth (proof omitted). Unfortunately, demipositivity of ${\mathsf{A}}+{\mathsf{B}}$ is not always guaranteed.

Bibliography

[1] Y. F. Atchadé, G. Fort, and E. Moulines. On perturbed proximal gradient algorithms. Journal of Machine Learning Research, 18(1):310–342, 2017.
[2] H. Attouch. Familles d’opérateurs maximaux monotones et mesurabilité. Annali di Matematica Pura ed Applicata, 120(1):35–111, 1979.
[3] H. H. Bauschke, J. M. Borwein, and W. Li. Strong conical hull intersection property, bounded linear regularity, Jameson’s property (G), and error bounds in convex optimization. Mathematical Programming, 86(1):135–160, 1999.
[4] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York, 2011.
[5] P. Bianchi. Ergodic convergence of a stochastic proximal point algorithm. SIAM Journal on Optimization, 26(4):2235–2260, 2016.
[6] P. Bianchi and W. Hachem. Dynamical behavior of a stochastic forward-backward algorithm using random monotone operators. Journal of Optimization Theory and Applications, 171(1):90–120, 2016.
[7] P. Bianchi, W. Hachem, and F. Iutzeler. A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization. IEEE Transactions on Automatic Control, 61(10):2947–2957, Oct 2016.
[8] P. Bianchi, W. Hachem, and A. Salim. A constant step Forward-Backward algorithm involving random maximal monotone operators. Journal of Convex Analysis, 26(2):397–436, 2019.