Incremental constraint projection methods for monotone stochastic variational inequalities

Alfredo Iusem; Alejandro Jofré; Philip Thompson

arXiv:1703.00272·math.OC·March 3, 2017·Math. Oper. Res.


TL;DR

This paper introduces an incremental constraint projection method for stochastic variational inequalities with monotone operators, achieving convergence and rate guarantees suitable for large-scale, online, and distributed applications.

Contribution

It proposes a novel incremental projection approach combining stochastic approximation with constraint sampling, extending to weak-sharp and monotone cases with convergence rates.

Findings

01

Achieves $O(1/k)$ feasibility rate in mean squared distance.

02

Provides $O(1/\sqrt{k})$ solvability rate for weak-sharp cases.

03

Extends to distributed stochastic Nash games with near-optimal convergence.


Equations

$$\langle T(x^{*}),x-x^{*}\rangle\geq 0.$$

$$T(x)=\mathbb{E}[F(v,x)].$$

$$X=\cap_{i\in\mathcal{I}}X_{i},$$

$$x^{k+1}=\Pi[x^{k}-\alpha_{k}T(x^{k})],$$

$$z^{k}=\Pi[x^{k}-\alpha_{k}T(x^{k})],$$

$$x^{k+1}=\Pi[x^{k}-\alpha_{k}T(z^{k})],$$

$$X=X_{0}\cap(\cap_{i\in\mathcal{I}}X_{i}),$$

$$X_{i}=\{x\in\mathbb{R}^{n}:g_{i}(x)\leq 0\},$$

$$y^{k}=\Pi_{X_{0}}[x^{k}-\alpha_{k}\nabla f(x^{k})],$$

$$x^{k+1}=\Pi_{X_{0}}\left[y^{k}-\beta_{k}\frac{g^{+}_{\omega_{k}}(y^{k})}{\|d^{k}\|^{2}}d^{k}\right],$$

$$x^{k+1}=\Pi[x^{k}-\alpha_{k}F(v^{k},x^{k})],$$

$$\mathbb{E}[\|F(v,x)-T(x)\|^{2}]\leq\sigma^{2}.$$

$$x_{j}^{k+1}=\Pi_{X^{j}}[x_{j}^{k}-\alpha_{k,j}(F_{j}(v_{j}^{k},x^{k})+\epsilon_{k,j}x_{j}^{k})],$$

$$y^{k}=\Pi_{X_{0}}[x^{k}-\alpha_{k}(F(v^{k},x^{k})+\epsilon_{k}x^{k})],$$

$$\frac{x-\Pi_{X_{i}}(x)}{g_{i}(x)}=\frac{x-\Pi_{X_{i}}(x)}{\|x-\Pi_{X_{i}}(x)\|}\in\partial g_{i}(x),$$

$$x^{k+1}=\Pi_{X_{\omega_{k}}}[x^{k}-\alpha_{k}F(v^{k},x^{k})].$$

$$X^{j}=X_{0}^{j}\cap(\cap_{i\in\mathcal{I}_{j}}X_{i}^{j}),$$

$$X_{i}^{j}=\{x\in\mathbb{R}^{n_{j}}:g_{i}(j|x)\leq 0\},$$

$$\sum_{k=0}^{\infty}\frac{(\alpha_{k,\max}-\alpha_{k,\min})^{2}}{\alpha_{k,\min}\epsilon_{k,\min}}<\infty,$$

$$2\langle z,y-u\rangle\leq\|x-u\|^{2}-\|y-u\|^{2}-\|y-x\|^{2}.$$

$$\|x_{2}-x_{0}\|^{2}\leq\|x_{1}-x_{0}\|^{2}-2\alpha\langle x_{1}-x_{0},u\rangle+[1+\tau\beta(2-\beta)]\alpha^{2}\|u\|^{2}-\frac{\beta(2-\beta)}{C_{g}^{2}}\left(1-\frac{1}{\tau}\right)(g^{+}(x_{1}))^{2}.$$

$$N_{X}(x)=\{v\in\mathbb{R}^{n}:\langle v,y-x\rangle\leq 0,\ \forall y\in X\},$$

$$T_{X}(x)=\{d\in\mathbb{R}^{n}:\exists t_{k}>0,\ \exists d^{k}\in\mathbb{R}^{n},\ \forall k\in\mathbb{N},\ x+t_{k}d^{k}\in X,\ d^{k}\to d\}.$$

$$T_{X}(x)=\operatorname{cl}\{\alpha(y-x):\alpha>0,\ y\in X\}=[N_{X}(x)]^{\circ},$$


Full text

Incremental constraint projection methods for monotone stochastic variational inequalities

A. N. Iusem, Instituto Nacional de Matemática Pura e Aplicada (IMPA)

Alejandro Jofré, Center for Mathematical Modeling (CMM) & DIM

Philip Thompson, Instituto Nacional de Matemática Pura e Aplicada (IMPA)

Abstract

We consider stochastic variational inequalities with monotone operators defined as the expected value of a random operator. We assume the feasible set is the intersection of a large family of convex sets. We propose a method that combines stochastic approximation with incremental constraint projections, meaning that at each iteration a step similar to some variant of a deterministic projection method is taken after the random operator is sampled and a component of the intersection defining the feasible set is chosen at random. Such a sequential scheme is well suited for applications involving large data sets, online optimization and distributed learning. First, we assume that the variational inequality is weak-sharp. We provide asymptotic convergence, a feasibility rate of $O(1/k)$ in terms of the mean squared distance to the feasible set and a solvability rate of $O(1/\sqrt{k})$ (up to first order logarithmic terms) in terms of the mean distance to the solution set for a bounded or unbounded feasible set. Then, we assume just monotonicity of the operator and introduce an explicit iterative Tykhonov regularization to the method. We consider Cartesian variational inequalities so as to encompass the distributed solution of stochastic Nash games or multi-agent optimization problems under a limited coordination. We provide asymptotic convergence, a feasibility rate of $O(1/k)$ in terms of the mean squared distance to the feasible set and, in the case of a compact set, a near-optimal solvability convergence rate of $O\left(\frac{k^{\delta}\ln k}{\sqrt{k}}\right)$ in terms of the mean dual gap-function of the SVI for arbitrarily small $\delta>0$.

1 Introduction

The standard (deterministic) variational inequality problem, which we will denote as VI$(T,X)$ or simply VI, is defined as follows: given a closed and convex set $X\subset\mathbb{R}^{n}$ and a single-valued operator $T:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$, find $x^{*}\in X$ such that, for all $x\in X$,

$$\langle T(x^{*}),x-x^{*}\rangle\geq 0. \tag{1}$$

We shall denote by $X^{*}$ the solution set of VI$(T,X)$. The variational inequality problem includes many interesting special classes of variational problems with applications in economics, game theory and engineering. The basic prototype is smooth convex optimization, where $T$ is the gradient of a smooth function. Other classes of problems are posed as variational inequalities which are not equivalent to optimization problems, such as complementarity problems (with $X=\mathbb{R}^{n}_{+}$), systems of equations (with $X=\mathbb{R}^{n}$), saddle-point problems and many different classes of equilibrium problems.

In the stochastic case, we start with a measurable space $(\Xi,\mathcal{G})$ , a measurable (random) operator $F:\Xi\times\mathbb{R}^{n}\to\mathbb{R}^{n}$ and a random variable $v:\Omega\rightarrow\Xi$ defined on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ which induces an expectation $\mathbb{E}$ and a distribution $\mathbb{P}_{v}$ of $v$ . When no confusion arises, we use $v$ to also denote a random sample $v\in\Xi$ . We assume that for every $x\in\mathbb{R}^{n}$ , $F(v,x):\Omega\rightarrow\mathbb{R}^{n}$ is an integrable random vector. The solution criterion analyzed in this paper consists of solving VI( $T,X$ ) as defined by (1), where $T:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is the expected value of $F(v,\cdot)$ , i.e., for any $x\in\mathbb{R}^{n}$ ,

$$T(x)=\mathbb{E}[F(v,x)]. \tag{2}$$

Precisely, the definition of stochastic variational inequality problem (SVI) is:

Definition 1 (SVI).

Assuming that $T:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ is given by $T(x)=\mathbb{E}[F(v,x)]$ for all $x\in\mathbb{R}^{n}$, the SVI problem consists of finding $x^{*}\in X$ such that $\langle T(x^{*}),x-x^{*}\rangle\geq 0$ for all $x\in X$.

This formulation of the SVI is also called the expected value formulation. It goes back to Gürkan et al. [19], as a natural generalization of stochastic optimization (SP) problems. Recently, a more general definition of stochastic variational inequality was considered in Chen et al. [15], where the feasible set is also affected by randomness, that is, $X:\Xi\rightrightarrows\mathbb{R}^{n}$ is a random set-valued function.

Methods for the deterministic VI$(T,X)$ have been extensively studied (see Facchinei and Pang [17]). If $T$ is fully available then the SVI can be solved by these methods. As in the case of SP, the SVI in Definition 1 becomes very different from the deterministic setting when $T$ is not available. This is often the case in practice due to expensive computation of the expectation in (2), unavailability of $\mathbb{P}_{v}$ or the lack of a closed form for $F(v,\cdot)$. This requires sampling the random variable $v$ and the use of values of $F(\eta,x)$ given a sample $\eta$ of $v$ and a current point $x\in\mathbb{R}^{n}$ (a procedure often called a “stochastic oracle” call). In this context, there are two current methodologies for solving the SVI problem: sample average approximation (SAA) and stochastic approximation (SA). In this paper we focus on the SA approach.

The SA methodology for SP or SVI can be seen as a projection-type method where the exact mean operator $T$ is replaced along the iterations by a random sample of $F$. This approach induces a stochastic error $F(v,x)-T(x)$ for $x\in X$ along the trajectory of the method. When $X=\mathbb{R}^{n}$, Definition 1 becomes the stochastic equation problem (SE): under (2), almost surely find $x^{*}\in\mathbb{R}^{n}$ such that $T(x^{*})=0$. The SA methodology was first proposed by Robbins and Monro in [40] for the SE problem in the case in which $T$ is the gradient of a strongly convex function under specific conditions. Since this fundamental work, SA approaches to SP and, more recently, to SVI have been carried out in Jiang and Xu [23], Juditsky et al. [24], Yousefian et al. [46], Koshal et al. [29], Wang and Bertsekas [43], Chen et al. [14], Yousefian et al. [47], Kannan and Shanbhag [25], Yousefian et al. [45]. See Bach and Moulines [2] for the stochastic approximation procedure in machine learning and online optimization.

A frequent additional difficulty is the possibly complicated structure of the feasible set $X$ . Often, the feasible set takes the form

$$X=\cap_{i\in\mathcal{I}}X_{i},$$

where $\{X_{i}:i\in\mathcal{I}\}$ is an arbitrary family of closed convex sets. There are different motivations for considering the design of algorithms which, at every iteration, use only a component $X_{i}$ rather than the whole feasible set $X$. First, in the case of projection methods, when the orthogonal projection onto each $X_{i}$, namely $\Pi_{i}:\mathbb{R}^{n}\to X_{i}$, is much easier to compute than the projection onto $X$, namely $\Pi:\mathbb{R}^{n}\to X$, a natural idea consists of replacing, at iteration $k$, $\Pi$ by one of the $\Pi_{i}$’s, say $\Pi_{i_{k}}$, or even by an approximation of $\Pi_{i}$. This occurs, for instance, when $X$ is a polyhedron and the $X_{i}$’s are halfspaces. This procedure is the basis of the so-called sequential or parallel row action methods for solving systems of equations (see Censor [12]) and methods for the feasibility problem, useful in many applications, including image restoration and tomography (see, e.g., Bauschke et al. [5], Cegielski and Suchocka [11]). Second, in some cases $X$ is not known a priori, but is rather revealed through the random realizations of its components $X_{i}$. Such problems arise in fair rate allocation problems in wireless networks where the channel state is unknown but the channel states $X_{i}$ are observed in time (see e.g. Nedić [32] and Huang et al. [20]). Third, in some cases $X$ is known but the number of constraints is prohibitively large (e.g., in machine learning and signal processing).

1.1 Projection methods

In the deterministic setting (1), the classical projection method for VI $(T,X)$ , akin to the projected gradient method for convex optimization, is

$$x^{k+1}=\Pi[x^{k}-\alpha_{k}T(x^{k})], \tag{3}$$

where $\Pi$ is the projection operator onto $X$ and $\{\alpha_{k}\}$ is an exogenous sequence of positive stepsizes. Convergence of this method is guaranteed assuming $T$ is strongly monotone, Lipschitz continuous and the stepsizes satisfy $\alpha_{k}\in(0,2\sigma/L^{2})$ and $\inf_{k}\alpha_{k}>0$ , where $\sigma>0$ is the modulus of strong monotonicity and $L$ is the Lipschitz constant, see e.g. Facchinei and Pang [17].

The strong monotonicity assumption is quite demanding, and convergence of (3) is not guaranteed when the operator is just monotone. In order to deal with this situation, Korpelevich [28] proposed the extra-gradient algorithm

$$z^{k}=\Pi[x^{k}-\alpha_{k}T(x^{k})],\qquad x^{k+1}=\Pi[x^{k}-\alpha_{k}T(z^{k})],$$

in which an additional auxiliary projection step is introduced. Convergence of the method is guaranteed when the stepsizes satisfy $\alpha_{k}\equiv\alpha\in(0,1/L)$ . In Nemirovski [35], the extra-gradient method was generalized and convergence rates were established assuming compactness of the feasible set.

Observe that the projection method (3) and the extra-gradient method (1.1) are explicit, i.e., the formula for obtaining $x^{k+1}$ is an explicit one, up to the computation of the orthogonal projection $\Pi$. An implicit approach for the solution of monotone variational inequalities consists of a Tykhonov or proximal regularization scheme (see Facchinei and Pang [17], Chapter 12). In these methods, a sequence of regularized variational inequality problems is solved approximately, one per iteration.

As commented before, a typical case occurs when the feasible set takes the form $X=\cap_{i=1}^{m}X_{i}$, where all the $X_{i}$’s are closed and convex. Row action methods and alternate (or cyclic) projection algorithms for convex feasibility problems exploit the computation of projections onto the components iteratively (see Bauschke [3]). In such a case, the order in which the sets $X_{i}$ are used along the iterations, i.e. the so-called control sequence $\{\omega_{k}\}\subset\{1,\dots,m\}$, must be specified. Several options have been considered in the literature (such as cyclic control, almost cyclical control, most violated constraint control and random control). A negative consequence of the use of approximate projections is the need to use small stepsizes, i.e., satisfying $\sum_{k}\alpha_{k}^{2}<\infty$ and $\sum_{k}\alpha_{k}=\infty$, which significantly reduces the efficiency of the method. We thus have a trade-off between easier projection computation and slower convergence. Additionally, the use of approximate projections requires some condition on the feasible set, so that the projections onto the sets $X_{i}$ are reasonable approximations of the projection onto $X$. For this, some form of error bound, linear regularity or Slater-type conditions on the sets $X_{i}$ must be assumed (e.g., Assumption 5 in Subsection 3.2 and the comments following it). See Bauschke and Borwein [4], Deutsch and Hundal [16] and Pang [36]. Explicit methods for monotone variational inequalities using approximate projections were studied e.g. in Fukushima [18] and Censor and Gibali [13], imposing rather demanding coercivity assumptions on $T$, in Bello Cruz and Iusem [7] assuming paramonotonicity of $T$, and then in Bello Cruz and Iusem [8] assuming just monotonicity of $T$. Another method of this type, using an Armijo search as in Iusem and Svaiter [22] for determining the stepsizes, and approximate projections with the most violated constraint control, can be found in Bello Cruz and Iusem [6].

Related to row-action and alternate projection methods are the so-called incremental methods, introduced in Kibardin [27] (see also Luo and Tseng [30], Bertsekas [9], Nedić [32] and references therein). These methods are used for the minimization of a large sum of convex functions, e.g. in machine learning applications. In such a context, instead of using the gradient of the sum, the gradient of one of the terms is selected iteratively under different control rules. In Polyak [38], Polyak [39] and Nedić [32], incremental constraint methods with random control rules were proposed for minimizing a convex function over an intersection of a large number of convex sets. The feasible set takes the form

$$X=X_{0}\cap(\cap_{i\in\mathcal{I}}X_{i}), \tag{5}$$

where $\{X_{0}\}\cup\{X_{i}:i\in\mathcal{I}\}$ is a collection of closed and convex subsets of $\mathbb{R}^{n}$. The hard constraint $X_{0}$ is assumed to have easily computable projections. Each soft constraint $X_{i}$, $i\in\mathcal{I}$, has the form:

$$X_{i}=\{x\in\mathbb{R}^{n}:g_{i}(x)\leq 0\}, \tag{6}$$

for some convex function $g_{i}$ with positive part $g_{i}^{+}(x):=\max\{g_{i}(x),0\}$ and easily computable subgradients. The method in Nedić [32] is given by:

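$$y^{k}=\Pi_{X_{0}}[x^{k}-\alpha_{k}\nabla f(x^{k})], \tag{7}$$

$$x^{k+1}=\Pi_{X_{0}}\left[y^{k}-\beta_{k}\frac{g^{+}_{\omega_{k}}(y^{k})}{\|d^{k}\|^{2}}d^{k}\right], \tag{8}$$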

where $\{\alpha_{k},\beta_{k}\}$ are positive stepsizes, $d^{k}\in\partial g^{+}_{\omega_{k}}(y^{k})\setminus\{0\}$ if $g^{+}_{\omega_{k}}(y^{k})>0$, and $d^{k}=d$ for any $d\in\mathbb{R}^{n}\setminus\{0\}$ if $g^{+}_{\omega_{k}}(y^{k})=0$. In the method (7)-(8), $\{\omega_{k}\}$ is a random control sequence taking values in $\mathcal{I}$ and satisfying certain conditions and $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ is a convex smooth function (the non-smooth case is also analyzed). Together with row-action and alternate projection methods, incremental constraint projection methods can be viewed as the dual version of (standard) incremental methods. More recently, stochastic approximation was incorporated into incremental constraint projection methods for stochastic convex minimization problems in Wang and Bertsekas [44].
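The two-stage structure of (7)-(8) can be sketched on a small example. Everything specific here is an assumption made for illustration: the quadratic objective $f(x)=\frac{1}{2}\|x-c\|^{2}$, the box $X_{0}=[-5,5]^{2}$, three halfspace soft constraints $g_{i}(x)=a_{i}^{\top}x-b_{i}$, uniform random control, and the stepsizes $\alpha_{k}=1/(k+1)$, $\beta_{k}=1$. For this instance the constrained minimizer is $(0.75,0.75)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem data: minimize 0.5*||x - c||^2 over X_0 and three halfspaces.
c = np.array([2.0, 2.0])
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # rows a_i
b = np.array([1.0, 1.0, 1.5])                       # g_i(x) = a_i @ x - b_i <= 0

def grad_f(x):
    return x - c

def project_box(x, lo=-5.0, hi=5.0):
    """Projection onto the hard constraint X_0 (a box, assumed easy)."""
    return np.clip(x, lo, hi)

def incremental_constraint_projection(x0, iters, beta=1.0):
    x = x0.copy()
    for k in range(iters):
        alpha = 1.0 / (k + 1)
        # Stage 1: gradient step, projecting only onto the hard constraint.
        y = project_box(x - alpha * grad_f(x))
        # Stage 2: pick one soft constraint at random and take a subgradient step
        # on its positive part; the step vanishes when the constraint holds.
        i = rng.integers(len(b))
        violation = max(A[i] @ y - b[i], 0.0)
        d = A[i]  # subgradient of g_i^+ at y whenever g_i(y) > 0
        x = project_box(y - beta * violation / (d @ d) * d)
    return x

x_hat = incremental_constraint_projection(np.zeros(2), iters=5000)
print("final iterate:", x_hat)
```

Note that each iteration touches a single row of `A`, which is the point of the incremental scheme: the full intersection is never projected onto.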

1.2 Stochastic approximation methods

The first SA method for SVI was analyzed in Jiang and Xu [23]. Their method is:

$$x^{k+1}=\Pi[x^{k}-\alpha_{k}F(v^{k},x^{k})], \tag{9}$$

where $\Pi$ is the Euclidean projection onto $X$ , $\{v^{k}\}$ is a sample of $v$ and $\{\alpha_{k}\}$ is a sequence of positive steps. The a.s. convergence is proved assuming $L$ -Lipschitz continuity of $T$ , strong monotonicity or strict monotonicity of $T$ , stepsizes satisfying $\sum_{k}\alpha_{k}=\infty,\sum_{k}\alpha_{k}^{2}<\infty$ (with $0<\alpha_{k}<2\rho/L^{2}$ in the case where $T$ is $\rho$ -strongly monotone) and an unbiased oracle with uniform variance, i.e., there exists $\sigma>0$ such that for all $x\in X$ ,

$$\mathbb{E}[\|F(v,x)-T(x)\|^{2}]\leq\sigma^{2}.$$
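As a toy illustration of the SA iteration (9) under an unbiased oracle with uniformly bounded variance, consider the sketch below. The operator $T(x)=x-c$ (strongly monotone with modulus 1 and Lipschitz constant 1), the additive Gaussian noise, the box $X=[0,1]^{2}$ and the stepsize $\alpha_{k}=1/(k+1)$ are all assumptions chosen for the example. The solution of this VI is the projection of $c$ onto $X$, here $(1,0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
c = np.array([2.0, -1.0])

def F(v, x):
    """Hypothetical random operator with mean T(x) = x - c and noise v."""
    return x - c + v

def project_box(x):
    """Exact projection onto X = [0, 1]^2."""
    return np.clip(x, 0.0, 1.0)

def stochastic_approximation(x0, iters, sigma=1.0):
    x = x0.copy()
    for k in range(iters):
        alpha = 1.0 / (k + 1)  # sum alpha_k = inf, sum alpha_k^2 < inf
        v = sigma * rng.standard_normal(x.shape)  # zero-mean sample
        x = project_box(x - alpha * F(v, x))
    return x

x_sa = stochastic_approximation(np.array([0.5, 0.5]), iters=5000)
print("SA iterate:", x_sa)
```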

After the above mentioned work, recent research on SA methods for SVI has been developed in Juditsky et al. [24], Yousefian et al. [45, 46, 47], Koshal et al. [29], Chen et al. [14], Kannan and Shanbhag [25]. Two of the main concerns in these papers were the extension of the SA approach to the general monotone case and the derivation of (optimal) convergence rate and complexity results with respect to known metrics associated to the VI problem. In order to analyze the monotone case, SA methodologies based on the extragradient method of Korpelevich [28], the mirror-prox algorithm of Nemirovski [35] and iterative Tykhonov and proximal regularization procedures (see Kannan and Shanbhag [26]) were used in these works. Other objectives were the use of incremental constraint projections in the case of difficulties accessing the feasible set in Wang and Bertsekas [43], the convergence analysis in the absence of the Lipschitz constant in Yousefian et al. [45, 46, 47], and the distributed solution of Cartesian variational inequalities in Yousefian et al. [46], Koshal et al. [29].

We finally make some comments on two recent methods upon which we make substantial improvements.

In Wang and Bertsekas [43], method (9) is improved by incorporating incremental projections instead of exact ones. They take $X=\cap_{i\in\mathcal{I}}X_{i}$, where $\mathcal{I}$ is a finite index set, and use a random control sequence, where both the random map $F$ and the control sequence $\{\omega_{k}\}$ are jointly sampled, giving rise to the following algorithm:

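$$y^{k}=x^{k}-\alpha_{k}F(v^{k},x^{k}),\qquad x^{k+1}=y^{k}-\beta_{k}\left(y^{k}-\Pi_{X_{\omega_{k}}}(y^{k})\right),$$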

where $\{\alpha_{k},\beta_{k}\}$ are positive stepsizes and $\{v^{k}\}$ are samples. When $\beta_{k}\equiv 1$, the method is the version of method (9) with incremental constraint projections. For convergence, the operator is assumed to be strongly monotone and Lipschitz-continuous, and knowledge of the strong monotonicity and Lipschitz moduli is required for computing the stepsizes. In this setting, method (1.2) improves upon method (7)-(8) when $X_{0}=\mathbb{R}^{n}$, $\mathcal{I}$ is finite and the projection onto each $X_{i}$ is easy.

Regularized iterative Tykhonov and proximal point methods for monotone stochastic variational inequalities were introduced in Koshal et al. [29]. In such methods, instead of solving a sequence of regularized variational inequality problems, the regularization parameter is updated in each iteration and a single projection step associated with the regularized problem is taken. This is desirable since (differently from the deterministic case) termination criteria are generally hard to meet in the stochastic setting. The algorithm proposed allows for a Cartesian structure on the variational inequality, so as to encompass the distributed solution of Cartesian SVIs. Namely, the feasible set $X\subset\mathbb{R}^{n}$ has the form $X=X^{1}\times\cdots\times X^{m}$, where each Cartesian component $X^{j}\subset\mathbb{R}^{n_{j}}$ is a closed and convex set, $v=(v_{1},\ldots,v_{m})$ and the random operator has components $F=(F_{1}(v_{1},\cdot),\ldots,F_{m}(v_{m},\cdot))$ with $F_{j}(v_{j},\cdot):\Xi\times\mathbb{R}^{n}\rightarrow\mathbb{R}^{n_{j}}$ for $j=1,\ldots,m$ and $\sum_{j=1}^{m}n_{j}=n$. The algorithm in Koshal et al. [29] is described as follows. Given the $k$-th iterate $x^{k}\in X$ with components $x^{k}_{j}\in X^{j}$, for $j=1,\ldots,m$, the next iterate is given by the distributed projection computations: for $j=1,\ldots,m$,

$$x_{j}^{k+1}=\Pi_{X^{j}}[x_{j}^{k}-\alpha_{k,j}(F_{j}(v_{j}^{k},x^{k})+\epsilon_{k,j}x_{j}^{k})],$$

where $\{\alpha_{k,1},\ldots,\alpha_{k,m}\}$ are the stepsize sequences, $\{\epsilon_{k,1},\ldots,\epsilon_{k,m}\}$ are the regularization parameter sequences and $\{v_{1}^{k},\ldots,v_{m}^{k}\}$ are the samples. This method is shown to converge under monotonicity and Lipschitz-continuity of $T$ and a partial coordination between the stepsize and regularization parameter sequences (see Assumption 10). The iterative proximal point method follows a similar pattern but, differently from the Tykhonov method, it requires strict monotonicity, which in particular implies uniqueness of solutions. It should be mentioned that two important classes of problems which can be formulated as stochastic Cartesian variational inequalities are the stochastic Nash equilibria and the stochastic multi-user optimization problem; see Koshal et al. [29] for a precise definition. In these problems, the $i$-th agent has access only to its constraint set $X^{i}$ and $F_{i}$ (which depends on the other agents’ decisions), so that a distributed solution of the SVI is required. Moreover, it is convenient to allow agents to update their stepsizes and regularization sequences independently, subject only to a limited coordination.

1.3 Proposed methods and contributions

In many stochastic approximation methods, the stochastic error $\varsigma(x):=F(v,x)-T(x)$ is assumed to be bounded, which demands the use of small stepsizes and leads to slow performance. In this case, the use of easily computable approximate projections, instead of exact ones, can significantly improve the performance of the algorithm. Additionally, in many cases the constraint set $X$ is known, but it contains a very large number of constraints, or $X$ is not known a priori, but is rather learned over time through random samples of its constraints. An important feature of incremental constraint projection methods is that they process sample operators and sample constraints sequentially. This incremental structure is well suited for a variety of applications involving large data sets, online optimization and distributed learning. For problems that require online learning, incremental projection methods of the type (7)-(8) or (1.2) are practically the only option to be used without the knowledge of all the constraints.

In view of these considerations, we wish to devise methods which incorporate incremental constraint projections with stochastic approximation of the operator. There has been only one previous work on incremental projections for SVIs, namely Wang and Bertsekas [43]. In this work strong monotonicity of the operator and knowledge of the strong monotonicity and Lipschitz moduli were assumed. These are very demanding assumptions in practice and theory. Our first objective is to weaken such property to plain monotonicity without requiring knowledge of the Lipschitz constant. Our second objective is to use incremental constraint projections in distributed methods for multi-agent optimization and equilibrium problems arising in networks. Such joint analysis seems to be new (to the best of our knowledge, all previous works in distributed methods for such problems use exact projections). This objective is a non-trivial generalization of previous known distributed methods since, besides preserving the parallel computations of projections and the use of asynchronous agent’s parameters of such methods, we wish to allow each user to project inexactly over its decision set in a random fashion and without additional coordination.

Assuming the structures (5)-(6), in the centralized case ( $m=1$ ), we propose the following incremental constraint projection method:

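$$y^{k}=\Pi_{X_{0}}[x^{k}-\alpha_{k}(F(v^{k},x^{k})+\epsilon_{k}x^{k})], \tag{13}$$

$$x^{k+1}=\Pi_{X_{0}}\left[y^{k}-\beta_{k}\frac{g^{+}_{\omega_{k}}(y^{k})}{\|d^{k}\|^{2}}d^{k}\right], \tag{14}$$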

where $\{\alpha_{k},\beta_{k}\}$ are stepsize sequences, $\{\epsilon_{k}\}$ is the regularization parameter sequence, $\{v^{k}\}$ is the sample sequence, $\{\omega_{k}\}$ is the random control, and $d^{k}\in\partial g^{+}_{\omega_{k}}(y^{k})\setminus\{0\}$ if $g_{\omega_{k}}(y^{k})>0$ and $d^{k}=d$ for any $d\in\mathbb{R}^{n}\setminus\{0\}$ otherwise. We remark that the projection onto $X_{0}$ in (13) is dispensable if $\operatorname*{dom}(g_{i})=\mathbb{R}^{n}$ and $\{\partial g^{+}_{i}:i\in\mathcal{I}\}$ is uniformly bounded on $\mathbb{R}^{n}$, a condition satisfied, e.g., if the soft constraints have easily computable projections, as commented below (see Remark 1 in Subsection 2.1). The above incremental algorithm advances in such a way that the “operator step” and the “feasibility step” are updated in separate stages. In the first stage, given the current iterate $x^{k}$, the method advances in the direction of a sample $-F(v^{k},x^{k})$ of the random operator, producing an auxiliary iterate $y^{k}$. In this step, the hard constraint set $X_{0}$ is considered while the soft constraints $\{X_{i}:i\in\mathcal{I}\}$ are “ignored”. In the second stage, a soft constraint $X_{\omega_{k}}$ is randomly chosen with $\omega_{k}\in\mathcal{I}$, and the method advances in the direction opposite to a subgradient of $g^{+}_{\omega_{k}}$ at the point $y^{k}$, producing the next iterate $x^{k+1}$. Thus, the method exploits simultaneously the stochastic approximation of the random operator (in the first stage) and a randomization of the incremental selection of constraint projections (in the second stage). In Section 3, this method is analyzed with no regularization, i.e., $\epsilon_{k}\equiv 0$, and the monotone operator satisfies the weak sharpness property (see Section 2.3), while in Section 4 we consider the same method with positive regularization parameters, requiring just monotonicity of the operator.
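The two stages described above can be sketched as follows for a merely monotone operator. This is an illustration under stated assumptions, not the authors' implementation: the skew-symmetric operator with additive Gaussian noise, the box $X_{0}=[-2,2]^{2}$, the three halfspaces, and in particular the exponents in $\alpha_{k}=k^{-0.6}$ and $\epsilon_{k}=k^{-0.3}$ are hypothetical choices made only so that the example runs; the stepsize conditions actually required are those stated in the paper's assumptions. For this instance the origin is a solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical monotone (skew-symmetric, not strongly monotone) mean operator.
M = np.array([[0.0, 1.0], [-1.0, 0.0]])

def F(v, x):
    """Sampled operator: mean M x plus zero-mean noise v."""
    return M @ x + v

# Soft constraints g_i(x) = G[i] @ x - h[i] <= 0 (illustrative halfspaces).
G = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 0.0]])
h = np.ones(3)

def project_box(x, lo=-2.0, hi=2.0):
    """Projection onto the hard constraint X_0."""
    return np.clip(x, lo, hi)

def regularized_icp(x0, iters, a=0.6, b=0.3, beta=1.0, sigma=0.1):
    x = x0.copy()
    for k in range(1, iters + 1):
        alpha, eps = k ** (-a), k ** (-b)  # assumed exponents, for illustration
        v = sigma * rng.standard_normal(2)
        # Operator step with Tykhonov term eps * x; soft constraints ignored.
        y = project_box(x - alpha * (F(v, x) + eps * x))
        # Feasibility step on one randomly chosen soft constraint.
        i = rng.integers(len(h))
        violation = max(G[i] @ y - h[i], 0.0)
        d = G[i]  # subgradient of g_i^+ at y when g_i(y) > 0
        x = project_box(y - beta * violation / (d @ d) * d)
    return x

x_reg = regularized_icp(np.array([1.5, -1.5]), iters=20000)
print("distance to the solution 0:", np.linalg.norm(x_reg))
```

Setting `eps` to zero in this sketch recovers the unregularized variant, which for this skew-symmetric example is not expected to behave well, consistent with the need for either weak sharpness or regularization.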

We make some remarks to illustrate that the mentioned framework is very general. If, for $i\in\mathcal{I}$, the Euclidean projection onto $X_{i}$ is easy, then we can always construct a function with “easy” subgradients. Indeed, defining the function $g_{i}(x):=\operatorname{d}(x,X_{i})$ for $x\in\mathbb{R}^{n}$, then $g_{i}$ is convex, nonnegative and finite valued over $\mathbb{R}^{n}$, and for any $x\notin X_{i}$,

$$\frac{x-\Pi_{X_{i}}(x)}{g_{i}(x)}=\frac{x-\Pi_{X_{i}}(x)}{\|x-\Pi_{X_{i}}(x)\|}\in\partial g_{i}(x),$$

provides a subgradient which is easy to evaluate. Moreover, $\sup_{d\in\partial g_{i}(x)}\|d\|\leq 1$ for all $x\in\mathbb{R}^{n}$ . In this case, using the above directions as subgradients $d^{k}$ of $g^{+}_{\omega_{k}}$ at $y^{k}$ , method (13)-(14) can be rewritten as

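$$y^{k}=\Pi_{X_{0}}[x^{k}-\alpha_{k}(F(v^{k},x^{k})+\epsilon_{k}x^{k})],\qquad x^{k+1}=\Pi_{X_{0}}\left[y^{k}-\beta_{k}\left(y^{k}-\Pi_{X_{\omega_{k}}}(y^{k})\right)\right].$$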

If, additionally, $X_{0}=\mathbb{R}^{n}$ and $\beta_{k}\equiv 1$ then the method takes the more basic form

$$x^{k+1}=\Pi_{X_{\omega_{k}}}[x^{k}-\alpha_{k}F(v^{k},x^{k})].$$
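The reduction from the subgradient step to a relaxed projection step can be checked numerically. In the sketch below the set $X_{i}$ is taken to be the unit ball and the point and relaxation parameter are arbitrary; these are assumptions for illustration. With $g_{i}=\operatorname{d}(\cdot,X_{i})$ the subgradient $d$ has unit norm, so $\beta\,g_{i}^{+}(y)\,d/\|d\|^{2}$ collapses to $\beta\,(y-\Pi_{X_{i}}(y))$.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Projection onto X_i, here assumed to be the closed unit ball."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

y = np.array([3.0, 4.0])  # a point outside X_i
beta = 0.7                # arbitrary relaxation parameter

p = project_ball(y)
g = np.linalg.norm(y - p)  # g_i(y) = d(y, X_i) > 0
d = (y - p) / g            # subgradient of g_i at y, with unit norm

step_subgradient = y - beta * g / (d @ d) * d  # feasibility step with g_i^+ and d
step_projection = y - beta * (y - p)           # relaxed projection step

print(step_subgradient, step_projection)
```

With $\beta=1$ both expressions return the exact projection $\Pi_{X_{i}}(y)$, which is how the basic form above arises when $X_{0}=\mathbb{R}^{n}$.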

In Section 4, we analyse a distributed variant. In this setting, the feasible set $X\subset\mathbb{R}^{n}$ has the form $X=X^{1}\times\cdots\times X^{m},$ where each Cartesian component $X^{j}\subset\mathbb{R}^{n_{j}}$ is a closed and convex set, $F(v,\cdot)=(F_{1}(v_{1},\cdot),\ldots,F_{m}(v_{m},\cdot))$ with $v=(v_{1},\ldots,v_{m})$ , $F_{j}(v_{j},\cdot):\Xi\times\mathbb{R}^{n}\rightarrow\mathbb{R}^{n_{j}}$ for $j=1,\ldots,m$ and $\sum_{j=1}^{m}n_{j}=n$ . Moreover, we assume each Cartesian component has the constraint form

$$X^{j}=X_{0}^{j}\cap(\cap_{i\in\mathcal{I}_{j}}X_{i}^{j}),$$

where $\{X_{0}^{j}\}\cup\{X_{i}^{j}:i\in\mathcal{I}_{j}\}$ is a collection of closed and convex subsets of $\mathbb{R}^{n_{j}}$ . Also, for every $i\in\mathcal{I}_{j}$ , we assume $X_{i}^{j}$ is representable in $\mathbb{R}^{n_{j}}$ as

$$X_{i}^{j}=\{x\in\mathbb{R}^{n_{j}}:g_{i}(j|x)\leq 0\},$$

for some convex function $g_{i}(j|\cdot):\mathbb{R}^{n_{j}}\rightarrow\mathbb{R}\cup\{\infty\}$ . We thus propose the following distributed method: for each $j=1,\ldots,m$ ,

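$$y_{j}^{k}=\Pi_{X_{0}^{j}}[x_{j}^{k}-\alpha_{k,j}(F_{j}(v_{j}^{k},x^{k})+\epsilon_{k,j}x_{j}^{k})], \tag{17}$$

$$x_{j}^{k+1}=\Pi_{X_{0}^{j}}\left[y_{j}^{k}-\beta_{k,j}\frac{g^{+}_{\omega_{k,j}}(j|y_{j}^{k})}{\|d_{j}^{k}\|^{2}}d_{j}^{k}\right], \tag{18}$$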

where, for every agent $j=1,\ldots,m$ , $\{\alpha_{k,j},\beta_{k,j}\}$ are stepsize sequences, $\{\epsilon_{k,j}\}$ is the regularization parameter sequence, $\{v^{k}_{j}\}$ is the sample sequence, $\{\omega_{k,j}\}$ is the random control and $d^{k}_{j}\in\partial g^{+}_{\omega_{k,j}}(j|y^{k}_{j})\setminus\{0\}$ if $g_{\omega_{k,j}}(j|y^{k}_{j})>0$ , and $d^{k}_{j}=d$ for any $d\in\mathbb{R}^{n_{j}}\setminus\{0\}$ otherwise. Method (13)-(14) is the special case of (17)-(18) with $m=1$ .
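A minimal sketch of the distributed scheme for $m=2$ agents with scalar decisions is given below. The bilinear zero-sum game (agent 0 minimizes $x_{0}x_{1}$, agent 1 minimizes $-x_{0}x_{1}$), the intervals used as hard and soft constraints, the per-agent offsets `eta` that make the stepsizes asynchronous, and the exponents are all hypothetical choices for illustration; they are not the parameter rules established in the paper. Each agent uses only its own constraints, its own sample and its own parameters, while reading the current joint iterate.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2
eta = np.array([1.0, 5.0])  # assumed per-agent offsets: asynchronous parameters

def F_j(j, v_j, x):
    """Agent j's sampled operator component for the bilinear game, plus noise."""
    return (x[1] if j == 0 else -x[0]) + v_j

# Soft constraints of each agent, as (slope s, bound h): g(x_j) = s * x_j - h <= 0.
soft = [(1.0, 1.0), (-1.0, 1.0)]

def distributed_icp(x0, iters, a=0.6, b=0.3, beta=1.0, sigma=0.1):
    x = x0.copy()
    for k in range(iters):
        x_new = np.empty_like(x)
        for j in range(m):  # each pass of this loop could run on a separate agent
            alpha = (k + eta[j]) ** (-a)
            eps = (k + eta[j]) ** (-b)
            v_j = sigma * rng.standard_normal()
            # Operator step on the hard constraint X_0^j = [-2, 2].
            y_j = np.clip(x[j] - alpha * (F_j(j, v_j, x) + eps * x[j]), -2.0, 2.0)
            # Feasibility step on one randomly chosen soft constraint of agent j.
            s, h = soft[rng.integers(len(soft))]
            violation = max(s * y_j - h, 0.0)
            x_new[j] = np.clip(y_j - beta * violation / (s * s) * s, -2.0, 2.0)
        x = x_new
    return x

x_dist = distributed_icp(np.array([1.5, -1.5]), iters=20000)
print("distance to the equilibrium 0:", np.linalg.norm(x_dist))
```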

We mention the following contributions of methods (13)-(14) and (17)-(18):

(i)

Incremental constraint projection methods for plain monotone SVIs: In Wang and Bertsekas [43], incremental constraint projection methods for SVIs were proposed assuming strong monotonicity with knowledge of the strong monotonicity and Lipschitz moduli. We propose a method with incremental constraint projections for SVIs requiring just monotonicity with no knowledge of the Lipschitz constant, making our method much more general and applicable. Using explicit stepsizes, we establish almost sure asymptotic convergence, a feasibility rate of $O(1/k)$ in terms of the mean squared distance to the feasible set and, in the case of a compact set, a near optimal solvability convergence rate of $O\left(\frac{k^{\delta}\ln k}{\sqrt{k}}\right)$ in terms of the mean dual gap function of the SVI for arbitrarily small $\delta>0$.

(ii)

Incremental constraint projections in distributed methods: Distributed methods for SVIs have recently gained importance in the framework of optimization or equilibrium problems in networks. In this context, one important goal is to allow distributed computation of projections, to allow agents to update their parameters independently, and to drop the strong or strict monotonicity property without indirect regularization, which is hard to cope with in the stochastic setting. The work in Koshal et al. [29] addresses these issues but using exact projections, and to the best of our knowledge, all previous works in distributed methods, even for convex optimization, seem to project exactly. Our main contribution in this context is to include incremental projections in distributed methods for SVI (and in particular for stochastic optimization). In this context, we allow each agent to project randomly onto simpler components of its own decision set without information about other agents’ decision sets. Importantly, we preserve all properties in Koshal et al. [29] just mentioned. The use of incremental projections allows easier computation of projections or flexibility when the constraints are learned via an online procedure. In order to achieve such a contribution, we deal with a more refined convergence analysis and a new partial coordination assumption, not needed in the case of synchronous stepsizes or exact projections:

[TABLE]

where $\alpha_{k,\max}=\max_{i=1,\ldots,m}\alpha_{k,i}$, $\alpha_{k,\min}=\min_{i=1,\ldots,m}\alpha_{k,i}$ and $\epsilon_{k,\min}=\min_{i=1,\ldots,m}\epsilon_{k,i}$. Using explicit asynchronous stepsizes and regularization sequences, we establish a.s. asymptotic convergence, a feasibility rate of $O(1/k)$ in terms of the mean squared distance to the feasible set and, in the case of a compact feasible set, a near optimal solvability convergence rate of $O\left(\frac{k^{\delta}\ln k}{\sqrt{k}}\right)$ in terms of the mean dual gap function of the SVI for arbitrarily small $\delta>0$. The partial coordination (19) appears in the rate statements as a decaying error related to the use of asynchronous stepsizes and asynchronous inexact random projections. To the best of our knowledge, even for the case of exact projections no convergence rates have been reported for iterative distributed methods for SVIs.

(iii)

Weak sharpness property and incremental projections: The weak sharpness property for VIs was proposed in [31]. It has been used as a sufficient condition for finite convergence of algorithms for optimization and VI problems in numerous works, e.g. [31, 10]. To the best of our knowledge, the use of the weak sharpness property as a suitable property for incremental projection methods, as analyzed in this work, has not been addressed before, even for VIs or optimization problems in the deterministic setting. We use an equivalent form of weak sharpness suitable for incremental projections. The proof of such equivalence seems to be new. Using explicit stepsizes without knowledge of the sharp-modulus, we prove a.s. asymptotic convergence, feasibility rate of $O(1/k)$ in terms of the mean squared distance to the feasible set and solvability rate of $O(1/\sqrt{k})$ (up to first order logarithmic terms) in terms of the mean distance to the solution set, for bounded or unbounded feasible sets. We also prove that after a finite number of iterations, any solution of a stochastic optimization problem with linear objective and the same feasible set as the SVI is a solution of the original SVI. We note that the weak sharpness property differs from strong monotonicity, allowing nonunique solutions. In that respect such analysis complements item (i) above.

The paper is organized as follows: Section 2 includes preliminary results such as tools from the projection operator and probability, as well as required preliminaries on the weak sharpness property. Section 3 analyzes the method for weak sharp monotone operators. Subsection 3.4 presents the corresponding complexity analysis. Section 4 deals with the regularized version for general monotone operators. Subsection 4.6 presents the corresponding complexity analysis. We list the assumptions in each section, along with the algorithm statements and their convergence analysis.

2 Preliminaries

2.1 Projection operator and notation

For $x,y\in\mathbb{R}^{n}$, we denote by $\langle x,y\rangle$ the standard inner product and by $\|x\|=\sqrt{\langle x,x\rangle}$ the corresponding Euclidean norm. We shall denote by $\operatorname*{d}(\cdot,C)$ the distance function to a general set $C$, namely, $\operatorname*{d}(x,C)=\inf\{\|x-y\|:y\in C\}$. For $X$ as in Definition 1 we denote $\operatorname*{d}(x):=\operatorname*{d}(x,X)$. By $\operatorname*{\mbox{cl}}C$ and $\mathcal{D}(C)$ we denote the closure and the diameter of the set $C$, respectively. For a closed and convex set $C\subset\mathbb{R}^{n}$, we denote by $\Pi_{C}$ the orthogonal projection onto $C$. For a function $g:\mathbb{R}^{n}\rightarrow\mathbb{R}$ we denote by $g^{+}$ its positive part, defined by $g^{+}(x)=\max\{0,g(x)\}$ for $x\in\mathbb{R}^{n}$. If $g$ is convex, we denote by $\partial g$ its subdifferential and by $\operatorname*{dom}(g)$ its domain.

The following properties of the projection operator are well known; see e.g. Facchinei and Pang [17] and Auslender and Teboulle [1].

Lemma 1.

Take a closed and convex set $C\subset\mathbb{R}^{n}$ . Then

i)

For all $x\in\mathbb{R}^{n},y\in C$ , $\langle x-\Pi_{C}(x),y-\Pi_{C}(x)\rangle\leq 0.$

ii)

For all $x,y\in\mathbb{R}^{n}$ , $\|\Pi_{C}(x)-\Pi_{C}(y)\|\leq\|x-y\|.$

iii)

Let $z\in\mathbb{R}^{n}$ , $x\in C$ with $y:=\Pi_{C}[x-z]$ . Then for all $u\in C$ ,

[TABLE]
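As a quick numerical sanity check of items (i) and (ii) of Lemma 1, the following sketch takes $C$ to be the Euclidean unit ball. This choice is an assumption made purely for illustration, since its projection has a closed form; the helper name `project_ball` is hypothetical and not part of the paper.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Orthogonal projection onto the closed Euclidean ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

rng = np.random.default_rng(0)
max_inner = -np.inf   # largest observed <x - P(x), y - P(x)>, item (i) predicts <= 0
max_ratio = 0.0       # largest observed ||P(x) - P(z)|| / ||x - z||, item (ii) predicts <= 1
for _ in range(1000):
    x = 3.0 * rng.standard_normal(5)           # arbitrary point of R^5
    z = 3.0 * rng.standard_normal(5)           # second arbitrary point of R^5
    y = project_ball(rng.standard_normal(5))   # a point of C
    px, pz = project_ball(x), project_ball(z)
    max_inner = max(max_inner, float(np.dot(x - px, y - px)))
    max_ratio = max(max_ratio, float(np.linalg.norm(px - pz) / np.linalg.norm(x - z)))
```

Over the sampled points, `max_inner` should not exceed zero and `max_ratio` should not exceed one, up to rounding.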

The following lemma will be used in the analysis of the methods of Sections 3 and 4. It is proved in Nedić [32] and Polyak [38], but in a slightly different form, suitable for convex optimization problems. The changes required for the case of monotone variational inequalities are straightforward.

Lemma 2.

Consider a closed and convex $X_{0}\subset\mathbb{R}^{n}$ , and let $g:\mathbb{R}^{n}\to\mathbb{R}\cup\{\infty\}$ be a convex function with $\operatorname*{dom}(g)\subset X_{0}$ . Suppose that there exists $C_{g}>0$ such that $\|z\|\leq C_{g}$ for all $x\in X_{0}$ and all $z\in\partial g^{+}(x)$ . Take $x_{1}\in X_{0}$ , $u\in\mathbb{R}^{n}$ , $\alpha>0$ , $\beta\in(0,2)$ and define $y,x_{2}\in X_{0}$ as

[TABLE]

where $d\in\mathbb{R}^{n}-\{0\}$ is such that $d\in\partial g^{+}(y)-\{0\}$ if $g^{+}(y)>0$ . Then for any $x_{0}\in X_{0}$ such that $g^{+}(x_{0})=0$ , and any $\tau>0$ , it holds that

[TABLE]

Remark 1.

We remark that if $\operatorname*{dom}(g)=\mathbb{R}^{n}$ and the subgradients of $g^{+}$ are uniformly bounded over $\mathbb{R}^{n}$ , then the result of Lemma 2 holds with $y\in\mathbb{R}^{n}$ given as $y=x_{1}-\alpha u,$ instead of $y=\Pi_{X_{0}}\left[x_{1}-\alpha u\right]$ .
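To make the second step of Lemma 2 concrete, the sketch below applies it to an affine constraint $g(x)=\langle a,x\rangle-b$ in the setting of Remark 1 ($X_{0}=\mathbb{R}^{n}$). We assume here that the step has the usual relaxed subgradient projection form $x_{2}=y-\beta\frac{g^{+}(y)}{\|d\|^{2}}d$; the function name and the numerical data are hypothetical.

```python
import numpy as np

def subgradient_projection_step(y, a, b, beta=1.0):
    """One relaxed subgradient projection step for g(x) = <a, x> - b (assumed form of the step)."""
    violation = max(0.0, float(np.dot(a, y) - b))   # g^+(y)
    if violation == 0.0:
        return y.copy()                             # y already satisfies g(y) <= 0, so the step vanishes
    d = a                                           # gradient of g, a subgradient of g^+ wherever g > 0
    return y - beta * violation / np.dot(d, d) * d

a, b = np.array([1.0, 2.0]), 1.0
y = np.array([3.0, 4.0])
x_exact = subgradient_projection_step(y, a, b, beta=1.0)   # beta = 1: lands on the hyperplane <a, x> = b
x_under = subgradient_projection_step(y, a, b, beta=0.5)   # beta = 1/2: removes half of the violation
```

For a halfspace, $\beta=1$ reproduces the exact projection, while $\beta\in(0,1)$ and $\beta\in(1,2)$ correspond to under- and over-relaxation.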

The abbreviation “a.s.” means “almost surely” and the abbreviation “i.i.d.” means “independent and identically distributed”. Given sequences $\{x^{k}\}$ and $\{y^{k}\}$, the notation $x^{k}=O(y^{k})$ or $x^{k}\lesssim y^{k}$ means that there exists $C>0$ such that $\|x^{k}\|\leq C\|y^{k}\|$ for all $k$. The notation $x^{k}\sim y^{k}$ means $x^{k}\lesssim y^{k}$ and $y^{k}\lesssim x^{k}$. Given a $\sigma$-algebra $\mathcal{F}$ and a random variable $\xi$, we denote by $\mathbb{E}[\xi]$ and $\mathbb{E}[\xi|\mathcal{F}]$ the expectation and conditional expectation, respectively. Also, we write $\xi\in\mathcal{F}$ for “$\xi$ is $\mathcal{F}$-measurable”. $\sigma(\xi_{1},\ldots,\xi_{n})$ indicates the $\sigma$-algebra generated by the random variables $\xi_{1},\ldots,\xi_{n}$. $\mathbb{N}_{0}$ denotes the set of natural numbers including zero. For $m\in\mathbb{N}$, we use the notation $[m]:=\{1,\ldots,m\}$. For $r\in\mathbb{R}$, $\lceil r\rceil$ denotes the smallest integer greater than or equal to $r$. We denote by $\mathbb{R}^{m}_{>0}$ the interior of the nonnegative orthant $\mathbb{R}^{m}_{+}$.

2.2 Probabilistic tools

As in other stochastic approximation methods, a fundamental tool to be used is the following Convergence Theorem of Robbins and Siegmund [41], which can be seen as the stochastic version of the properties of quasi-Fejér convergent sequences.

Theorem 1.

Let $\{y_{k}\},\{u_{k}\},\{a_{k}\},\{b_{k}\}$ be sequences of non negative random variables, adapted to the filtration $\{\mathcal{F}_{k}\}$ , such that a.s. $\sum a_{k}<\infty$ , $\sum b_{k}<\infty$ and for all $k\in\mathbb{N}$ , $\mathbb{E}\big{[}y_{k+1}\big{|}\mathcal{F}_{k}\big{]}\leq(1+a_{k})y_{k}-u_{k}+b_{k}.$ Then a.s. $\{y_{k}\}$ converges and $\sum u_{k}<\infty$ .

We will also use the following result, whose proof can be found in Lemma 10 of Polyak [37].

Theorem 2.

Let $\{y_{k}\},\{a_{k}\},\{b_{k}\}$ be sequences of nonnegative random variables, adapted to the filtration $\{\mathcal{F}_{k}\}$ , such that a.s. $a_{k}\in[0,1]$ , $\sum a_{k}=\infty$ , $\sum b_{k}<\infty$ , $\lim_{k\rightarrow\infty}\frac{b_{k}}{a_{k}}=0$ and for all $k\in\mathbb{N}$ , $\mathbb{E}\big{[}y_{k+1}\big{|}\mathcal{F}_{k}\big{]}\leq(1-a_{k})y_{k}+b_{k}.$ Then a.s. $\{y_{k}\}$ converges to zero.
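A deterministic instance may help to visualize Theorem 2. The sketch below runs the recursion with equality, $y_{k+1}=(1-a_{k})y_{k}+b_{k}$, for the assumed choices $a_{k}=1/(k+1)$ and $b_{k}=1/(k+1)^{2}$, which satisfy $a_{k}\in[0,1]$, $\sum a_{k}=\infty$, $\sum b_{k}<\infty$ and $b_{k}/a_{k}\rightarrow 0$. The function name is hypothetical.

```python
def run_recursion(num_steps, y0=1.0):
    """Iterate y <- (1 - a_k) * y + b_k with a_k = 1/(k+1), b_k = 1/(k+1)^2."""
    y = y0
    for k in range(1, num_steps + 1):
        a_k = 1.0 / (k + 1)        # lies in [0, 1] and is not summable
        b_k = 1.0 / (k + 1) ** 2   # summable, with b_k / a_k = 1/(k+1) -> 0
        y = (1.0 - a_k) * y + b_k
    return y

y_small = run_recursion(10)        # value after 10 steps
y_large = run_recursion(100_000)   # value after 100000 steps, much closer to zero
```

In this instance $(k+1)y_{k}$ grows like a harmonic sum, so $y_{k}$ decays roughly like $\ln k/k$.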

2.3 Weak sharpness

We briefly discuss the weak sharpness property of variational inequalities. For $X\subset\mathbb{R}^{n}$ and $x\in X$ , $\mathbb{N}_{X}(x)$ denotes the normal cone of $X$ at $x$ , given by

[TABLE]

The tangent cone of $X$ at $x\in X$ is defined as

[TABLE]

For a closed and convex set $X$ , the tangent cone at a point $x\in X$ has the following alternative representations (see Rockafellar and Wets [42], Proposition 6.9 and Corollary 6.30):

[TABLE]

where for a given set $Y\subset\mathbb{R}^{n}$ , the polar set $Y^{\circ}$ is defined as $Y^{\circ}=\{v\in\mathbb{R}^{n}:\langle v,y\rangle\leq 0,\forall y\in Y\}.$

In Burke and Ferris [10], the notion of weak sharp minima for the problem $\min_{x\in X}f(x)$ with solution set $X^{*}$ was introduced: there exists $\rho>0$ such that

$$f(x)-f^{*}\geq\rho\operatorname*{d}(x,X^{*}) \tag{22}$$

for all $x\in X$ , where $f^{*}$ is the minimum value of $f$ at $X$ . Relation (22) means that $f-f^{*}$ gives an error bound on the solution set $X^{*}$ . In Burke and Ferris [10], it is proved that if $f$ is a closed, proper, and differentiable convex function and if the sets $X$ and $X^{*}$ are nonempty, closed, and convex, then (22) is equivalent to the following geometric condition: for all $x^{*}\in X^{*}$ ,

[TABLE]
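Two one-dimensional examples may help to fix ideas about the error bound behind weak sharp minima. We assume here that the bound reads $f(x)-f^{*}\geq\rho\operatorname*{d}(x,X^{*})$ for all $x\in X$; both examples are ours and serve only as illustration.

```latex
% Example 1 (weak sharp): f(x) = x on X = [0,1].
X^{*} = \{0\}, \quad f^{*} = 0, \quad
f(x) - f^{*} = x = \operatorname{d}(x, X^{*}) \quad \text{for all } x \in [0,1],
% so the error bound holds with \rho = 1.

% Example 2 (not weak sharp): f(x) = x^{2} on X = [-1,1].
X^{*} = \{0\}, \quad f^{*} = 0, \quad
\frac{f(x) - f^{*}}{\operatorname{d}(x, X^{*})} = |x| \to 0 \quad \text{as } x \to 0,
% so no \rho > 0 satisfies f(x) - f^{*} \geq \rho \operatorname{d}(x, X^{*}) near the solution.
```

The first example is a linear objective over a polyhedron, the prototypical weak sharp case; the second shows that a merely quadratic growth near the solution set is not enough.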

In optimization problems, the objective function can be used for determining regularity of solutions. In variational inequalities one can use for that purpose the above geometric definition or exploit the use of gap functions associated to the VI. The dual gap function $\mathsf{G}:\mathbb{R}^{n}\to\mathbb{R}\cup\{\infty\}$ is defined as

$$\mathsf{G}(x):=\sup_{y\in X}\langle T(y),x-y\rangle. \tag{24}$$
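For illustration, the sketch below evaluates the dual gap function numerically, assuming the standard definition $\mathsf{G}(x)=\sup_{y\in X}\langle T(y),x-y\rangle$. The operator $T(y)=y$, the set $X=[-1,1]$ and the grid discretization are assumptions chosen so that the closed form $\mathsf{G}(x)=x^{2}/4$ is available for comparison; the function names are hypothetical.

```python
import numpy as np

def dual_gap(x, operator, grid):
    """Approximate G(x) = sup_{y in X} <T(y), x - y> by maximizing over a finite grid of X."""
    return max(float(np.dot(operator(y), x - y)) for y in grid)

def identity_operator(y):
    """T(y) = y, a (strongly) monotone operator."""
    return y

grid = [np.array([t]) for t in np.linspace(-1.0, 1.0, 2001)]          # discretization of X = [-1, 1]

gap_at_solution = dual_gap(np.array([0.0]), identity_operator, grid)  # x* = 0 solves VI(T, X)
gap_at_one = dual_gap(np.array([1.0]), identity_operator, grid)       # closed form x^2 / 4 = 0.25
```

The gap vanishes at the solution $x^{*}=0$ and is positive elsewhere in $X$, which is what makes it usable as a solvability metric.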

In the sequel, we denote by $B(0,1)$ the unit ball in $\mathbb{R}^{n}$ and by $X^{*}$ the solution set of VI $(T,X)$ . In order to define a meaningful notion of weak sharpness for VIs, the following statements were considered in Marcotte and Zhu [31]:

(i)

There exists $\rho>0$ , such that for all $x^{*}\in X^{*}$ ,

[TABLE]

(ii)

There exists $\rho>0$ , such that for all $x^{*}\in X^{*}$ ,

[TABLE]

(iii)

For all $x^{*}\in X^{*}$ ,

[TABLE]

(iv)

There exists $\rho>0$ such that for all $x\in X$,

[TABLE]

Statement (iii) is the definition of a weak sharp VI $(T,X)$ given in Marcotte and Zhu [31]. In Theorem 4.1 of Marcotte and Zhu [31], it was proved that (i)-(ii) are equivalent, and that (i)-(iv) are equivalent when $X$ is compact and $T$ is paramonotone (also known as monotone+), i.e., $T$ is monotone and $\langle T(x)-T(y),x-y\rangle=0\Rightarrow T(x)=T(y)$ for all $x,y\in\mathbb{R}^{n}$ (see Iusem [21] for other properties of paramonotone operators).

Relation (28) means that the gap function $\mathsf{G}$ provides an error bound on the solution set $X^{*}$. Paramonotonicity implies that $T$ is constant on the solution set $X^{*}$. Important classes of paramonotone operators are, for example, co-coercive, symmetric monotone and strictly monotone composite operators (see Facchinei and Pang [17], Chapter 2).

Recently, the following assumption was introduced in Yousefian et al. [47]: there exists $\rho>0$ such that for all $x^{*}\in X^{*}$ and all $x\in X$ ,

[TABLE]

Clearly, (29) implies (28). We show next that (29) implies (26) and that the converse statement holds when $T$ is constant on $X^{*}$. Thus, when $T$ is constant on $X^{*}$, (25), (26) and (29) are equivalent, and when $T$ is paramonotone and $X$ is compact, conditions (25)-(29) are all equivalent. Hence, the following proposition, which appears to be new and is proved in the Appendix, gives a precise relation between property (29) and the previous notions of weak sharpness (25)-(28) presented in Marcotte and Zhu [31]. Property (29) is well suited for the incremental constraint projection-type methods considered here.

Proposition 1.

Let $T:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ be a continuous monotone operator and $X\subset\mathbb{R}^{n}$ a closed and convex set. The following holds:

i)

Condition (29) implies (26).

ii)

If $T$ is constant on $X^{*}$ , then (26) implies (29).

Finally, we will use the following result in Theorem 4.2. of Marcotte and Zhu [31]:

Theorem 3.

If $T$ is continuous and there exists $z\in\mathbb{R}^{n}$ such that $-z\in\operatorname*{int}\left(\bigcap_{x\in X^{*}}[\mathbb{T}_{X}(x)\cap\mathbb{N}_{X^{*}}(x)]^{\circ}\right)$ , then $\operatorname*{argmin}_{x\in X}\langle z,x\rangle\subset X^{*}$ .

As a consequence of Theorem 3 under weak sharpness and uniform continuity of $T$ , any algorithm which generates a sequence $\{x^{k}\}$ such that $\operatorname*{d}(x^{k},X^{*})\rightarrow 0$ has the property that after a finite number of iterations $M$ , any solution of the auxiliary program $\min_{x\in X}\langle T(x^{M}),x\rangle,$ with a linear objective, is a solution of the original variational inequality (see Theorem 5.1 in Marcotte and Zhu [31]). When $X$ is a polyhedron, this result can be interpreted as a finite convergence property of algorithms for VI with the weak sharpness property, since a linear program is finitely solvable. Other algorithmic implications of weak sharpness are developed in Marcotte and Zhu [31].

3 An incremental projection method under weak sharpness

In the following section we assume that the feasible set has the form

$$X=X_{0}\cap\left(\bigcap_{i\in\mathcal{I}}X_{i}\right), \tag{30}$$

where $\{X_{0}\}\cup\{X_{i}:i\in\mathcal{I}\}$ is a collection of closed and convex subsets of $\mathbb{R}^{n}$ . We assume that the evaluation of the projection onto $X_{0}$ is computationally easy and that for all $i\in\mathcal{I}$ , $X_{i}$ is representable as

$$X_{i}=\{x\in\mathbb{R}^{n}:g_{i}(x)\leq 0\}, \tag{31}$$

for some convex function $g_{i}$ with $\operatorname*{dom}(g_{i})\subset X_{0}$ . Also we assume that, for every $i\in\mathcal{I}$ , subgradients of $g^{+}_{i}(x)$ at points $x\in X_{0}-X_{i}$ are easily computable and that $\{\partial g_{i}^{+}:i\in\mathcal{I}\}$ is uniformly bounded over $X_{0}$ , that is, there exists $C_{g}>0$ such that

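$$\|d\|\leq C_{g},\quad\forall x\in X_{0},\ \forall i\in\mathcal{I},\ \forall d\in\partial g_{i}^{+}(x). \tag{32}$$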

3.1 Statement of the algorithm

Next we formally state the algorithm.

Algorithm 1 (Incremental constraint projection method).

1. Initialization: Choose the initial iterate $x^{0}\in\mathbb{R}^{n}$, the stepsizes $\{\alpha_{k}\}$ and $\{\beta_{k}\}$, the random controls $\{\omega_{k}\}$ and the operator samples $\{v^{k}\}$.

2. Iterative step: Given $x^{k}$, define:

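$$y^{k}=\Pi_{X_{0}}\left[x^{k}-\alpha_{k}F(v^{k},x^{k})\right], \tag{33}$$

$$x^{k+1}=\Pi_{X_{0}}\left[y^{k}-\beta_{k}\frac{g^{+}_{\omega_{k}}(y^{k})}{\|d^{k}\|^{2}}d^{k}\right], \tag{34}$$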

where $d^{k}\in\partial g^{+}_{\omega_{k}}(y^{k})-\{0\}$ if $g^{+}_{\omega_{k}}(y^{k})>0$ ; $d^{k}=d\in\mathbb{R}^{n}-\{0\}$ if $g^{+}_{\omega_{k}}(y^{k})=0$ .
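The following is a minimal sketch of Algorithm 1 on a toy problem, not a reference implementation. We assume that the iteration consists of an operator step $y^{k}=\Pi_{X_{0}}[x^{k}-\alpha_{k}F(v^{k},x^{k})]$ followed by a relaxed subgradient projection $x^{k+1}=\Pi_{X_{0}}[y^{k}-\beta_{k}g^{+}_{\omega_{k}}(y^{k})\|d^{k}\|^{-2}d^{k}]$. Further assumptions made for illustration: $X_{0}=\mathbb{R}^{n}$ (so $\Pi_{X_{0}}$ is the identity), affine constraints $g_{i}(x)=\langle a_{i},x\rangle-b_{i}$ sampled uniformly, a constant mean operator $T(x)=c$ observed with Gaussian noise, $\alpha_{k}=(k+1)^{-3/4}$ and $\beta_{k}=1$. All names in the code are hypothetical.

```python
import numpy as np

def incremental_projection(c, A, b, x0, num_iters, noise_std=0.5, seed=0):
    """Sketch of an incremental constraint projection method for T(x) = c on X = {x : A x <= b}."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for k in range(num_iters):
        alpha = 1.0 / (k + 1) ** 0.75                          # assumed stepsize: not summable, square summable
        beta = 1.0                                             # assumed relaxation parameter in (0, 2)
        sample = c + noise_std * rng.standard_normal(x.size)   # F(v^k, x^k), unbiased for T(x^k) = c
        y = x - alpha * sample                                 # operator step (X_0 = R^n, no projection needed)
        i = rng.integers(A.shape[0])                           # random control omega_k, uniform over constraints
        violation = max(0.0, float(A[i] @ y - b[i]))           # g^+_{omega_k}(y^k)
        if violation > 0.0:
            d = A[i]                                           # subgradient of g^+_{omega_k} at y^k
            x = y - beta * violation / (d @ d) * d             # relaxed projection toward X_{omega_k}
        else:
            x = y                                              # sampled constraint holds: the step is zero for any d
    return x

# Toy problem: minimize x_1 + x_2 over the triangle {x >= 0, x_1 + x_2 <= 1}; the unique solution is the origin.
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])
c = np.array([1.0, 1.0])
x_final = incremental_projection(c, A, b, x0=[1.5, 1.5], num_iters=20000)
```

In this toy problem $\langle c,x-x^{*}\rangle=x_{1}+x_{2}\geq\|x\|$ on $X$, so an error bound of the weak sharpness type holds with $\rho=1$. Only one constraint is touched per iteration, and the final iterate should lie close to the origin.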

3.2 Discussion of the assumptions

In the sequel we consider the natural filtration

[TABLE]

Next we present the assumptions necessary for our convergence analysis.

Assumption 1 (Consistency).

The solution set $X^{*}$ of VI $(T,X)$ is nonempty.

Assumption 2 (Monotonicity).

The mean operator $T$ in (2) satisfies: for all $y,x\in\mathbb{R}^{n}$ ,

$$\langle T(y)-T(x),y-x\rangle\geq 0.$$

Assumption 3 (Lipschitz-continuity or boundedness).

We suppose $T:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ is continuous and at least one of the following assumptions holds:

(i)

There exists measurable $L(v):\Omega\rightarrow\mathbb{R}_{+}$ with finite second moment, such that a.s. for all $y,x\in\mathbb{R}^{n}$ ,

[TABLE]

We denote $L:=\sqrt{\mathbb{E}[L(v)^{2}]}$ .

(ii)

There exists $C_{F}>0$ such that

[TABLE]

Item (i) implies in particular that $T$ is $L$-Lipschitz continuous. Items (i) and (ii) are both standard in stochastic optimization. Let

[TABLE]

denote the variance of $F(v,x)$ for $x\in\mathbb{R}^{n}$ . Both item (ii) and (10) imply that the variance function $\sigma(\cdot)^{2}$ is bounded above uniformly over $X$ . Item (i) is a weaker assumption since it only requires the map $\sigma(\cdot)^{2}$ to be finite at every point in $X$ (allowing $X$ to be unbounded). Except for Wang and Bertsekas [43] in the strongly monotone case, conditions in item (ii) or in (10) were requested in all the previous literature on SA methods for SVI or stochastic optimization. Under Assumption 3(i), we do not require (10).

Assumption 4 (IID sampling).

The sequence $\{v^{k}\}$ is an independent identically distributed sample sequence of $v$ .

The above assumption implies in particular that a.s. for all $x\in\mathbb{R}^{n}$ and all $k\in\mathbb{N}$ , $\mathbb{E}\big{[}F(v^{k},x)\big{|}\mathcal{F}_{k}\big{]}=T(x).$ We now state the assumptions concerning the incremental projections.

Assumption 5 (Constraint sampling and regularity).

There exists $c>0$ such that a.s. for all $x\in X_{0}$ and all $k\in\mathbb{N}_{0}$ ,

[TABLE]

Assumption 5 is very general and it was assumed in Nedić [32]. For completeness we present next a lemma showing Assumption 5 holds in the relevant case in which the feasible set $X$ satisfies a standard metric regularity property, the number $|\mathcal{I}|$ of constraints is finite (and possibly very large) and an i.i.d. uniform sampling of the constraints is chosen.

Lemma 3 (Sufficient condition for Assumption 5).

Suppose $\{v^{k}\}$ and $\{\omega_{k}\}$ are independent sequences, $|\mathcal{I}|<\infty$ and the following hold:

(i)

The sequence $\{\omega_{k}\}$ is an i.i.d. sample of a random variable $\omega$ taking values on $\mathcal{I}$ such that for some $\lambda>0$ ,

[TABLE]

(ii)

The set $X$ is metric regular: there is $\eta>0$ such that for all $x\in X_{0}$ ,

[TABLE]

Then Assumption 5 holds with $c=\frac{\eta|\mathcal{I}|}{\lambda}$ .

Proof.

Since $\{v^{k}\}$ and $\{\omega_{k}\}$ are independent and the $\{\omega_{k}\}$ ’s are i.i.d., we have that for all $k\in\mathbb{N}_{0}$ , $\omega_{k}$ is independent of $\mathcal{F}_{k}$ . Hence for all $k\in\mathbb{N}_{0}$ and $x\in X$ ,

[TABLE]

using the fact that $\omega_{k}$ has the same distribution as $\omega$ in the second equality, Lemma 3(i) in the first inequality and Lemma 3(ii) in the last inequality. ∎

Item (i) above is satisfied when $\omega$ is uniform over $\mathcal{I}$ , i.e., $\mathbb{P}(\omega=i)=1/|\mathcal{I}|$ for all $i\in\mathcal{I}$ . As an example, item (ii) in Lemma 3 is satisfied for any compact convex set under a Slater condition, as proved by Robinson (see Pang [36]). A particular case of item (ii) occurs when for some $\eta>0$ and all $x\in\mathbb{R}^{n}$ ,

[TABLE]

In this case $g_{i}:=\operatorname*{d}(\cdot,X_{i})$ for $i\in\mathcal{I}$ and the method (33)-(34) may be rewritten as (15)-(16) assuming easy projections onto the soft constraints. Condition (36) is called linear regularity; see Bauschke and Borwein [4], Deutsch and Hundal [16]. As proved by Hoffman, (36) is satisfied for any polyhedron (see Pang [36]).
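The sketch below checks a constraint sampling inequality numerically in a simple case. We assume here that Assumption 5 has the form $\operatorname*{d}(x,X)^{2}\leq c\,\mathbb{E}[(g^{+}_{\omega_{k}}(x))^{2}|\mathcal{F}_{k}]$, and we take, purely for illustration, $X$ to be the nonnegative orthant of $\mathbb{R}^{4}$ written as the intersection of the halfspaces $X_{i}=\{x:x_{i}\geq 0\}$, with $g_{i}=\operatorname*{d}(\cdot,X_{i})$ and uniform sampling.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                        # number of constraints |I| (one halfspace per coordinate)
worst_ratio = 0.0                            # largest observed d(x, X)^2 / E[(g_omega^+(x))^2]
for _ in range(1000):
    x = rng.standard_normal(n)
    g_plus = np.maximum(-x, 0.0)             # g_i^+(x) = d(x, X_i) for X_i = {x : x_i >= 0}
    dist_sq = float(np.sum(g_plus ** 2))     # d(x, X)^2 for the nonnegative orthant X
    mean_sq = float(np.mean(g_plus ** 2))    # E[(g_omega^+(x))^2] under uniform sampling of omega
    if mean_sq > 0.0:                        # skip feasible points, where both sides vanish
        worst_ratio = max(worst_ratio, dist_sq / mean_sq)
```

In this example the ratio equals $|\mathcal{I}|=4$ at every infeasible point, so the assumed inequality holds with $c=|\mathcal{I}|$.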

Assumption 6 (Small stepsizes).

For all $k\in\mathbb{N}$ , $\alpha_{k}>0$ , $\beta_{k}\in(0,2)$ , and

[TABLE]

We remark here that the use of small stepsizes is forced by two factors: the use of approximate projections instead of exact ones, and the stochastic approximation. Indeed, even with exact projections, the method (33)-(34) still requires small stepsizes in order to guarantee asymptotic convergence.
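As a worked example, one admissible family of stepsizes can be checked directly. We rely only on the summability properties that the convergence analysis draws from Assumption 6, namely $\sum_{k}\alpha_{k}=\infty$ and $\sum_{k}\alpha_{k}^{2}\beta_{k}^{-1}(2-\beta_{k})^{-1}<\infty$; the specific choice below is an assumption made for illustration.

```latex
% Assumed choice: \alpha_k = k^{-a} with a \in (1/2, 1], and \beta_k \equiv \beta \in (0,2).
\sum_{k \geq 1} \alpha_k = \sum_{k \geq 1} k^{-a} = \infty \quad (a \leq 1),
\qquad
\sum_{k \geq 1} \alpha_k^{2} = \sum_{k \geq 1} k^{-2a} < \infty \quad (2a > 1),
\qquad
\sum_{k \geq 1} \frac{\alpha_k^{2}}{\beta_k (2 - \beta_k)}
  = \frac{1}{\beta (2 - \beta)} \sum_{k \geq 1} k^{-2a} < \infty .
```

Letting $\beta_{k}$ approach $0$ or $2$ is also possible, provided $\alpha_{k}^{2}$ decays fast enough to keep the last series finite.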

Finally we state the weak-sharpness property assumed only in this section.

Assumption 7 (weak sharpness).

There exists $\rho>0$ , such that for all $x^{*}\in X^{*}$ and all $x\in X$ ,

$$\rho\operatorname*{d}(x,X^{*})\leq\langle T(x^{*}),x-x^{*}\rangle. \tag{37}$$

3.3 Convergence analysis

We need the following lemma whose proof is immediate.

Lemma 4.

Suppose that Assumptions 3(i)-4 hold. Define the function $B:\mathbb{R}^{n}\rightarrow[0,\infty)$ as

[TABLE]

for any $x\in\mathbb{R}^{n}$ . Then, almost surely, for all $x,y\in\mathbb{R}^{n}$ , $k\in\mathbb{N}$ ,

[TABLE]

We now prove an iterative relation to be used in the convergence analysis. We mention that (40) is sufficient for the convergence analysis and includes the case of unbounded $X$ and $T$ . If the operator is bounded or $X_{0}$ is compact, then (42) allows an improvement of the convergence rate given in Section 3.4. In the following we define for all $x\in\mathbb{R}^{n}$ , $k\in\mathbb{N}$ and $\tau>1$ ,

[TABLE]

Lemma 5 (Recursive relations).

Suppose that Assumptions 1-7 hold.

If Assumption 3(i) holds, then for all $x^{*}\in X^{*}$ , $\tau>1$ and $k\in\mathbb{N}$ ,

[TABLE]

and

[TABLE]

If Assumption 3(ii) holds, then for all $x^{*}\in X^{*}$ , $\tau>1$ and $k\in\mathbb{N}$ ,

[TABLE]

and

[TABLE]

Proof.

Take $x^{*}\in X^{*}$ , $\tau>1$ and $k\in\mathbb{N}$ . We claim that

[TABLE]

Indeed, by the definition of the method (33)-(34), we can invoke Lemma 2 with $g:=g_{\omega_{k}}$ , $x_{1}:=x^{k}$ , $x_{2}:=x^{k+1}$ , $y:=y^{k}$ , $x_{0}:=x^{*}$ , $\alpha:=\alpha_{k}$ , $u:=F(v^{k},x^{k})$ , $\beta:=\beta_{k}$ and $d:=d^{k}$ , obtaining (44).

We now take the conditional expectation with respect to $\mathcal{F}_{k}$ in (44) obtaining,

[TABLE]

using $x^{k}\in\mathcal{F}_{k}$ and Assumption 4 in the first inequality, and Assumption 5 in the second inequality.

Next, we will bound the second term in the right hand side of (45). We write

[TABLE]

By monotonicity of $T$ (Assumption 2), the first term in the right hand side of (46) satisfies

[TABLE]

Regarding the second term in the right hand side of (46), the weak sharpness property (Assumption 7) and the fact that $x^{*}\in X^{*}$ imply

[TABLE]

We now observe that $\left|\operatorname*{d}\left(\Pi_{X}(x^{k}),X^{*}\right)-\operatorname*{d}(x^{k},X^{*})\right|\leq\|\Pi_{X}(x^{k})-x^{k}\|=\operatorname*{d}(x^{k},X),$ so that

[TABLE]

From (48)-(49), we get

[TABLE]

Concerning the third term in the right hand side of (46), we have

[TABLE]

using Cauchy-Schwarz inequality in the first inequality, and the definition of $B(x^{*})$ in Lemma 4 in the second inequality. Combining (47), (50) and (51) with (46), we finally get

[TABLE]

We use (52) in (45) and get

[TABLE]

From Lemma 4 and the fact that $x^{k}\in\mathcal{F}_{k}$ , we obtain

[TABLE]

Now we rearrange the last two terms in the right hand side of (53), using the fact that $2ab\leq\lambda a^{2}+\frac{b^{2}}{\lambda}$ for any $\lambda>0$ . With $a:=\operatorname*{d}(x^{k},X)$ , $b:=\mathsf{C}(x^{*})\alpha_{k}$ and $\lambda:=\mathsf{A}_{k,\tau}$ we get

[TABLE]

Putting together relations (53)-(55) and rearranging terms, we finally get (40), as requested.
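For completeness, the elementary inequality $2ab\leq\lambda a^{2}+\frac{b^{2}}{\lambda}$ used in this step follows from expanding a square:

```latex
0 \leq \left( \sqrt{\lambda}\, a - \frac{b}{\sqrt{\lambda}} \right)^{2}
  = \lambda a^{2} - 2ab + \frac{b^{2}}{\lambda}
\quad \Longrightarrow \quad
2ab \leq \lambda a^{2} + \frac{b^{2}}{\lambda}, \qquad \lambda > 0 .
```

The free parameter $\lambda$ is what allows the cross term to be split between a multiple of $\operatorname*{d}(x^{k},X)^{2}$ and a term of order $\alpha_{k}^{2}$.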

Alternatively, we can replace (55) by the bound

[TABLE]

using the fact that $2ab\leq\lambda a^{2}+\frac{b^{2}}{\lambda}$ with $a:=\operatorname*{d}(x^{k},X)$ , $b:=\mathsf{C}(x^{*})\alpha_{k}$ and $\lambda:=\mathsf{A}_{k,\tau}/2$ . Putting together relations (53)-(54) and (56) and rearranging terms, we get (41), as requested.

Suppose now that Assumption 3(ii) holds. In this case, the inequalities in (51) can be replaced by

[TABLE]

using Assumption 3(ii) and the fact that $\|T(x^{*})\|^{2}\leq\mathbb{E}\big{[}\|F(v^{k},x^{*})\|^{2}\big{|}\mathcal{F}_{k}\big{]}\leq 2C_{F}^{2}$ , which follows from Jensen’s inequality, in the last inequality. Hence, combining (47), (50) and (57) we get, instead of (52),

[TABLE]

Using Assumption 3(ii) and (58) in (45) we get

[TABLE]

In view of Assumption 1, we define $\bar{x}^{k}:=\Pi_{X^{*}}(x^{k})$ . Note that $\bar{x}^{k}\in\mathcal{F}_{k}$ because $\Pi_{X^{*}}$ is continuous and $x^{k}\in\mathcal{F}_{k}$ . From (59) we get

[TABLE]

using $x^{k},\bar{x}^{k}\in\mathcal{F}_{k}$ , $\|x^{k}-\bar{x}^{k}\|=\operatorname*{d}(x^{k},X^{*})$ and (59) in the second inequality. We rearrange now the last two terms in the right hand side of (60) (in a way similar to (55) or (56)), and obtain (42) or (43). ∎

Theorem 4 (Asymptotic convergence).

Under Assumptions 1-7, method (33)-(34) generates a sequence $\{x^{k}\}$ which a.s. is bounded and $\lim_{k\rightarrow\infty}\operatorname*{d}(x^{k},X^{*})=0.$ In particular, a.s. all cluster points of $\{x^{k}\}$ belong to $X^{*}$ .

Proof.

We begin by imposing Assumption 3(i). Choose some $x^{*}\in X^{*}$ (Assumption 1) and $\tau>1$. By Assumption 6 and the definitions given in Lemma 5, we have $\sum_{k}\alpha_{k}^{2}<\infty$, $\sum_{k}\alpha_{k}^{2}\mathsf{A}_{k,\tau}^{-1}<\infty$ and $0<\mathsf{B}_{k}\tau\leq\tau$, since $\beta_{k}(2-\beta_{k})\in(0,1]$ for $\beta_{k}\in(0,2)$ and all $k$. Hence, we can invoke (40) in Theorem 1 in order to conclude that, a.s., $\{\|x^{k}-x^{*}\|\}$ converges and, in particular, $\{x^{k}\}$ is bounded.

In view of Assumption 1, we can define $\bar{x}^{k}:=\Pi_{X^{*}}(x^{k})$ . We have $\bar{x}^{k}\in\mathcal{F}_{k}$ because $x^{k}\in\mathcal{F}_{k}$ and $\Pi_{X^{*}}$ is continuous. Since (40) in Lemma 5 holds for any $x^{*}\in X^{*}$ and $\operatorname*{d}(x^{k},X^{*})=\|x^{k}-\bar{x}^{k}\|$ , we conclude that for all $k\in\mathbb{N}$ ,

[TABLE]

using relation (40) and $\bar{x}^{k}\in\mathcal{F}_{k}$ in the second inequality.

We observe that the function $B:X^{*}\rightarrow\mathbb{R}_{+}$ defined in Lemma 4 is locally bounded because $T$ is continuous. Using this fact, the continuity of $\Pi_{X^{*}}$ , the a.s.-boundedness of $\{x^{k}\}$ and $\bar{x}^{k}=\Pi_{X^{*}}(x^{k})$ , we conclude that $\{B(\bar{x}^{k})\}$ and $\{\mathsf{C}(\bar{x}^{k})\}$ are a.s.-bounded. From the a.s.-boundedness of $\{B(\bar{x}^{k})\}$ and $\{\mathsf{C}(\bar{x}^{k})\}$ and the conditions $\sum_{k}\alpha_{k}^{2}<\infty$ , $\sum_{k}\alpha_{k}^{2}\mathsf{A}_{k,\tau}^{-1}<\infty$ and $0<\mathsf{B}_{k}\tau\leq\tau$ for all $k$ , which hold by Assumption 6, we conclude from Theorem 1 and (61) that a.s. $\{\operatorname*{d}^{2}(x^{k},X^{*})\}$ converges, and

[TABLE]

By Assumption 6, we also have that $\sum_{k}\alpha_{k}=\infty$, so that the above relation implies a.s. $\liminf_{k\rightarrow\infty}\operatorname*{d}(x^{k},X^{*})=0.$ In particular, the sequence $\{\operatorname*{d}(x^{k},X^{*})\}$ has a subsequence that converges to zero almost surely. Since $\{\operatorname*{d}(x^{k},X^{*})\}$ a.s. converges, we conclude that the whole sequence a.s. converges to $0$. The proof under Assumption 3(ii) is similar, using (42). ∎

3.4 Convergence rate analysis

In this subsection we present convergence rate results for the method (33)-(34) under the weak sharpness property (37). The solvability metric will be $\operatorname*{d}(\cdot,X^{*})$ while the feasibility metric will be $\operatorname*{d}(\cdot,X)^{2}$ . We define, for $\ell\leq k$ ,

[TABLE]

where $\widehat{x}^{k}$ is the ergodic average of the iterates and $\widehat{x}^{k}_{\ell}$ is the window-based ergodic average of the iterates when the stepsizes $\{\alpha_{k}\}$ are used to compute the weights. The solvability metric will be given in terms of $\widehat{x}^{k}$ or $\widehat{x}^{k}_{\ell}$ . The definitions of $\widetilde{x}^{k}$ and $\widetilde{x}^{k}_{\ell}$ are analogous, but using $\mathsf{B}_{k}=\beta_{k}(2-\beta_{k})$ for computing the weights. The feasibility metric will be given in terms of such ergodic averages.
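The weighted averages described above can be computed as in the following sketch. We assume the usual form $\widehat{x}^{k}_{\ell}=\sum_{i=\ell}^{k}\alpha_{i}x^{i}\big/\sum_{i=\ell}^{k}\alpha_{i}$, with $\ell=0$ giving $\widehat{x}^{k}$; the function name and the numerical values are hypothetical.

```python
import numpy as np

def ergodic_average(iterates, weights, start=0):
    """Weighted average sum_{i >= start} w_i x^i / sum_{i >= start} w_i of the stored iterates."""
    x = np.asarray(iterates, dtype=float)[start:]
    w = np.asarray(weights, dtype=float)[start:]
    return (w[:, None] * x).sum(axis=0) / w.sum()

iterates = np.array([[4.0, 0.0], [2.0, 2.0], [0.0, 1.0]])     # x^0, x^1, x^2 (made-up values)
alphas = np.array([1.0, 0.5, 0.5])                            # stepsizes alpha_0, alpha_1, alpha_2 used as weights
full_average = ergodic_average(iterates, alphas)              # averages over i = 0, 1, 2
window_average = ergodic_average(iterates, alphas, start=1)   # window-based: averages over i = 1, 2 only
```

Passing the values $\mathsf{B}_{k}=\beta_{k}(2-\beta_{k})$ as `weights` gives the analogous averages $\widetilde{x}^{k}$ and $\widetilde{x}^{k}_{\ell}$. A window with $\ell>0$ discards early iterates, which are typically the farthest from the solution set.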

In order to obtain convergence rates for the case of an unbounded feasible set $X$ or unbounded constraint sets $\{X_{0}\}\cup\{X_{i}:i\in\mathcal{I}\}$ , we shall need the following proposition, which ensures that the sequence is bounded in $L^{2}$ . A typical situation is the case in which $X$ is a polyhedron, i.e. $X_{0}=\mathbb{R}^{n}$ and the selected constraints $\{X_{i}\}_{i\in\mathcal{I}}$ are halfspaces, which have easily computable projections but are unbounded sets. If the uniform bound of Assumption 3(ii) holds, then sharper bounds are given in (68). We shall define for $\tau>1$ ,

[TABLE]

and for $\ell\leq k$ ,

[TABLE]

Proposition 2 (Boundedness in $L^{2}$ ).

Suppose that Assumptions 1-7 hold.

Under Assumption 3(i), choose $\tau>1$ , $k_{0}\in\mathbb{N}$ and $0<\gamma<\frac{1}{2(1+\tau)L^{2}}$ such that

[TABLE]

Then for all $x^{*}\in X^{*}$ ,

[TABLE]

If Assumption 3(ii) holds, then for all $k\in\mathbb{N}$ ,

[TABLE]

Proof.

We first prove (67) under Assumption 3(i). Recall the definitions of $\mathsf{A}_{k,\tau}$ and $\mathsf{B}_{k}\tau$ in (39). By Assumption 6, we can choose $k_{0}\in\mathbb{N}$ and $\gamma>0$ such that (66) holds. Observe that $\beta_{k}(2-\beta_{k})\in(0,1]$ , because $\beta_{k}\in(0,2)$ , so that

[TABLE]

Fix $x\in X^{*}$ and $\tau>1$ . Define

[TABLE]

For any $k>k_{0}$ , we take the total expectation and sum (40) from $k_{0}$ to $k-1$ , obtaining

[TABLE]

Given an arbitrary $a>z_{k_{0}}^{1/2}$ , define

[TABLE]

Suppose first that $\Gamma_{a}<\infty$ for all $a>z_{k_{0}}^{1/2}$ . Then by (66), (69) and (70) we get

[TABLE]

using the fact that $\beta_{i}(2-\beta_{i})\in(0,1]$ in the definition of $\mathsf{B}_{i}\tau$ , and the definitions of $\mathsf{A}_{i,\tau}$ , $D_{i}^{2}$ and $D^{2}$ . Hence

[TABLE]

using the fact that $0<\gamma<[2(1+\tau)L^{2}]^{-1}$ . Since $a>z_{k_{0}}^{1/2}$ is arbitrary, it follows that

[TABLE]

using again the fact that $0<\gamma<[2(1+\tau)L^{2}]^{-1}$ . In view of (70)-(71), we have a contradiction with the assumption that $\Gamma_{a}<\infty$ for any $a>z_{k_{0}}^{1/2}$ . Hence, there exists some $\bar{a}>z_{k_{0}}^{1/2}$ such that $\Gamma_{\bar{a}}=\infty$ , so that the set in the right hand side of (70) is empty. In this case we have $\sup_{k\geq k_{0}}z_{k}\leq\bar{a}^{2}<\infty$ . If $\sup_{k\geq k_{0}}z_{k}=z_{k_{0}}$ , then (67) holds trivially, since $1-\mathsf{H}_{\tau}L^{2}\gamma\in(0,1)$ . Otherwise, $\hat{a}:=(\sup_{k\geq k_{0}}z_{k})^{1/2}>z_{k_{0}}^{1/2}$ . From (66), (69), $\beta_{i}\in(0,2)$ and the definitions of $\mathsf{A}_{i,\tau}$ , $\mathsf{B}_{i}\tau$ , $D_{i}^{2}$ and $D$ , we have for all $k\geq k_{0}$ ,

[TABLE]

implying that $\hat{a}^{2}=\sup_{k\geq k_{0}}z_{k}\leq z_{k_{0}}+2(1+\tau)L^{2}\gamma\hat{a}^{2}+D^{2}\gamma,$ so that

[TABLE]

using again $0<\gamma<[2(1+\tau)L^{2}]^{-1}$ . From (72) and the definitions of $\mathsf{G}_{\tau}$ , $\mathsf{H}_{\tau}$ and $D$ , we conclude that (67) holds.

We now prove (68) under Assumption 3(ii). As before, we define

[TABLE]

Taking total expectation in (42) and summing from $0$ to $k-1$, we get

[TABLE]

for all $k\geq 0$ , using the fact that $\beta_{i}\in(0,2)$ and the definitions of $\mathsf{A}_{i,\tau}$ , $\mathsf{B}_{i}\tau$ , $\widehat{D}_{i}^{2}$ , $\mathsf{a}_{0}^{k-1}$ and $\mathsf{b}_{0}^{k-1}$ . We conclude from (73), the definitions of $\mathsf{G}_{\tau}$ and $\mathsf{H}_{\tau}$ , and the monotonicity of the sequences $\{\mathsf{a}_{0}^{k},\mathsf{b}_{0}^{k}\}$ that (68) holds. ∎

Next we will give convergence rate results for the original sequence $\{x^{k}\}$ and for the ergodic average sequences. We consider separately the case of unbounded operators (Assumption 3(i)) and the case of bounded ones (Assumption 3(ii)), because in the latter case sharper rates are possible. In the remainder of this subsection, we refer the reader to definitions (38), (39), (62)-(63) and (64)-(65).

Theorem 5 (Solvability and feasibility rates of convergence: unbounded case).

Suppose that Assumptions 1-7 and Assumption 3(i) hold. Choose $\tau>1$ , $k_{0}\in\mathbb{N}$ and $\phi\in(0,1)$ such that

[TABLE]

Define for $x^{*}\in X^{*}$ ,

[TABLE]

Then $\operatorname*{d}(x^{k},X^{*})$ a.s.-converges to $0$ and the following holds:

a)

For any $\epsilon>0$ , there exists $M:=M_{\epsilon}\in\mathbb{N}$ , such that for all $x^{*}\in X^{*}$ , $\mathbb{E}\left[\operatorname*{d}(x^{M},X^{*})\right]<\epsilon\leq\mathsf{E}_{\infty}(x^{*},k_{0},1/2\rho,1)/\mathsf{S}_{0}^{M-1}.$

b)

For all $k\in\mathbb{N}$ and all $x^{*}\in X^{*}$ , $\mathbb{E}\left[\operatorname*{d}(\widehat{x}^{k},X^{*})\right]\leq\mathsf{E}_{k}(x^{*},k_{0},1/2\rho,1)/\mathsf{S}_{0}^{k}.$

c)

For any $\epsilon>0$ , there exists $N:=N_{\epsilon}\in\mathbb{N}$ , such that for all $x^{*}\in X^{*}$ , $\mathbb{E}\left[\operatorname*{d}(x^{N},X)^{2}\right]<\epsilon\leq\mathsf{E}_{\infty}(x^{*},k_{0},2\mathsf{G}_{\tau},2)/\mathsf{Z}_{0}^{N-1}.$

d)

For all $k\in\mathbb{N}$ and all $x^{*}\in X^{*}$ , $\mathbb{E}\left[\operatorname*{d}(\widetilde{x}^{k},X)^{2}\right]\leq\mathsf{E}_{k}(x^{*},k_{0},2\mathsf{G}_{\tau},2)/\mathsf{Z}_{0}^{k}.$

Proof.

Fix $\tau>1$ , $k_{0}\in\mathbb{N}$ and $\phi\in(0,1)$ as in (74). This is possible because $\sum_{i\geq k}\alpha_{i}^{2}\beta_{i}^{-1}(2-\beta_{i})^{-1}$ converges to $0$ as $k\rightarrow\infty$ by Assumption 6. We now invoke Lemma 5. We take the total expectation in (40) and sum from $\ell$ to $k$ , obtaining, for every $x^{*}\in X^{*}$ ,

[TABLE]

using $\beta_{i}(2-\beta_{i})\in(0,1]$ and the definitions of $\mathsf{A}_{i,\tau}$ , $\mathsf{B}_{i}\tau$ , $\mathsf{G}_{\tau}$ , $\mathsf{H}_{\tau}$ , $\mathsf{a}_{\ell}^{k}$ and $\mathsf{b}_{\ell}^{k}$ in the last inequality.

We now invoke Proposition 2. Setting $\gamma:=\frac{\phi}{2(1+\tau)L^{2}}$ , (66) can be rewritten as (74). From (67) and $1-\mathsf{H}_{\tau}L^{2}\gamma\in(0,1)$ , we get, for all $x^{*}\in X^{*}$ ,

[TABLE]

using the definitions of $\mathsf{H}_{\tau}=2(1+\tau)$ , $\gamma$ and $\mathsf{I}(x^{*},k_{0})$ .

We prove now item (a). For every $\epsilon>0$ , define

[TABLE]

From the definition of $M$ we have, for every $k<M$ ,

[TABLE]

We claim that $M$ is finite. Indeed, if $M=\infty$ , then (77), (78) and (80) hold for $\ell:=0$ and all $k\in\mathbb{N}$ . Hence, letting $k\rightarrow\infty$ and using that $\mathsf{a}_{0}^{\infty}<\infty$ and $\mathsf{b}_{0}^{\infty}<\infty$ , which hold by Assumption 6, we obtain $\sum_{k}\alpha_{k}<\infty$ , which contradicts Assumption 6. Hence, the set in the right hand side of (79) is nonempty, which implies $\mathbb{E}[\operatorname*{d}(x^{M},X^{*})]<\epsilon$ . Setting $\ell:=0$ and $k:=M-1$ in (77), (78) and (80), we get for all $x^{*}\in X^{*}$ ,

[TABLE]

using the definition of $\mathsf{E}_{k}(x^{*},k_{0},1/2\rho,1)$ . We thus obtain item (a).

We now prove item (b). In view of the convexity of the function $x\longmapsto\operatorname*{d}(x,X^{*})$ , and the linearity and monotonicity of the expected value, we have

[TABLE]

Set $\ell:=0$ , divide (77) by $2\rho\sum_{i=0}^{k}\alpha_{i}=2\rho\mathsf{S}_{0}^{k}$ and use (81), the definition of $\mathsf{E}_{k}(x^{*},k_{0},1/2\rho,1)$ together with (78), in order to bound $\sup_{i\geq 0}\mathbb{E}[\|x^{i}-x^{*}\|^{2}]$ , and obtain item (b) as a consequence.

The proofs of items (c) and (d) follow the proofline above, using (41) instead of (40). ∎

Corollary 1 (Solvability and feasibility rates with robust stepsizes: unbounded case).

Assume that the hypotheses of Theorem 5 hold. Given $\theta>0$ and $\lambda>0$ , define $\{\alpha_{k}\}$ as: $\alpha_{0}=\alpha_{1}=\theta$ and for $k\geq 2$ ,

[TABLE]

and choose $\beta_{k}\equiv\beta\in(0,2)$ , $\tau>1$ and $\phi\in(0,1)$ . Take $k_{0}\geq 2$ as the minimum natural number such that

[TABLE]

Define

[TABLE]

Then $\operatorname*{d}(x^{k},X^{*})$ a.s.-converges to zero and the following holds:

a)

For every $\epsilon>0$ , there exists $M=M_{\epsilon}\geq 2$ such that

[TABLE]

b)

For all $k\geq 2$ ,

[TABLE]

c)

For every $\epsilon>0$ , there exists $N=N_{\epsilon}\in\mathbb{N}$ such that

[TABLE]

d)

For all $k\in\mathbb{N}_{0}$ ,

[TABLE]

Proof.

We estimate $k_{0}$ in (76). Since

[TABLE]

we conclude from (74) that it is enough to choose the minimum $k_{0}\geq 2$ such that

[TABLE]

that is to say, the minimum $k_{0}\geq 2$ such that (83) holds.

Let $k\geq 2$ . We first estimate the sum of the stepsize sequence. For any $0\leq\ell\leq k$ we have

[TABLE]

using the fact that the minimum stepsize between $\ell$ and $k\geq 2$ is $\theta k^{-\frac{1}{2}}(\ln k)^{-\frac{1+\lambda}{2}}$ . The sum of the squares of the stepsize sequence can be estimated as

[TABLE]

We assume without loss of generality that we have $M\geq 2$ in (79). Item (a) follows from (84) with $k:=M-1$ and $\ell:=0$ , (85), Theorem 5(a) and the definitions of $\mathsf{J}_{\beta}(x^{*},k_{0},1)$ , $\mathsf{E}_{\infty}(x^{*},k_{0},1/2\rho,1)$ and $\mathsf{a}_{0}^{\infty}=\beta(2-\beta)\mathsf{b}_{0}^{\infty}$ .

Similarly, item (b) follows from (84)-(85) with $\ell:=0$ , Theorem 5(b) and the definitions of $\mathsf{J}_{\beta}(x^{*},k_{0},1)$ and $\mathsf{E}_{k}(x^{*},k_{0},1/2\rho,1)$ and the facts that $\mathsf{b}^{k}_{0}\leq\mathsf{b}_{0}^{\infty}$ and $\mathsf{a}_{0}^{\infty}=\beta(2-\beta)\mathsf{b}_{0}^{\infty}$ .

The proof of items (c) and (d) follows a similar proofline, using Theorem 5(c)-(d) and the fact that $\mathsf{Z}_{0}^{k}=\beta(2-\beta)(k+1)$ . ∎
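The two properties of the stepsizes used in this proof (a divergent sum and a summable sum of squares) can be checked numerically. The following is a minimal sketch, assuming the schedule has the form $\alpha_{0}=\alpha_{1}=\theta$ and $\alpha_{k}=\theta k^{-1/2}(\ln k)^{-(1+\lambda)/2}$ for $k\geq 2$ , which is the order stated for Corollaries 1-2; the function names and the integral-test bound are illustrative and not part of the paper.

```python
import math


def robust_stepsizes(theta, lam, num_steps):
    """Return [alpha_0, ..., alpha_{num_steps-1}] for the assumed schedule."""
    alphas = []
    for k in range(num_steps):
        if k < 2:
            alphas.append(theta)  # alpha_0 = alpha_1 = theta
        else:
            # assumed form: theta * k^(-1/2) * (ln k)^(-(1 + lam)/2)
            alphas.append(theta / (math.sqrt(k) * math.log(k) ** ((1.0 + lam) / 2.0)))
    return alphas


def tail_square_bound(theta, lam):
    """Integral-test bound on sum_{k>=3} alpha_k^2 (the summand is decreasing)."""
    return theta ** 2 / (lam * math.log(2.0) ** lam)


alphas = robust_stepsizes(theta=1.0, lam=2.0, num_steps=10_000)
sum_alpha = sum(alphas)                       # keeps growing with num_steps
sum_sq_tail = sum(a * a for a in alphas[3:])  # stays below tail_square_bound
print(sum_alpha, sum_sq_tail, tail_square_bound(1.0, 2.0))
```

The bound in `tail_square_bound` comes from $\int_{2}^{\infty}\frac{dx}{x(\ln x)^{1+\lambda}}=\frac{1}{\lambda(\ln 2)^{\lambda}}$ , which is why $\lambda>0$ is needed for square summability.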

Next we give convergence rates for the bounded case. For simplicity we just state the rates for the ergodic averages, but we note that similar rates can be derived for $x^{k}$ as in Theorem 5 and Corollary 1.

Theorem 6 (Solvability and feasibility rates: bounded case).

Suppose that Assumptions 1-7 and Assumption 3(ii) hold. Choose $\tau>1$ . Define for $\ell\leq k$ in $\mathbb{N}_{0}\cup\{\infty\}$ ,

[TABLE]

Then, $\operatorname*{d}(x^{k},X^{*})$ a.s.-converges to zero and

a)

For all $k\in\mathbb{N}$ , $\mathbb{E}\left[\operatorname*{d}(\widehat{x}^{k},X^{*})\right]\leq\mathsf{E}_{0}^{k}[\operatorname*{d}(x^{0},X^{*}),1/2\rho,1]/\mathsf{S}_{0}^{k}.$

b)

If $X_{0}$ is compact, then for all $\ell,k\in\mathbb{N}$ with $\ell<k$ , $\mathbb{E}\left[\operatorname*{d}(\widehat{x}_{\ell}^{k},X^{*})\right]\leq\mathsf{E}_{\ell}^{k}[\operatorname*{\mathcal{D}}(X_{0}),1/2\rho,1]/\mathsf{S}_{\ell}^{k}.$

c)

For all $k\in\mathbb{N}$ , $\mathbb{E}\left[\operatorname*{d}(\widetilde{x}^{k},X)^{2}\right]\leq\mathsf{E}_{0}^{k}[\operatorname*{d}(x^{0},X^{*}),2\mathsf{G}_{\tau},2]/\mathsf{Z}_{0}^{k}.$

d)

If $X_{0}$ is compact, then for all $\ell,k\in\mathbb{N}$ with $\ell<k$ , $\mathbb{E}\left[\operatorname*{d}(\widetilde{x}_{\ell}^{k},X)^{2}\right]\leq\mathsf{E}_{\ell}^{k}[\operatorname*{\mathcal{D}}(X_{0}),2\mathsf{G}_{\tau},2]/\mathsf{Z}_{\ell}^{k}.$

Proof.

Fix $\tau>1$ . We will invoke Lemma 5. We take the total expectation in (42) and sum from $\ell$ to $k$ , obtaining

[TABLE]

using the fact that $\beta_{i}(2-\beta_{i})\in(0,1]$ and the definitions of $\mathsf{A}_{i,\tau}$ , $\mathsf{B}_{i}\tau$ , $\mathsf{G}_{\tau}$ , $\mathsf{H}_{\tau}$ , $\mathsf{a}_{\ell}^{k}$ and $\mathsf{b}_{\ell}^{k}$ in last inequality. From (86) on, the proofs of items (a)-(b) are similar to the proof of Theorem 5. We omit the details, but make the following remarks: differently to the proofs of items (a)-(b) in Theorem 5, the proofs of items (a)-(b) of Theorem 6 do not require Proposition 2. In the proof of item (b), we use the bound $\mathbb{E}[\operatorname*{d}(x^{\ell},X^{*})^{2}]\leq\operatorname*{\mathcal{D}}(X_{0})^{2}$ in (86). The proofs of items (c)-(d) follow a similar proofline, using (43). ∎

Corollary 2 (Solvability and feasibility rates with robust stepsizes: bounded case).

Assume that the hypotheses of Theorem 6 hold. Given $\theta>0$ and $\lambda>0$ , define $\{\alpha_{k}\}$ as: $\alpha_{0}=\alpha_{1}=\theta$ and for $k\geq 2$ ,

[TABLE]

and choose $\beta_{k}\equiv\beta\in(0,2)$ , $\tau>1$ . Define

[TABLE]

Then $\operatorname*{d}(x^{k},X^{*})$ a.s.-converges to zero and

a)

for all $k\geq 2$ ,

[TABLE]

b)

if $X_{0}$ is compact, then given $r\in(0,1)$ , for all $k\geq 2r^{-1}$ , it holds that

[TABLE]

c)

For all $k\in\mathbb{N}_{0}$ ,

[TABLE]

Proof.

Item (a) follows from (84)-(85) with $\ell:=0$ , Theorem 6(a), the definition of $\mathsf{\widehat{J}}_{\beta}[1]$ , $\mathsf{E}_{0}^{k}[\operatorname*{d}(x^{0},X^{*}),1/2\rho,1]$ and the facts that $\mathsf{b}^{k}_{0}\leq\mathsf{b}_{0}^{\infty}$ and $\mathsf{a}_{0}^{\infty}=\beta(2-\beta)\mathsf{b}_{0}^{\infty}$ .

The proof of item (c) follows a similar proofline, using Theorem 6(c) and $\mathsf{Z}_{0}^{k}=\beta(2-\beta)(k+1)$ .

We now prove item (b). Let $r\in(0,1)$ , $k\geq 2r^{-1}$ and set $\ell:=\lceil rk\rceil$ . We have $\ell\geq 2$ and $rk\leq\ell\leq rk+1$ . We estimate

[TABLE]

From (84) and (88) we have

[TABLE]

using the inequality $\ell\geq rk$ in the second inequality of (89) and $k-\ell+1\geq(1-r)k$ in the second inequality of (90). Item (b) follows from (89)-(90), Theorem 6(b), the definition of $\mathsf{\widehat{J}}_{\beta}[1]$ and $\mathsf{E}^{k}_{\ell}[\operatorname*{\mathcal{D}}(X_{0}),1/2\rho,1]$ and the fact that $\beta(2-\beta)\mathsf{b}^{k}_{\ell}=\mathsf{a}^{k}_{\ell}$ . ∎

Remark 2.

Corollary 2(b) implies that, if $X_{0}$ is compact, then $\operatorname*{d}(\widehat{x}^{k}_{\lceil rk\rceil},X^{*})$ has a better performance than $\operatorname*{d}(x^{k},X^{*})$ and $\operatorname*{d}(\widehat{x}^{k},X^{*})$ when stepsizes as in (87) are used. Indeed, in Corollary 2(c), $\lambda>0$ can be arbitrarily small, without affecting the constant in the convergence rate, and the “stochastic error” $r^{-1}\mathsf{\widehat{J}}_{\beta}[1]\left[\ln k-\ln(1/r)\right]^{-(1+\lambda)}$ decays to zero. For unbounded operators, (83) in Corollary 1 suggests the use of $\lambda>1$ and $\theta\sim L$ so that $k_{0}$ does not become too large. As an example, if $\tau=1.5$ , $\theta=L$ , $\beta=1$ , $\phi=0.5$ and $\lambda=2$ , we have $k_{0}=11$ . For simplicity we do not state the analogous result for $\operatorname*{d}(\widetilde{x}^{k}_{\lceil rk\rceil},X)^{2}$ .
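The window-based average $\widehat{x}^{k}_{\lceil rk\rceil}$ discussed in this remark can be computed as in the following sketch. It assumes that the ergodic average weights each iterate $x^{i}$ by its stepsize $\alpha_{i}$ and normalizes by $\mathsf{S}_{\ell}^{k}=\sum_{i=\ell}^{k}\alpha_{i}$ ; definitions (64)-(65) are not reproduced in this part of the text, so this weighting is an assumption, and `windowed_average` and the toy trajectory are hypothetical.

```python
import math


def windowed_average(iterates, alphas, r):
    """Weighted average of iterates[ell..k], ell = ceil(r * k), k = last index."""
    k = len(iterates) - 1
    ell = math.ceil(r * k)
    window_x = iterates[ell:k + 1]
    window_a = alphas[ell:k + 1]
    total = sum(window_a)  # plays the role of S_ell^k
    dim = len(iterates[0])
    avg = [sum(a * x[c] for a, x in zip(window_a, window_x)) / total
           for c in range(dim)]
    return ell, avg


# Toy trajectory with k = 10: the first coordinate decays, the second is constant.
iterates = [[1.0 / (i + 1), 2.0] for i in range(11)]
alphas = [1.0] * 11
ell, avg = windowed_average(iterates, alphas, r=0.5)
print(ell, avg)
```

Discarding the first $\lceil rk\rceil$ iterates removes the influence of the early, least accurate points from the average, which is the mechanism behind the improved bound in item (b).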

In Corollaries 1-2, stepsizes $\{\alpha_{k}\}$ of $O(1)k^{-1/2}(\ln k)^{-(1+\lambda)/2}$ are small enough to guarantee asymptotic a.s.-convergence and large enough as to ensure a rate of $O(1)k^{-1/2}(\ln k)^{(1+\lambda)/2}$ . If asymptotic a.s.-convergence of the whole sequence is not the main concern, we show next that one may use larger stepsizes of $O(1)k^{-1/2}$ for ensuring convergence in $L^{1}$ (hence convergence in probability and a.s.-convergence of a subsequence) with a convergence rate of $O(1)k^{-1/2}$ . When a constant stepsize $\alpha$ is used in method (33)-(34), we can also give an error bound on the performance proportional to $\alpha$ . Precisely, for fixed $\beta\in(0,2)$ , we have $\mathbb{E}[\operatorname*{d}(\widehat{x}^{k},X^{*})]\lesssim k^{-1}+O(\alpha)$ and $\mathbb{E}[\operatorname*{d}(\widetilde{x}^{k},X)^{2}]\lesssim k^{-1}+O(\alpha^{2})$ . Such error bounds rigorously justify the practical use of constant stepsizes in incremental methods for machine learning, where only an inexact solution is required.

Corollary 3 (Solvability and feasibility rates for larger stepsizes: bounded case).

Assume that the hypotheses of Theorem 6 hold. Recall the definition of $\mathsf{\widehat{J}}_{\beta}[\cdot]$ in Corollary 2. Choose $\theta>0$ , $\beta_{k}\equiv\beta\in(0,2)$ and $\tau>1$ .

a)

If we choose a constant stepsize $\alpha_{k}\equiv\theta\alpha$ , then for all $k\geq 1$ ,

[TABLE]

b)

If the total number of iterations $\mathsf{K}\geq 1$ is given a priori and for all $k\in[\mathsf{K}]$ , $\alpha_{k}\equiv\frac{\theta}{\sqrt{\mathsf{K}+1}},$ then

[TABLE]

c)

If $X_{0}$ is compact and we choose $\alpha_{0}:=\theta$ and for $k\geq 1$ , $\alpha_{k}:=\frac{\theta}{\sqrt{k}},$ then, given $r\in(0,1)$ , for all $k\geq r^{-1}$ ,

[TABLE]

Proof.

Item (a) follows from Theorem 6(a) and (c) and the definitions of $\mathsf{\widehat{J}}_{\beta}[\cdot]$ , $\mathsf{E}^{k}_{0}[\cdot]$ , $\mathsf{S}_{0}^{k}$ , $\mathsf{Z}_{0}^{k}$ , $\mathsf{a}_{0}^{k}$ and $\mathsf{b}_{0}^{k}$ . Item (b) follows from item (a). We prove now item (c). Take $r\in(0,1)$ , $k\geq r^{-1}$ and set $\ell:=\lceil rk\rceil$ . We have $\ell\geq 1$ and $rk\leq\ell\leq rk+1$ . We estimate

[TABLE]

using the fact that the minimum stepsize between $\ell$ and $k\geq 2$ is $\theta k^{-\frac{1}{2}}$ . We also estimate

[TABLE]

From (91)-(92) we have

[TABLE]

using $\ell\geq rk$ and $k-\ell+1\geq(1-r)k$ . Item (c) follows from (93)-(94), Theorem 6(b) and (d), the definitions of $\mathsf{\widehat{J}}_{\beta}[\cdot]$ and $\mathsf{E}_{\ell}^{k}[\cdot]$ and the fact that $\beta(2-\beta)\mathsf{b}^{k}_{\ell}=\mathsf{a}^{k}_{\ell}$ . ∎

We make a remark concerning the robustness of the stepsize sequence in Corollaries 1, 2 and 3 in the spirit of Nemirovski et al. [34]. The stepsizes presented above are robust in the sense that the knowledge of $L$ is not required and does not interrupt the advance of the method. Also, a scaling of $\theta$ in the stepsize implies a scaling in the convergence rate which is linear in $\max\{\theta,\theta^{-1}\}$ or $\max\{\theta^{2},1\}$ . Note that these properties hold true in the case of an unbounded operator with approximate projections.

We close this section by showing that, in the case of stochastic approximation, the weak sharpness property implies that after a finite number of iterations an auxiliary stochastic program with linear objective solves the original variational inequality. This recovers a similar property satisfied in the deterministic setting (see Marcotte and Zhu [31], Theorem 5.1). We estimate the minimum number of iterations in terms of the condition number $L/\rho^{2}$ , the variance and the distance of $x^{0}$ to the solution set, when $T$ is $L$ -Lipschitz continuous.

We emphasize that the auxiliary problem is still stochastic, and hence, even when $X$ is a polyhedron, we cannot conclude that a finite number of steps of a linear programming algorithm will be enough for finding a solution. It is not clear that switching to an SAA method for stochastic LPs will be computationally more efficient than continuing with our algorithm. Such an issue requires extensive computational experimentation, which we intend to perform in future work. Thus, for the time being we look at the next corollary as a possibly interesting theoretical property of weak-sharp SVIs, i.e. an extension to the stochastic setting of Theorem 4.2 of [31].

Corollary 4 (A stochastic optimization problem).

Suppose that $T$ is $(L,\delta)$ -Hölder continuous with $\delta\in(0,1]$ and that one of the following conditions holds:

1. the assumptions of Corollary 1 hold with $\delta=1$ (unbounded case), or

2. the assumptions of Corollary 2 hold (bounded case).

Then, there exists $\mathsf{V}>0$ , such that for all $k\geq 2$ with $\frac{k}{(\ln k)^{1+\lambda}}>\left(\frac{\mathsf{V}L^{1/\delta}}{\rho^{1+1/\delta}}\right)^{2},$ we have

[TABLE]

Moreover, under condition 1,

[TABLE]

while, under condition 2,

[TABLE]

Proof.

Call $\bar{x}^{k}:=\Pi_{X^{*}}(\widehat{x}^{k})$ . By the choice of $k$ , the definition of $\mathsf{V}$ and Corollaries 1(b) and 2(a), we have

[TABLE]

From the Hölder-continuity of $T$ ,

[TABLE]

using Jensen’s inequality in the first inequality, Hölder’s inequality in the third inequality and (95) in the last inequality.

From Proposition 1, Assumption 7 and the equivalence between (25) and (26), we get that the Euclidean ball of center $-T(\bar{x}^{k})$ and radius $\rho$ is contained in $\bigcap_{x\in X^{*}}[\mathbb{T}_{X}(x)\cap\mathbb{N}_{X^{*}}(x)]^{\circ}.$ By the convexity of the ball and Jensen’s inequality, we have

[TABLE]

From (96) and (97) we get that $-\mathbb{E}[T(\widehat{x}^{k})]\in\operatorname*{int}\big{(}\bigcap_{x\in X^{*}}[\mathbb{T}_{X}(x)\cap\mathbb{N}_{X^{*}}(x)]^{\circ}\big{)}$ . Hence we conclude from Theorem 3 that

[TABLE]

Finally, we observe that $\mathbb{E}\left[T(\widehat{x}^{k})\right]=\mathbb{E}\left[\mathbb{E}\left[F(v,\widehat{x}^{k})\big{|}\mathcal{F}_{k}\right]\right]=\mathbb{E}[F(v,\widehat{x}^{k})],$ using Assumption 4, $\widehat{x}^{k}\in\mathcal{F}_{k}$ and the property $\mathbb{E}[\mathbb{E}[\cdot|\mathcal{F}_{k}]]=\mathbb{E}[\cdot]$ . The result follows from $\mathbb{E}\left[T(\widehat{x}^{k})\right]=\mathbb{E}[F(v,\widehat{x}^{k})]$ and (98). ∎

4 An incremental projection method with regularization for Cartesian SVI

In this section we shall study incremental projections, dropping the weak sharpness property of Section 3 and assuming only monotonicity of the operator. Additionally, we analyze the distributed version of the method, which includes the centralized case ( $m=1$ ) in particular. For the sake of clarity, we present next the Cartesian and constraint structures in such framework.

4.1 Cartesian structure

We assume in this section that the stochastic variational inequality (1)-(2) has a Cartesian structure. We consider the decomposition $\mathbb{R}^{n}=\mathbb{R}^{n_{1}}\times\cdots\times\mathbb{R}^{n_{m}},$ with $n=n_{1}+\ldots+n_{m}$ and furnish this Cartesian space with the standard inner product $\langle x,y\rangle=\sum_{j=1}^{m}\langle x_{j},y_{j}\rangle,$ for $x=(x_{1},\ldots,x_{m})$ and $y=(y_{1},\ldots,y_{m})$ . We suppose that the feasible set $X\subset\mathbb{R}^{n}$ has the form $X=X^{1}\times\cdots\times X^{m},$ where each component $X^{j}\subset\mathbb{R}^{n_{j}}$ is a closed and convex set for $j\in[m]$ . We emphasize that the orthogonal projection under a Cartesian structure is simple: for $x=(x_{1},\ldots,x_{m})\in\mathbb{R}^{n}$ and $Y=Y^{1}\times\ldots\times Y^{m}\subset\mathbb{R}^{n}$ with $x_{j}\in\mathbb{R}^{n_{j}}$ and $Y^{j}\subset\mathbb{R}^{n_{j}}$ , we have $\Pi_{Y}(x)=(\Pi_{Y^{1}}(x_{1}),\ldots,\Pi_{Y^{m}}(x_{m})).$
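The blockwise formula $\Pi_{Y}(x)=(\Pi_{Y^{1}}(x_{1}),\ldots,\Pi_{Y^{m}}(x_{m}))$ can be illustrated with a short sketch. The choice of a box and a Euclidean ball as component sets, and the helper names, are assumptions made only for this example.

```python
import math


def project_box(x, lo, hi):
    """Projection onto the box [lo, hi]^n (componentwise clipping)."""
    return [min(max(xi, lo), hi) for xi in x]


def project_ball(x, radius):
    """Projection onto the Euclidean ball of the given radius centered at 0."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= radius:
        return list(x)
    return [radius * xi / norm for xi in x]


def project_product(blocks, projections):
    """Projection onto Y^1 x ... x Y^m: project each block independently."""
    return [proj(block) for block, proj in zip(blocks, projections)]


x = [[2.0, -3.0], [3.0, 4.0]]  # two blocks, n_1 = n_2 = 2
projections = [
    lambda b: project_box(b, -1.0, 1.0),
    lambda b: project_ball(b, 1.0),
]
p = project_product(x, projections)
print(p)
```

Because each block is handled independently, each agent can compute its own part of the projection without access to the other agents' sets.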

We assume the random variable takes the form $v=(v_{1},\ldots,v_{m}):\Omega\rightarrow\Xi$ , where $v_{j}$ corresponds to the randomness of agent $j$ , and that the random operator $F:\Xi\times\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ has the form $F(v,x)=(F_{1}(v_{1},x),\ldots,F_{m}(v_{m},x)),$ with $F_{j}(v_{j},\cdot):\mathbb{R}^{n}\rightarrow\mathbb{R}^{n_{j}}$ for $j\in[m]$ . From (2), the mean operator has the form $T=(T_{1},\ldots,T_{m})$ with $T_{j}(x)=\mathbb{E}[F_{j}(v_{j},x)]$ for $j\in[m]$ . Such a framework includes stochastic multi-agent optimization and stochastic Nash equilibrium problems as special cases.

4.2 Constraint structure

In order to exploit the use of incremental projections (as in Section 3) in the Cartesian framework, we assume from now on that for $j\in[m]$ , each Cartesian component $X^{j}$ of $X=X^{1}\times\ldots\times X^{m}$ has the following form:

[TABLE]

where $\{X_{0}^{j}\}\cup\{X_{i}^{j}:i\in\mathcal{I}_{j}\}$ is a collection of closed and convex subsets of $\mathbb{R}^{n_{j}}$ . Given $j\in[m]$ , we assume that the projection operator onto $X_{0}^{j}$ is computationally easy to evaluate, and that for every $i\in\mathcal{I}_{j}$ , $X_{i}^{j}$ is representable in $\mathbb{R}^{n_{j}}$ as

[TABLE]

for some convex function $g_{i}(j|\cdot):\mathbb{R}^{n_{j}}\rightarrow\mathbb{R}\cup\{\infty\}$ with domain $\operatorname*{dom}g_{i}(j|\cdot)\subset X_{0}^{j}$ . We denote the positive part of $g_{i}(j|\cdot)$ as $g^{+}_{i}(j|x):=\max\{g_{i}(j|x),0\},$ for $x\in\mathbb{R}^{n_{j}}$ . We also assume that, for every $i\in\mathcal{I}_{j}$ , the subgradients of $g^{+}_{i}(j|\cdot)$ at points $x\in X_{0}^{j}-X_{i}^{j}$ are easily computable and that $\{\partial g^{+}_{i}(j|\cdot):i\in\mathcal{I}_{j}\}$ is uniformly bounded over $X_{0}^{j}$ , i.e., there exists $C_{g}^{j}>0$ such that

[TABLE]

for all $x\in X_{0}^{j}$ , all $i\in\mathcal{I}_{j}$ and all $d\in\partial g^{+}_{i}(j|x)$ .

4.3 Statement of the algorithm

For problems endowed with the Cartesian structure and the constraint structure of Sections 4.1 and 4.2, our method advances in a distributed fashion for each Cartesian component $j\in[m]$ , as in the incremental projection method (33)-(34) with an additional Tykhonov regularization (in order to cope with the plainly monotone case). Precisely, fix the Cartesian component $j\in[m]$ . In a first stage, given the current iterate $x^{k}$ , the method advances in the direction $-F_{j}(v^{k}_{j},x^{k})-\epsilon_{k,j}x^{k}_{j}$ with stepsize $\alpha_{k,j}$ , after taking the sample $v^{k}_{j}$ and choosing the regularization parameter $\epsilon_{k,j}>0$ , producing an auxiliary iterate $y^{k}_{j}$ . In the second stage, a soft constraint $X_{\omega_{k,j}}^{j}$ is randomly chosen with the random control $\omega_{k,j}\in\mathcal{I}_{j}$ , and the method advances in the direction opposite to a subgradient of $g^{+}_{\omega_{k,j}}(j|\cdot)$ at the point $y^{k}_{j}$ with a stepsize $\beta_{k,j}$ , producing the next iterate $x^{k+1}_{j}$ . The iterates are collected in $x^{k+1}$ and the method continues. Formally, the method takes the form:

Algorithm 2 (Regularized incremental projection method: distributed case).

1. Initialization: Choose the initial iterate $x^{0}\in\mathbb{R}^{n}$ , the stepsize sequences $\alpha^{k}=(\alpha_{k,1},\ldots,\alpha_{k,m})\in(0,\infty)^{m}$ and $\beta^{k}=(\beta_{k,1},\ldots,\beta_{k,m})\in(0,2)^{m}$ , the regularization sequence $\epsilon^{k}=(\epsilon_{k,1},\ldots,\epsilon_{k,m})\in(0,\infty)^{m}$ , the random control sequence $\omega^{k}=(\omega_{k,1},\ldots,\omega_{k,m})\in\mathcal{I}_{1}\times\ldots\times\mathcal{I}_{m}$ and the operator sample sequence $v^{k}=(v^{k}_{1},\ldots,v^{k}_{m})$ .

2. Iterative step: Given $x^{k}=(x^{k}_{1},\ldots,x^{k}_{m})$ , define, for each $j\in[m]$ ,

[TABLE]

where $d^{k}_{j}\in\partial g^{+}_{\omega_{k,j}}(j|y^{k}_{j})-\{0\}$ if $g_{\omega_{k,j}}(j|y^{k}_{j})>0$ , and $d^{k}_{j}=d$ for any $d\in\mathbb{R}^{n_{j}}-\{0\}$ if $g_{\omega_{k,j}}(j|y^{k}_{j})\leq 0$ .
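The two stages of the iterative step can be sketched for a single agent as follows. This is an illustration under stated assumptions, not the authors' implementation: the display (102)-(103) is not reproduced in this text, so the projected form of both stages and the Polyak-type scaling $g^{+}/\|d\|^{2}$ of the subgradient step are modeled on the prose description above. The half-space constraint $g(x)=\langle a,x\rangle-b$ and the names `agent_step` and `project_x0` are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def agent_step(x_j, f_j, alpha, eps, beta, a, b, project_x0):
    """One assumed iteration for agent j with soft constraint <a, x> <= b.

    x_j: current block; f_j: sampled value F_j(v_j^k, x^k).
    """
    # Stage 1: step along -(F_j + eps * x_j), then project onto X_0^j.
    y = project_x0([xi - alpha * (fi + eps * xi) for xi, fi in zip(x_j, f_j)])
    # Stage 2: subgradient step on g+(y) = max(<a, y> - b, 0).
    violation = max(dot(a, y) - b, 0.0)
    d = a  # a subgradient of g+ when g(y) > 0; any nonzero vector otherwise
    scale = beta * violation / dot(d, d)  # d != 0 keeps this well defined
    x_next = project_x0([yi - scale * di for yi, di in zip(y, d)])
    return y, x_next


identity = lambda z: list(z)  # X_0^j is the whole space in this toy example
y, x_next = agent_step([2.0, 2.0], [0.0, 0.0], alpha=0.5, eps=0.2, beta=1.0,
                       a=[1.0, 1.0], b=2.0, project_x0=identity)
print(y, x_next)
```

With $\beta=1$ and a half-space constraint, the second stage in this sketch reduces to the exact projection onto the sampled constraint; when the constraint is already satisfied, the step vanishes regardless of the nonzero direction chosen.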

The first stage (102) of the iterative step can be written compactly as

[TABLE]

where $X_{0}:=X_{0}^{1}\times\ldots\times X_{0}^{m},$

[TABLE]

with $\varsigma^{k}:=(\varsigma_{j}^{k})_{j=1}^{m}$ and $D(\alpha)$ denotes the block-diagonal matrix in $\mathbb{R}^{n\times n}$ defined as

[TABLE]

with $\alpha=(\alpha_{1},\ldots,\alpha_{m})\in\mathbb{R}^{m}_{>0}$ , and $I_{n_{j}}\in\mathbb{R}^{n_{j}\times n_{j}}$ denoting the identity matrix for each $j\in[m]$ .

4.4 Discussion of the assumptions

We consider the natural filtration

[TABLE]

Assumption 8.

We request Assumptions 1-4 and Assumption 3(i).

In this section we avoid the weak sharpness property assumed in Section 3. We now state the assumptions concerning the approximate projections which accommodate the Cartesian structure. In simple terms, we require each Cartesian component $X^{j}$ given by (99) to satisfy Assumption 5 of Section 3. This is formally stated in Assumption 9. Also, the agents’ stepsizes and regularization sequences require a partial coordination specified in Assumption 10.

Assumption 9 (Constraint sampling and regularity).

For each $j\in[m]$ , there exists $c^{j}>0$ , such that a.s. for all $k\in\mathbb{N}$ and all $x\in X^{j}_{0}$ ,

[TABLE]

We observe that Assumption 9 requires a sampling coordination between the control sequences $\{\omega_{k,j}\}_{k=0}^{\infty}$ for $j\in[m]$ , since the filtration $\mathcal{F}_{k}$ accumulates the history of the control sequence of every Cartesian component. The next lemma shows that this assumption is immediately satisfied if each agent has a metric regular decision set and the constraint sampling is independent between agents and uniform i.i.d. for each agent.

Lemma 6 (Sufficient condition for Assumption 9).

Suppose that $\{v^{k}\}$ and $\{\omega_{k}\}$ are independent sequences, $\omega_{k,1},\ldots,\omega_{k,m}$ are independent for each $k$ and the following conditions hold: for each $j\in[m]$ , $|\mathcal{I}^{j}|<\infty$ and

(i)

The sequence $\{\omega_{k,j}\}_{k=0}^{\infty}$ is an i.i.d. sample of a random variable $\omega^{j}$ taking values on $\mathcal{I}^{j}$ such that for some $\lambda^{j}>0$ ,

[TABLE]

(ii)

The set $X^{j}$ is metric regular: there is $\eta^{j}>0$ such that for all $x\in X_{0}^{j}$ ,

[TABLE]

Then Assumption 9 holds with $c^{j}=\frac{\eta^{j}|\mathcal{I}^{j}|}{\lambda^{j}}$ for $j\in[m]$ .

Proof.

Since $\{v^{k}\}$ and $\{\omega_{k}\}$ are independent, $\{\omega_{k}\}_{k=0}^{\infty}$ is independent and $\omega_{k,1},\ldots,\omega_{k,m}$ are independent for each $k$ , it follows that for all $k\in\mathbb{N}_{0}$ and $j\in[m]$ , $\omega_{k,j}$ is independent of $\mathcal{F}_{k}$ . The remainder of the proof follows the proof line of Lemma 3. ∎
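The constant $c^{j}=\eta^{j}|\mathcal{I}^{j}|/\lambda^{j}$ can be checked on a small example. The sketch below assumes that condition (i) reads $\mathbb{P}(\omega^{j}=i)\geq\lambda^{j}/|\mathcal{I}^{j}|$ , that condition (ii) reads $\operatorname*{d}(x,X^{j})^{2}\leq\eta^{j}\max_{i}(g_{i}^{+}(j|x))^{2}$ , and that Assumption 9 bounds $\operatorname*{d}(x,X^{j})^{2}$ by $c^{j}$ times the conditional expectation of $(g^{+}_{\omega_{k,j}}(j|x))^{2}$ . These displays are not reproduced in this text, so all three forms are assumptions; the violation values, probabilities and $\eta$ are hypothetical.

```python
def expected_sq_violation(violations, probs):
    """E[(g^+_omega(x))^2] when constraint i is sampled with probability probs[i]."""
    return sum(p * g * g for p, g in zip(probs, violations))


def sampling_constant(eta, num_constraints, lam):
    """c = eta * |I| / lam, as in the conclusion of Lemma 6."""
    return eta * num_constraints / lam


violations = [0.0, 3.0, 1.0, 0.0]  # hypothetical values g_i^+(x), i = 1..4
probs = [0.4, 0.1, 0.3, 0.2]       # non-uniform sampling distribution
n = len(probs)
lam = n * min(probs)               # largest lam with P(omega = i) >= lam / n
eta = 2.0                          # hypothetical metric regularity constant

max_sq = max(g * g for g in violations)
mean_sq = expected_sq_violation(violations, probs)
c = sampling_constant(eta, n, lam)
# Key step: max_i g_i^2 <= (n / lam) * E[g_omega^2], hence eta * max <= c * E.
print(max_sq, mean_sq, c)
```

The inequality holds because every constraint, including the most violated one, is sampled with probability at least $\lambda^{j}/|\mathcal{I}^{j}|$ .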

Assumption 10 (Partial coordination of stepsizes and regularization sequences).

For $j\in[m]$ , consider the stepsize sequences $\{\alpha_{k,j}\}_{k=0}^{\infty}$ and $\{\beta_{k,j}\}_{k=0}^{\infty}$ and the regularization sequence $\{\epsilon_{k,j}\}_{k=0}^{\infty}$ in Algorithm (102)-(103). Without loss of generality, for $j\in[m]$ we add the term $\epsilon_{-1,j}$ to the regularization sequence. We use the notation $u_{k,\min}:=\min_{j\in[m]}u_{k,j}$ , $u_{k,\max}:=\max_{j\in[m]}u_{k,j}$ for $u\in\{\alpha,\beta,\epsilon\}$ , $\Delta_{k}:=\alpha_{k,\max}-\alpha_{k,\min}$ , $\Gamma_{k}:=\epsilon_{k-1,\max}-\epsilon_{k,\min}$ and $\mathsf{B}_{k}:=\beta_{k,\min}(2-\beta_{k,\max})$ . We then assume that $0<\beta_{k,\min}\leq\beta_{k,\max}<2$ and

(i)

For each $j\in[m]$ , $\{\epsilon_{k,j}\}_{k=-1}^{\infty}$ is a decreasing positive sequence converging to zero.

(ii)

$\lim_{k\rightarrow\infty}\frac{\alpha_{k,\max}^{2}}{\alpha_{k,\min}\epsilon_{k,\min}}=0,$ $\lim_{k\rightarrow\infty}\frac{\alpha_{k,\max}^{2}}{\mathsf{B}_{k}\alpha_{k,\min}\epsilon_{k,\min}}=0,$ $\lim_{k\rightarrow\infty}\frac{\Delta_{k}}{\alpha_{k,\min}\epsilon_{k,\min}}=0$ and $\lim_{k\rightarrow\infty}\alpha_{k,\min}\epsilon_{k,\min}=0.$

(iii)

$\sum_{k=0}^{\infty}\alpha_{k,\min}\epsilon_{k,\min}=\infty$ .

(iv)

$\sum_{k=0}^{\infty}\alpha_{k,\max}^{2}<\infty,$ $\sum_{k=0}^{\infty}\frac{\alpha_{k,\max}^{2}}{\mathsf{B}_{k}}<\infty,$ $\sum_{k=0}^{\infty}\left(\frac{\Gamma_{k}}{\epsilon_{k,\min}}\right)^{2}\left(1+\alpha_{k,\min}^{-1}\epsilon_{k,\min}^{-1}\right)<\infty$ and

[TABLE]

(v)

$\lim_{k\rightarrow\infty}\frac{\Gamma_{k}^{2}}{\epsilon_{k,\min}^{3}\alpha_{k,\min}}\left(1+\alpha_{k,\min}^{-1}\epsilon_{k,\min}^{-1}\right)=0.$

Assumption 10 contains the usual conditions on the regularization parameters of Tykhonov algorithms and on the stepsizes of SA algorithms, with a certain coordination across stepsizes and regularization parameters. Assumption 10 includes Assumption 2 in [29] with the addition of (105), due to the use of approximate projections (in addition to asynchronous stepsizes). (We observe that this condition is trivially satisfied with synchronous stepsizes, i.e., $\alpha_{k,j}=\alpha_{k,\ell}$ for all $k,j,\ell$ .) Next we show that Assumption 10 is satisfied by explicit stepsizes and regularization parameters.

Corollary 5 (Asynchronous stepsizes and regularization parameters).

Take $\delta\in(0,\frac{1}{2})$ and real numbers $\underline{C}\leq\overline{C}$ , $\underline{D}\leq\overline{D}$ . The following stepsizes and regularization parameters satisfy Assumption 10: for any $j\in[m]$ and $k\in\mathbb{N}_{0}$ , take $C_{j}\in[\underline{C},\overline{C}]$ , $D_{j}\in[\underline{D},\overline{D}]$ , $\beta_{j}\in(0,2)$ and

[TABLE]

Proof.

Except for condition (105), all other conditions in Assumption 10 are proved in Lemma 4 of [29]. We proceed with the proof of (105). Set $u_{\max}:=\max_{1\leq i\leq m}{u_{i}}$ , $u_{\min}=\min_{1\leq i\leq m}{u_{i}}$ for $u\in\{C,D\}$ and $a:=1/2+\delta$ , $b:=1/2-\delta$ . The claim is proved by showing that

[TABLE]

∎
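The limits in Assumption 10(ii) can be inspected numerically for power-law schedules with exponents $a=1/2+\delta$ and $b=1/2-\delta$ . The sketch below assumes the additive-offset form $\alpha_{k,j}=(k+C_{j})^{-a}$ and $\epsilon_{k,j}=(k+D_{j})^{-b}$ with positive offsets; the display defining the schedules is not reproduced in this text, so this form, the offsets and the function names are assumptions made for illustration.

```python
def schedules(k, C, D, delta):
    """Assumed schedules: alpha_{k,j} = (k + C_j)^(-a), eps_{k,j} = (k + D_j)^(-b)."""
    a, b = 0.5 + delta, 0.5 - delta
    alphas = [(k + c) ** (-a) for c in C]
    epsilons = [(k + d) ** (-b) for d in D]
    return alphas, epsilons


def coordination_ratios(k, C, D, betas, delta):
    """The four quantities of Assumption 10(ii), all of which should vanish."""
    alphas, epsilons = schedules(k, C, D, delta)
    a_min, a_max = min(alphas), max(alphas)
    e_min = min(epsilons)
    B = min(betas) * (2.0 - max(betas))  # B_k for constant beta_j
    return (
        a_max ** 2 / (a_min * e_min),
        a_max ** 2 / (B * a_min * e_min),
        (a_max - a_min) / (a_min * e_min),  # Delta_k / (alpha_min * eps_min)
        a_min * e_min,
    )


C, D, betas, delta = [1.0, 2.0, 3.0], [1.0, 1.5, 2.0], [0.5, 1.0, 1.5], 0.25
r_small = coordination_ratios(10 ** 2, C, D, betas, delta)
r_mid = coordination_ratios(10 ** 4, C, D, betas, delta)
r_large = coordination_ratios(10 ** 6, C, D, betas, delta)
print(r_small, r_mid, r_large)
```

In this form the stepsize gap $\Delta_{k}$ is of order $k^{-a-1}$ while $\alpha_{k,\min}\epsilon_{k,\min}$ is of order $k^{-1}$ , so their ratio vanishes; purely multiplicative constants $C_{j}k^{-a}$ would give a gap of order $k^{-a}$ and the ratio would not vanish unless all constants coincide.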

4.5 Convergence analysis

We present next our convergence result for method (102)-(103). We shall need two lemmas.

Lemma 7 (Eventual strong-monotonicity).

Consider Assumption 8. Define $H_{k}:=D(\alpha_{k})\cdot(T+D(\epsilon_{k}))$ and $\sigma_{k}=\alpha_{k,\min}\epsilon_{k,\min}-L(\alpha_{k,\max}-\alpha_{k,\min}).$ Then for all $y,x\in\mathbb{R}^{n}$ and $k\in\mathbb{N}$ , $\langle H_{k}(y)-H_{k}(x),y-x\rangle\geq\sigma_{k}\|y-x\|^{2}.$

Proof.

We consider the decomposition

[TABLE]

Concerning the second term in the right hand side of (106), if $D_{k}$ is the diagonal matrix with entries $(\alpha_{1}\epsilon_{1},\ldots,\alpha_{m}\epsilon_{m})$ , then

[TABLE]

The first term in the right hand side of (106) is equal to

[TABLE]

The first term in the right hand side of (108) is nonnegative by monotonicity of $T$ . For the second term in the right hand side of (108), we have

[TABLE]

using the Cauchy-Schwarz inequality in the first inequality, Hölder's inequality in the third one and Lipschitz continuity of $T$ in the last one. The result follows from (106)-(109). ∎
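The bound of Lemma 7 can be verified on a small example with two scalar blocks. The sketch below uses the linear operator $T(x)=Mx$ with a skew-symmetric matrix $M$ , which is monotone and Lipschitz continuous with $L=1$ ; this operator, the test points and the numerical values are hypothetical choices made only for the illustration.

```python
def apply_H(x, alpha, eps, M):
    """H(x) = D(alpha) (T(x) + D(eps) x) with T(x) = M x and blocks of size 1."""
    n = len(x)
    Tx = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    return [alpha[i] * (Tx[i] + eps[i] * x[i]) for i in range(n)]


def sigma(alpha, eps, L):
    """sigma = alpha_min * eps_min - L * (alpha_max - alpha_min)."""
    return min(alpha) * min(eps) - L * (max(alpha) - min(alpha))


def gap(x, y, alpha, eps, M, L):
    """<H(y) - H(x), y - x> - sigma * ||y - x||^2, nonnegative by Lemma 7."""
    hx, hy = apply_H(x, alpha, eps, M), apply_H(y, alpha, eps, M)
    z = [yi - xi for xi, yi in zip(x, y)]
    inner = sum((p - q) * zi for p, q, zi in zip(hy, hx, z))
    return inner - sigma(alpha, eps, L) * sum(zi * zi for zi in z)


M = [[0.0, 1.0], [-1.0, 0.0]]  # skew-symmetric: <Mz, z> = 0, operator norm 1
L = 1.0
eps = [0.1, 0.2]
async_alpha, sync_alpha = [0.5, 0.4], [0.4, 0.4]
points = [[-1.0, 2.0], [0.5, 0.5], [3.0, -1.0], [0.0, 0.0]]
worst_async = min(gap(x, y, async_alpha, eps, M, L) for x in points for y in points)
worst_sync = min(gap(x, y, sync_alpha, eps, M, L) for x in points for y in points)
print(sigma(async_alpha, eps, L), sigma(sync_alpha, eps, L), worst_async, worst_sync)
```

In the asynchronous case of this example $\sigma_{k}<0$ , so the bound holds but gives no strong monotonicity; $\sigma_{k}$ becomes positive once $\Delta_{k}/(\alpha_{k,\min}\epsilon_{k,\min})<1/L$ , which Assumption 10(ii) guarantees for large $k$ .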

We will use the following result, proved in Lemma 3 of Koshal et al. [29]:

Lemma 8 (Properties of the Tykhonov sequence).

Assume that $X\subset\mathbb{R}^{n}$ is convex and closed, that the operator $T:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ is continuous and monotone over $X$ and that Assumption 1 holds. Assume also that the positive sequences $\{\epsilon_{k,j}\}_{k=-1}^{\infty}$ for $j\in[m]$ decrease to $0$ and satisfy $\limsup_{k\rightarrow\infty}\frac{\epsilon_{k,\max}}{\epsilon_{k,\min}}<\infty$ , with $\epsilon_{k,\max}:=\max_{j\in[m]}\epsilon_{k,j}$ and $\epsilon_{k,\min}:=\min_{j\in[m]}\epsilon_{k,j}$ . Denote by $t^{k}$ the solution of VI $(T+D(\epsilon_{k}),X)$ . Then

(i)

$\{t^{k}\}$ is bounded and all cluster points of $\{t^{k}\}$ belong to $X^{*}$ .

(ii)

The following inequality holds for all $k\geq 1$ :

[TABLE]

where

[TABLE]

(iii)

If $\limsup_{k\rightarrow\infty}\frac{\epsilon_{k,\max}}{\epsilon_{k,\min}}\leq 1$ then $\{t^{k}\}$ converges to the least-norm solution in $X^{*}$ .
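Item (iii) can be illustrated on a toy problem where the Tykhonov point is available in closed form. The sketch assumes $X=\mathbb{R}^{2}$ , the monotone operator $T(x)=(x_{1}+x_{2}-1)(1,1)$ , whose solution set is the line $x_{1}+x_{2}=1$ , and equal regularization parameters across blocks; the example and the function names are hypothetical and not taken from the paper.

```python
def T(x):
    """Gradient of 0.5 * (x1 + x2 - 1)^2, a monotone operator."""
    s = x[0] + x[1] - 1.0
    return [s, s]


def tykhonov_point(eps):
    """Solution of T(t) + eps * t = 0, i.e. of VI(T + eps * I, R^2)."""
    t = 1.0 / (2.0 + eps)
    return [t, t]


def residual(eps):
    """Largest component of |T(t) + eps * t| at the Tykhonov point."""
    t = tykhonov_point(eps)
    return max(abs(Ti + eps * ti) for Ti, ti in zip(T(t), t))


trajectory = [tykhonov_point(10.0 ** (-p)) for p in range(7)]
print(trajectory[0], trajectory[-1])
```

As the regularization parameter decreases, the points move along the diagonal towards $(1/2,1/2)$ , which is the element of the line $x_{1}+x_{2}=1$ with the smallest norm.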

Recalling (38), we define

[TABLE]

which is a finite quantity, because $\{t^{k}\}$ is bounded and $B(\cdot)$ is a locally bounded function. We also define the following constants for given $\tau>1$ :

[TABLE]

Next we prove the asymptotic convergence of method (102)-(103).

Theorem 7 (Asymptotic convergence).

If Assumptions 8-10 hold, then the method (102)-(103) generates a sequence $\{x^{k}\}$ such that:

(i)

if $\limsup_{k\rightarrow\infty}\frac{\epsilon_{k,\max}}{\epsilon_{k,\min}}<\infty$ , then almost surely $\{x^{k}\}$ is bounded and all cluster points of $\{x^{k}\}$ belong to the solution set $X^{*}$ ,

(ii)

if $\limsup_{k\rightarrow\infty}\frac{\epsilon_{k,\max}}{\epsilon_{k,\min}}\leq 1$ , then almost surely $\{x^{k}\}$ converges to the least-norm solution in $X^{*}$ .

Proof.

In the sequel we denote by $\{t^{k}\}$ the Tykhonov sequence of Lemma 8. Let $x=(x_{j})_{j=1}^{m}\in X$ . We claim that for all $\tau>1$ , $j\in[m]$ , $k\in\mathbb{N}$ ,

[TABLE]

Indeed, in view of (102)-(103) and $x_{j}\in X^{j}\subset X_{0}^{j}\cap X_{\omega_{k,j}}^{j}$ , we can invoke Lemma 2 with $g:=g_{\omega_{k,j}}(j|\cdot)$ , $x_{1}:=x^{k}_{j}$ , $x_{2}:=x^{k+1}_{j}$ , $x_{0}:=x_{j}$ , $\alpha:=\alpha_{k,j}$ , $u:=F_{j}(v^{k}_{j},x^{k})+\epsilon_{k,j}x^{k}_{j}$ , $y:=y^{k}_{j}$ , $\beta:=\beta_{k,j}$ and $d:=d^{k}_{j}$ obtaining (113).

We define for $j\in[m]$ ,

[TABLE]

with $z^{k}:=(z_{j}^{k})_{j=1}^{m}$ . We use the definitions in (112) and sum the inequalities in (113) with $j$ between $1$ and $m$ , getting

[TABLE]

Concerning the second term in the right hand side of (115), we have

[TABLE]

using the definitions in (104) and (114).

We now analyze the third term in the right hand side of (115). The triangle inequality and the inequality $(\sum_{i=1}^{4}a_{i})^{2}\leq 4\sum_{i=1}^{4}a_{i}^{2}$ imply that

[TABLE]

Summing the inequalities in (117) with $j$ between $1$ and $m$ , we get from Assumption 3(i),

[TABLE]

Now we combine (115)-(118), in order to obtain

[TABLE]

where $H_{k,\tau},A_{k,\tau}$ and $G_{\tau}$ are defined as in (112).

The sum in the second term of the right hand side of (119) is equal to

[TABLE]

Recalling the definition of $\Delta_{k}$ in Assumption 10, it follows from Lemma 7 that the first term in the right hand side of (120) satisfies

[TABLE]

The second term in the right hand side of (120) is equal to

[TABLE]

The first term in the right hand side of (122) equals

[TABLE]

Regarding the second term in the right hand side of (122), we use $\Pi_{X^{j}}(x_{j})=x_{j}$ , so that for each $\mu\in(0,1)$ we have

[TABLE]

using the Cauchy-Schwarz inequality in the first inequality, Lemma 1(ii) for $\Pi_{X^{i}}$ in the second one, the fact that $\|T(x)\|\leq B(x)$ in the third one, and the relation $2ab=-(a-b)^{2}+a^{2}+b^{2}$ in the fourth one. Putting together (122)-(124), we finally get that the second term in the right hand side of (120) is bounded by

[TABLE]

For the third term in the right hand side of (120), we have

[TABLE]

Combining (121), (125) and (126) with (120), we obtain

[TABLE]

We use (127) in (119) and finally get the following recursive relation: for all $k\in\mathbb{N}_{0}$ and $x\in X$ ,

[TABLE]

where for $R>0$ we define:

[TABLE]

In the sequel we specify $x:=t^{k}$ and take $\mathbb{E}[\cdot|\mathcal{F}_{k}]$ in (129), getting

[TABLE]

For deriving (130), we use the facts that $x^{k}\in\mathcal{F}_{k}$ , $\mathbb{E}[L(v^{k})^{2}|\mathcal{F}_{k}]=L^{2}$ and $\mathbb{E}[\|F(v^{k},t^{k})\|^{2}|\mathcal{F}_{k}]=2B(t^{k})^{2}\leq 2B_{t}^{2}$ , which hold because $\{v^{k}\}$ is independent of $\mathcal{F}_{k}$ and identically distributed to $v$ , $\|t^{k}\|\leq M_{t}$ and

[TABLE]

in view of the fact that $\mathbb{E}\left[\varsigma^{k}|\mathcal{F}_{k}\right]=0$.

Using the definition of $C$ in (112), we get from Assumption 9 and the fact that $x_{j}^{k}\in\mathcal{F}_{k}$ :

[TABLE]

By (131), the last term in the right hand side of (130) is bounded by

[TABLE]

using the fact that $2ab\leq\lambda a^{2}+\frac{b^{2}}{\lambda}$ with $\lambda:=A_{k,\tau}$ , $a:=\operatorname*{d}(x^{k},X)$ and $b:=(B_{t}+\epsilon_{k,\max}M_{t})\alpha_{k,\max}$ .
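For reference, the elementary inequality invoked here follows from expanding a square, for any $\lambda>0$. The block below only spells out this step with the stated substitutions for $\lambda$, $a$ and $b$; it makes no claim about the remaining constants in (132).

```latex
\begin{align*}
0 &\le \Bigl(\sqrt{\lambda}\,a-\tfrac{b}{\sqrt{\lambda}}\Bigr)^{2}
   = \lambda a^{2}-2ab+\frac{b^{2}}{\lambda}
   \quad\Longrightarrow\quad
   2ab\le\lambda a^{2}+\frac{b^{2}}{\lambda},\\
2\operatorname{d}(x^{k},X)\,(B_{t}+\epsilon_{k,\max}M_{t})\,\alpha_{k,\max}
  &\le A_{k,\tau}\operatorname{d}(x^{k},X)^{2}
   +\frac{(B_{t}+\epsilon_{k,\max}M_{t})^{2}\,\alpha_{k,\max}^{2}}{A_{k,\tau}}.
\end{align*}
```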

Since $t^{k}$ solves VI $(T+D(\epsilon_{k}),X)$ , we have

[TABLE]

Next we relate $\|x^{k}-t^{k}\|^{2}$ to $\|x^{k}-t^{k-1}\|^{2}$ , using the properties of $t^{k}$ (Lemma 8). We have

[TABLE]

Using the relation $2ab\leq\lambda a^{2}+\frac{b^{2}}{\lambda}$ for any $\lambda>0$ , the last term in the rightmost expression in (134) can be estimated as

[TABLE]

Putting (135) in (134) yields

[TABLE]

Combining (130), (131)-(132), (133) and (136) we get

[TABLE]

We now estimate the coefficient $q_{k,\tau,\mu}(L)(1+\alpha_{k,\min}\epsilon_{k,\min})$ in (137). In view of (129), we have

[TABLE]

Assumption 10(ii) and $0<H_{k,\tau}=4[1+\beta_{k,\min}(2-\beta_{k,\max})\tau]\leq 4(1+\tau)$ guarantee that

[TABLE]

Since $\mu\in(0,1)$ is arbitrary, we can ensure the existence of $d\in(0,1)$ such that

[TABLE]

for all sufficiently large $k$ . Next we show that $q_{k,\tau,\mu}(L)\in(0,1)$ for large $k$ . Indeed, from (139) and $d\in(0,1)$ we have that $1<2-c_{k}<2$ for large enough $k$ , so that we obtain from (138),

[TABLE]

Finally, $\lim_{k\rightarrow\infty}\alpha_{k,\min}\epsilon_{k,\min}=0$ by Assumption 10(ii), so that (140) implies that $q_{k,\tau,\mu}(L)\in(0,1)$ for sufficiently large $k$ . Using this fact and (139) we get the following estimate:

[TABLE]

using (139) in the last inequality.

Combining (137), (141) and $A_{k,\tau}=\beta_{k,\min}(2-\beta_{k,\max})G_{\tau}^{-1}$ , we obtain

[TABLE]

for all sufficiently large $k$ , with $a_{k}:=\alpha_{k,\min}\epsilon_{k,\min}(1-d)$ and

[TABLE]

From (141) and $d\in(0,1)$ , we conclude that $a_{k}\in[0,1]$ , while from Assumption 10(iii) we have that $\sum_{k}a_{k}=\infty$ . From Assumption 10(iv) and (143), we also get that $\sum_{k}b_{k}<\infty$ . Finally, using the definitions of $\Gamma_{k}$ and $\mathsf{B}_{k}$ , we obtain from (143):

[TABLE]

for some positive constants $C_{1}$ , $C_{2}$ , $C_{3}$ and $C_{4}$ . Therefore, we get $\lim_{k\rightarrow\infty}b_{k}/a_{k}=0$ from Assumption 10(ii) and (v). These conditions, Theorem 2 and (142) imply that $\lim_{k\rightarrow\infty}\|x^{k}-t^{k-1}\|=0$ almost surely. The result follows from this fact and Lemma 8. ∎
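The final step relies on a recursion of the type $y_{k+1}\leq(1-a_{k})y_{k}+b_{k}$ with $a_{k}\in[0,1]$, $\sum_{k}a_{k}=\infty$ and $b_{k}/a_{k}\rightarrow 0$. The sketch below is a deterministic analogue only, since Theorem 2 handles the conditional-expectation version. The sequences `a` and `b` are hypothetical choices that satisfy these three conditions together with $\sum_{k}b_{k}<\infty$; they are not the sequences of (142)-(143).

```python
def run_recursion(y0, a, b, steps):
    """Iterate y_{k+1} = (1 - a(k)) * y_k + b(k) and return the last value."""
    y = y0
    for k in range(steps):
        y = (1.0 - a(k)) * y + b(k)
    return y

# Hypothetical sequences: a_k lies in (0, 1] and is not summable,
# b_k is summable, and b_k / a_k = (k + 2) ** -0.9 tends to zero.
a = lambda k: (k + 2) ** -0.6
b = lambda k: (k + 2) ** -1.5

for steps in (10, 1000, 20000):
    print(steps, run_recursion(1.0, a, b, steps))
```

The printed values shrink as the horizon grows: the non-summable $a_{k}$ erase the initial condition, while $b_{k}/a_{k}\rightarrow 0$ controls the residual level.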

4.6 Convergence rate analysis

Next we give feasibility and solvability convergence rates. The feasibility rate will be given in terms of the metric $\operatorname*{d}(\cdot)^{2}:=\operatorname*{d}(\cdot,X)^{2}$ evaluated at

[TABLE]

i.e., the ergodic average of the iterates with weights $\mathsf{B}_{k}=\beta_{k,\min}(2-\beta_{k,\max})$ . Assuming that $X$ is compact (but allowing the hard constraint $X_{0}$ to be unbounded), the solvability convergence rate will be given in terms of the dual gap function $\mathsf{G}$ , defined in (24), evaluated at

[TABLE]

which is the ergodic average of the feasible projections of the iterates with weights $\alpha_{k,\max}$ . We shall use the notation $\overline{x}^{k}:=\Pi(x^{k})$ for $k\in\mathbb{N}_{0}$ .
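Both rates are stated at weighted ergodic averages. As a minimal sketch, assuming the usual normalization $\sum_{i}w_{i}p^{i}/\sum_{i}w_{i}$ (the displayed definitions above fix the exact form), such an average can be computed either in batch or in a streaming fashion that does not store past iterates. The function names are illustrative; the weights would be $\mathsf{B}_{i}$ for the iterates $x^{i}$ and $\alpha_{i,\max}$ for the projections $\overline{x}^{i}$.

```python
from typing import List, Sequence

def weighted_ergodic_average(points: Sequence[Sequence[float]],
                             weights: Sequence[float]) -> List[float]:
    """Return sum_i w_i * p_i / sum_i w_i for vectors p_i and weights w_i."""
    if not points or len(points) != len(weights):
        raise ValueError("need exactly one weight per point")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    dim = len(points[0])
    acc = [0.0] * dim
    for p, w in zip(points, weights):
        for c in range(dim):
            acc[c] += w * p[c]
    return [v / total for v in acc]

def running_ergodic_average(points, weights):
    """Streaming version for positive weights: no past iterates are stored."""
    avg, total = None, 0.0
    for p, w in zip(points, weights):
        total += w
        if avg is None:
            avg = [float(c) for c in p]
        else:
            step = w / total  # share of the new point in the updated average
            avg = [v + step * (c - v) for v, c in zip(avg, p)]
    return avg

iterates = [[0.0, 0.0], [2.0, 4.0]]
weights = [1.0, 3.0]
print(weighted_ergodic_average(iterates, weights))
print(running_ergodic_average(iterates, weights))
```

The streaming form is the natural one in an online setting, since each agent only needs the current average and the accumulated weight.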

In the remainder of this subsection we recall definitions (35), (38), (110)-(112) and the ones given in Assumption 10. We first present the feasibility rate. In order to facilitate the presentation, we define some constants. Given $\tau,\mu\in(0,1)$ and $R>0$ we set

[TABLE]

Theorem 8 (Feasibility rate).

Suppose Assumptions 8-10 hold. Then given $\tau,\mu\in(0,1)$ , for all $k\in\mathbb{N}_{0}$ ,

[TABLE]

Proof.

We recall relation (130) in the proof of Theorem 7. Instead of using (132), we bound the left hand side of (132) by

[TABLE]

using the facts that $2ab\leq\lambda a^{2}+\lambda^{-1}b^{2}$ with $\lambda=A_{k,\tau}/2$ , $a=\operatorname*{d}(x^{k})$ and $b=I_{t}\alpha_{k,\max}$ .

We combine (130), (131), (133) and (136) with (147), take total expectation and sum from $0$ to $k$ in order to get

[TABLE]

In view of the convexity of $y\mapsto\operatorname*{d}(y)^{2}$ and the linearity of the expectation operator, we have

[TABLE]

Relations (148)-(149) prove the required claim. ∎

Next we present the solvability rate assuming that $X$ is compact. We will need the following definitions: for $\tau,\mu\in(0,1)$ and $R>0$ ,

[TABLE]

We start with an intermediate lemma.

Lemma 9 (Feasibility error control).

For any $I>0$ and $k\in\mathbb{N}_{0}$ ,

[TABLE]

Proof.

For $0\leq\ell\leq k$ , define

[TABLE]

We have

[TABLE]

using the fact that $Q_{k-1},x^{k}\in\mathcal{F}_{k}$ in the equality, (131) in the first inequality and the fact that $2ab\leq\lambda a^{2}+\lambda^{-1}b^{2}$ with $a:=I\alpha_{k,\max}$ , $b:=\operatorname*{d}(x^{k})$ and $\lambda:=G_{\tau}/\mathsf{B}_{k}$ in the second inequality. We then take $\mathbb{E}[\cdot|\mathcal{F}_{k-1}]$ in (153) and use the fact that $\mathbb{E}[\mathbb{E}[\cdot|\mathcal{F}_{k}]|\mathcal{F}_{k-1}]=\mathbb{E}[\cdot|\mathcal{F}_{k-1}]$ in order to obtain

[TABLE]

Proceeding by induction as in (153)-(154), we get

[TABLE]

Taking total expectation in (156) and using the fact that $\mathbb{E}[\mathbb{E}[\cdot|\mathcal{F}_{0}]]=\mathbb{E}[\cdot]$ , we prove the claim. ∎

Theorem 9 (Solvability rate).

Suppose that Assumptions 8-10 hold. Then, given $\tau,\mu\in(0,1)$ , for all $k\in\mathbb{N}_{0}$ ,

[TABLE]

Proof.

We recall relation (128) in the proof of Theorem 7, where $\varsigma^{k}$ is defined in (104). Regarding the second line of (128), we have for any $x\in X$ ,

[TABLE]

using the Cauchy-Schwarz inequality and the definitions of $M_{X}$ and $\operatorname*{\mathcal{D}}(X)$.

We set $Q(x,y):=\langle T(x),y-x\rangle$ so that $\mathsf{G}(y):=\sup_{x\in X}Q(x,y)$ as in (24). Using (158) in (128) and then summing from $0$ to $k$, we get for all $x\in X$,

[TABLE]

where the last line of (128) has been bounded using the definition of $I_{X}$ .

The total expectation of the term in the first line of (159) is bounded above by

[TABLE]

where in the first line we used Lemma 4, $\|x^{i}-x\|^{2}\leq 2\operatorname*{d}(x^{i})^{2}+2\operatorname*{\mathcal{D}}(X)^{2}$, $\|x-\overline{x}^{i}\|\leq\operatorname*{\mathcal{D}}(X)$ and $\|x\|\leq M_{X}$ for all $x\in X$ and $0\leq i\leq k$, in the second line we used the property $\mathbb{E}\{\mathbb{E}\{\cdot|\mathcal{F}_{i}\}\}=\mathbb{E}\{\cdot\}$ and $x^{i}\in\mathcal{F}_{i}$, and in the third line we used $\mathbb{E}\left[h_{i,\tau,\mu}(L(v^{i}))\big{|}\mathcal{F}_{i}\right]=\mathbb{E}\left[h_{i,\tau,\mu}(L(v^{i}))\right]=h_{i,\tau,\mu}(L)$ (using Assumption 4).

We will now bound the last term in the right hand side of (159). We define

[TABLE]

We define $\{u^{k}\}$ recursively as follows. Take any $u^{0}\in X$ and set, for $k\in\mathbb{N}_{0}$ ,

[TABLE]

Note that $u^{k}\in\mathcal{F}_{k}$ . We write, for all $k\in\mathbb{N}_{0}$ ,

[TABLE]

Note that for all $k\in\mathbb{N}_{0}$ ,

[TABLE]

which follows from $u^{k},x^{k}\in\mathcal{F}_{k}$ and $\mathbb{E}[\varsigma^{k}]=0$ (Assumption 4).

Concerning the first term in the right hand side of (161), we have

[TABLE]

using Lemma 1(iii) with the definition of $u^{i+1}$ and $2ab\leq a^{2}+b^{2}$ with $a:=\|u^{i+1}-u^{i}\|$ and $b:=D(\alpha_{i})\overline{\varsigma}^{i}$ in the first inequality. Summing (163) from $i=0$ to $k$ and then taking total expectation in (161) we get

[TABLE]

using the fact that $\|u^{0}-x\|\leq\operatorname*{\mathcal{D}}(X)$ and (162). Regarding the second term in the right hand side of (164), we have

[TABLE]

using the Lipschitz continuity of $F(v^{k},\cdot)$ and $T$ , $\overline{\varsigma}^{i}=F(v^{i},\overline{x}^{i})-F(v^{i},\overline{x}^{0})+T(\overline{x}^{0})-T(\overline{x}^{i})+F(v^{i},\overline{x}^{0})-T(\overline{x}^{0})$ , $(a+b+c)^{2}\leq 3a^{2}+3b^{2}+3c^{2}$ and $\overline{x}^{i}\in\mathcal{F}_{i}$ in the first inequality and that $\mathbb{E}[L(v^{i})^{2}|\mathcal{F}_{i}]=L^{2}$ and $\|\overline{x}^{i}-\overline{x}^{0}\|\leq\operatorname*{\mathcal{D}}(X)$ in the second inequality. The third term in the right hand side of (164) is equal to

[TABLE]

using the Cauchy-Schwarz inequality and the fact that $\|x-u^{i}\|\leq\operatorname*{\mathcal{D}}(X)$ in the first inequality, the Lipschitz continuity of $F(v^{k},\cdot)$ and $T$ in the second inequality, and that $\mathbb{E}[L(v^{i})|\mathcal{F}_{i}]\leq L$ and $x^{i}\in\mathcal{F}_{i}$ in the third inequality.

From the convexity of $y\mapsto Q(x,y)$ , we get

[TABLE]

We are now ready to prove the claim. We take total expectation in (159) and combine it with (160) and (164)-(167). In order to complete the proof, we use the obtained relation, combine the expectation of the fifth term

[TABLE]

in the right hand side of (159) with (166) and use Lemma 9 with $I:=I_{X}+2L\operatorname*{\mathcal{D}}(X)$ in order to obtain the final bound

[TABLE]

∎

Corollary 6 (Solvability and feasibility rates: asynchronous parameters).

Suppose that Assumptions 8-10 hold. Take stepsizes and regularization parameters as specified in Corollary 5. Then Theorem 7 and the following feasibility rate hold:

[TABLE]

If additionally $X$ is compact, the following solvability rate holds: for any $\delta\in(0,\frac{1}{2})$ ,

[TABLE]

Proof.

The stated stepsizes and regularization parameters of Corollary 5 satisfy Assumption 10, so that a.s.-convergence follows from Theorem 7. In the sequel we fix $\mu,\tau\in(0,1)$ .

We first establish the feasibility rate. We have

[TABLE]

The first inequality in (168) follows from (141), which implies that $f_{k,\tau,\mu}$ is negative for all sufficiently large $k$ . The remaining inequalities in (168)-(169) follow from Corollary 5 and from the boundedness of $q_{k,\tau,\mu}(L)$ (see (140) in Theorem 7). The claimed feasibility rate follows from (168)-(169), Theorem 8 and the fact that $\sum_{i=0}^{k}\mathsf{B}_{i}=\min_{j\in[m]}\beta_{j}(2-\max_{j\in[m]}\beta_{j})k$ .

We now establish the solvability rate. We have

[TABLE]

Also, $h_{k,\tau,\mu}(L)$ is negative for sufficiently large $k$ (as shown by relation (140)) so $\sum_{i=0}^{\infty}h_{i,\tau,\mu}(L)\left\{\mathbb{E}\left[\operatorname*{d}(x^{i})^{2}\right]+\operatorname*{\mathcal{D}}(X)^{2}\right\}<\infty$ . This, (168)-(170) and Theorem 9 prove the claim on the solvability rate. ∎

Appendix

Proof of Proposition 1:

Proof.

Suppose that (29) holds and take $x^{*}\in X^{*}$. If $\mathbb{T}_{X}(x^{*})\cap\mathbb{N}_{X^{*}}(x^{*})=\{0\}$, then (26) holds trivially. Otherwise, take $d\in\mathbb{T}_{X}(x^{*})\cap\mathbb{N}_{X^{*}}(x^{*})$ with $d\neq 0$. Since $d\in\mathbb{N}_{X^{*}}(x^{*})$, the definition of $\mathbb{N}_{X^{*}}(x^{*})$ implies that $X^{*}$ is a subset of the halfspace $H_{d}^{-}:=\{y:\langle d,y-x^{*}\rangle\leq 0\}$. In view of (20) and $d\in\mathbb{T}_{X}(x^{*})$, there exist sequences $d^{k}\in\mathbb{R}^{n}$, $t_{k}>0$ such that $x^{*}+t_{k}d^{k}\in X$, $d^{k}\rightarrow d$ and $t_{k}\rightarrow 0$. We claim that, taking a subsequence if needed,

[TABLE]

for all $k$ . Indeed, otherwise we would have

[TABLE]

for large enough $k$ . Dividing (172) by $t_{k}$ and letting $k\rightarrow\infty$ we get $d=0$ which entails a contradiction. Hence, (171) holds. From (29), $x^{*}\in X^{*}$ and $x^{*}+t_{k}d^{k}\in X$ we get

[TABLE]

using (171) and the fact that $X^{*}\subset H_{d}^{-}$ in the second inequality. Dividing (173) by $t_{k}$ and letting $k\rightarrow\infty$ , we conclude that (26) holds for $d$ .

Now suppose that (26) holds and that $T$ is constant on $X^{*}$ . Take $x\in X$ , $x^{*}\in X^{*}$ and let $\bar{x}:=\Pi_{X^{*}}(x)$ . Since $x,\bar{x}\in X$ and $X$ is closed and convex, we have that $x-\bar{x}\in\mathbb{T}_{X}(\bar{x})$ , using the first equality in (21). Since $T$ is monotone and $X$ is closed and convex, $X^{*}$ is closed and convex (see e.g. Facchinei and Pang [17], Theorem 2.3.5). From this fact, the fact that $\bar{x}=\Pi_{X^{*}}(x)$ and Lemma 1(i), we obtain that $x-\bar{x}\in\mathbb{N}_{X^{*}}(\bar{x})$ , using the definition of the polar cone. Thus, $x-\bar{x}\in\mathbb{T}_{X}(\bar{x})\cap\mathbb{N}_{X^{*}}(\bar{x})$ . We conclude from (26) that

[TABLE]

Since $T$ is constant on $X^{*}$ , we have

[TABLE]

using the fact that $\langle T(x^{*}),x^{*}-\bar{x}\rangle\leq 0$ , which holds because $x^{*}\in X^{*}$ and $\bar{x}\in X$ . The desired claim (29) follows from (175) and (174). ∎

Bibliography

[1] Auslender, A. and Teboulle, M. (2005) Interior projection-like methods for monotone variational inequalities, Mathematical Programming, Ser. A, Vol. 104, pp. 39–68.
[2] Bach, F. and Moulines, E. (2011) Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning, Advances in Neural Information Processing Systems (NIPS).
[3] Bauschke, H.H. (2001) Projection algorithms: results and open problems. In: Butnariu, D., Censor, Y., Reich, S. (eds.) Inherently Parallel Algorithms in Feasibility and Optimization and their Applications, Elsevier, Amsterdam, pp. 11–22.
[4] Bauschke, H.H. and Borwein, J.M. (1996) On projection algorithms for solving convex feasibility problems, SIAM Review, Vol. 38, pp. 367–426.
[5] Bauschke, H.H., Combettes, P.L. and Luke, D.R. (2003) Hybrid projection-reflection method for phase retrieval, Journal of the Optical Society of America A, Vol. 20, pp. 1025–1034.
[6] Bello Cruz, J.Y. and Iusem, A.N. (2012) An explicit algorithm for monotone variational inequalities, Optimization, Vol. 61, pp. 855–871.
[7] Bello Cruz, J.Y. and Iusem, A.N. (2010) Convergence of direct methods for paramonotone variational inequalities, Computational Optimization and Applications, Vol. 46, pp. 247–263.
[8] Bello Cruz, J.Y. and Iusem, A.N. (2015) Full convergence of an approximate projections method for nonsmooth variational inequalities, Mathematics and Computers in Simulation, Vol. 114, pp. 2–13.
