Incremental constraint projection methods for monotone stochastic variational inequalities
Alfredo Iusem, Alejandro Jofré, Philip Thompson

TL;DR
This paper introduces an incremental constraint projection method for stochastic variational inequalities with monotone operators, achieving convergence and rate guarantees suitable for large-scale, online, and distributed applications.
Contribution
It proposes a novel incremental projection approach combining stochastic approximation with constraint sampling, extending to weak-sharp and monotone cases with convergence rates.
Findings
Achieves $O(1/k)$ feasibility rate in mean squared distance.
Provides $O(1/\sqrt{k})$ solvability rate for weak-sharp cases.
Extends to distributed stochastic Nash games with near-optimal convergence.
Incremental constraint projection methods for monotone stochastic variational inequalities
A. N. Iusem, Instituto Nacional de Matemática Pura e Aplicada (IMPA), [email protected]
Alejandro Jofré, Center for Mathematical Modeling (CMM) & DIM, [email protected]
Philip Thompson, Instituto Nacional de Matemática Pura e Aplicada (IMPA), [email protected]
Abstract
We consider stochastic variational inequalities with monotone operators defined as the expected value of a random operator. We assume the feasible set is the intersection of a large family of convex sets. We propose a method that combines stochastic approximation with incremental constraint projections, meaning that at each iteration, a step similar to some variant of a deterministic projection method is taken after the random operator is sampled and a component of the intersection defining the feasible set is chosen at random. Such a sequential scheme is well suited for applications involving large data sets, online optimization and distributed learning. First, we assume that the variational inequality is weak-sharp. We provide asymptotic convergence, a feasibility rate of $O(1/k)$ in terms of the mean squared distance to the feasible set and a solvability rate of $O(1/\sqrt{k})$ (up to first order logarithmic terms) in terms of the mean distance to the solution set for a bounded or unbounded feasible set. Then, we assume just monotonicity of the operator and introduce an explicit iterative Tykhonov regularization to the method. We consider Cartesian variational inequalities so as to encompass the distributed solution of stochastic Nash games or multi-agent optimization problems under a limited coordination. We provide asymptotic convergence, a feasibility rate of $O(1/k)$ in terms of the mean squared distance to the feasible set and, in the case of a compact set, a near-optimal solvability convergence rate of $O\left(k^{\delta}\ln k/\sqrt{k}\right)$ in terms of the mean dual gap-function of the SVI for arbitrarily small $\delta>0$.
1 Introduction
The standard (deterministic) variational inequality problem, which we will denote as VI$(T,X)$ or simply VI, is defined as follows: given a closed and convex set $X\subset\mathbb{R}^n$ and a single-valued operator $T:\mathbb{R}^n\to\mathbb{R}^n$, find $x^*\in X$ such that, for all $x\in X$,
$$\langle T(x^*),x-x^*\rangle\ge 0.\tag{1}$$
We shall denote by $X^*$ the solution set of VI$(T,X)$. The variational inequality problem includes many interesting special classes of variational problems with applications in economics, game theory and engineering. The basic prototype is smooth convex optimization, where $T$ is the gradient of a smooth function. Other classes of problems are posed as variational inequalities which are not equivalent to optimization problems, such as complementarity problems (with $X=\mathbb{R}^n_+$), systems of equations (with $X=\mathbb{R}^n$), saddle-point problems and many different classes of equilibrium problems.
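In the stochastic case, we start with a measurable space $(\Xi,\mathcal{G})$, a measurable (random) operator $F:\Xi\times\mathbb{R}^n\to\mathbb{R}^n$ and a random variable $\xi:\Omega\to\Xi$ defined on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ which induces an expectation $\mathbb{E}$ and a distribution $\mathbb{P}_\xi$ of $\xi$. When no confusion arises, we use $\xi$ to also denote a random sample $\xi\in\Xi$. We assume that for every $x\in\mathbb{R}^n$, $F(\xi,x)$ is an integrable random vector. The solution criterion analyzed in this paper consists of solving VI$(T,X)$ as defined by (1), where $T$ is the expected value of $F(\xi,\cdot)$, i.e., for any $x\in\mathbb{R}^n$,
$$T(x)=\mathbb{E}[F(\xi,x)].\tag{2}$$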
Precisely, the definition of stochastic variational inequality problem (SVI) is:
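Definition 1 (SVI).
Assuming that $T:\mathbb{R}^n\to\mathbb{R}^n$ is given by $T(x)=\mathbb{E}[F(\xi,x)]$ for all $x\in\mathbb{R}^n$, the SVI problem consists of finding $x^*\in X$, such that $\langle T(x^*),x-x^*\rangle\ge 0$ for all $x\in X$.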
Such a formulation of SVI is also called the expected value formulation. It goes back to Gürkan et al. [19], as a natural generalization of stochastic optimization (SP) problems. Recently, a more general definition of stochastic variational inequality was considered in Chen et al. [15] where the feasible set is also affected by randomness, that is, $X$ is a random set-valued function.
Methods for the deterministic VI$(T,X)$ have been extensively studied (see Facchinei and Pang [17]). If $T$ is fully available then SVI can be solved by these methods. As in the case of SP, the SVI in Definition 1 becomes very different from the deterministic setting when $T$ is not available. This is often the case in practice due to expensive computation of the expectation in (2), unavailability of the distribution of $\xi$ or the lack of a closed form for $F(\xi,\cdot)$. This requires sampling the random variable $\xi$ and the use of values of $F(\eta,x)$ given a sample $\eta$ of $\xi$ and a current point $x\in\mathbb{R}^n$ (a procedure often called a "stochastic oracle" call). In this context, there are two current methodologies for solving the SVI problem: sample average approximation (SAA) and stochastic approximation (SA). In this paper we focus on the SA approach.
The SA methodology for SP or SVI can be seen as a projection-type method where the exact mean operator $T$ is replaced along the iterations by a random sample of $F$. This approach induces a stochastic error $F(\xi,x)-T(x)$ for $x\in X$ along the trajectory of the method. When $X=\mathbb{R}^n$, Definition 1 becomes the stochastic equation problem (SE): under (2), almost surely find $x^*\in\mathbb{R}^n$ such that $T(x^*)=0$. The SA methodology was first proposed by Robbins and Monro in [40] for the SE problem in the case in which $T$ is the gradient of a strongly convex function under specific conditions. Since this fundamental work, SA approaches to SP and, more recently, to SVI have been carried out in Jiang and Xu [23], Juditsky et al. [24], Yousefian et al. [46], Koshal et al. [29], Wang and Bertsekas [43], Chen et al. [14], Yousefian et al. [47], Kannan and Shanbhag [25], Yousefian et al. [45]. See Bach and Moulines [2] for the stochastic approximation procedure in machine learning and online optimization.
A frequent additional difficulty is the possibly complicated structure of the feasible set $X$. Often, the feasible set takes the form
$$X=\bigcap_{i\in\mathcal{I}}X_i,$$
where $\{X_i:i\in\mathcal{I}\}$ is an arbitrary family of closed convex sets. There are different motivations for considering the design of algorithms which, at every iteration, use only a component $X_i$ rather than the whole feasible set $X$. First, in the case of projection methods, when the orthogonal projection onto each $X_i$, namely $\Pi_{X_i}$, is much easier to compute than the projection onto $X$, namely $\Pi_X$, a natural idea consists of replacing, at iteration $k$, $\Pi_X$ by one of the $\Pi_{X_i}$'s, say $\Pi_{X_{i_k}}$, or even by an approximation of $\Pi_{X_{i_k}}$. This occurs, for instance, when $X$ is a polyhedron and the $X_i$'s are halfspaces. This procedure is the basis of the so-called sequential or parallel row action methods for solving systems of equations (see Censor [12]) and methods for the feasibility problem, useful in many applications, including image restoration and tomography (see, e.g., Bauschke et al. [5], Cegielski and Suchocka [11]). Second, in some cases $X$ is not known a priori, but is rather revealed through the random realizations of its components $X_i$. Such problems arise in fair rate allocation problems in wireless networks where the channel state is unknown but the channel states are observed in time (see e.g. Nedić [32] and Huang et al. [20]). Third, in some cases $X$ is known but the number of constraints is prohibitively large (e.g., in machine learning and signal processing).
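The first motivation can be illustrated with a minimal sketch, assuming a polyhedron described by a handful of halfspaces and a uniformly random control. The names `project_halfspace`, `random_halfspace_projections` and the unit-square example are hypothetical and are not taken from the paper; the code only shows that projecting onto one cheap component at a time can reach the intersection.

```python
import random

def project_halfspace(x, a, b):
    """Euclidean projection of x onto the halfspace {z : <a, z> <= b}."""
    violation = sum(ai * xi for ai, xi in zip(a, x)) - b
    if violation <= 0.0:
        return list(x)  # already inside: the projection is x itself
    norm2 = sum(ai * ai for ai in a)
    return [xi - violation * ai / norm2 for ai, xi in zip(a, x)]

def random_halfspace_projections(x0, halfspaces, iters=2000, seed=0):
    """Project repeatedly onto one randomly chosen halfspace per iteration."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        a, b = rng.choice(halfspaces)  # random control: pick one component
        x = project_halfspace(x, a, b)
    return x

def max_violation(x, halfspaces):
    """Largest constraint violation <a, x> - b over all halfspaces."""
    return max(sum(ai * xi for ai, xi in zip(a, x)) - b for a, b in halfspaces)

# Illustrative polyhedron: the unit square [0, 1]^2 written as four halfspaces.
SQUARE = [((1.0, 0.0), 1.0), ((-1.0, 0.0), 0.0),
          ((0.0, 1.0), 1.0), ((0.0, -1.0), 0.0)]
x_feas = random_halfspace_projections([5.0, -3.0], SQUARE)
```

Each step costs one inner product, whereas a projection onto the full intersection would in general require solving a quadratic program.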
1.1 Projection methods
In the deterministic setting (1), the classical projection method for VI, akin to the projected gradient method for convex optimization, is
$$x^{k+1}=\Pi_X\left[x^k-\alpha_kT(x^k)\right],\tag{3}$$
where $\Pi_X$ is the projection operator onto $X$ and $\{\alpha_k\}$ is an exogenous sequence of positive stepsizes. Convergence of this method is guaranteed assuming $T$ is strongly monotone, Lipschitz continuous and the stepsizes satisfy $\inf_k\alpha_k>0$ and $\sup_k\alpha_k<2\mu/L^2$, where $\mu$ is the modulus of strong monotonicity and $L$ is the Lipschitz constant, see e.g. Facchinei and Pang [17].
The strong monotonicity assumption is quite demanding, and convergence of (3) is not guaranteed when the operator is just monotone. In order to deal with this situation, Korpelevich [28] proposed the extra-gradient algorithm
$$z^k=\Pi_X\left[x^k-\alpha_kT(x^k)\right],\qquad x^{k+1}=\Pi_X\left[x^k-\alpha_kT(z^k)\right],\tag{4}$$
in which an additional auxiliary projection step is introduced. Convergence of the method is guaranteed when the stepsizes satisfy $0<\inf_k\alpha_k\le\sup_k\alpha_k<1/L$, where $L$ is the Lipschitz constant of $T$. In Nemirovski [35], the extra-gradient method was generalized and convergence rates were established assuming compactness of the feasible set.
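The gap between the two methods can be seen on a small example. The sketch below is illustrative only and makes its own assumptions: a skew (rotation) operator $T(x)=(x_2,-x_1)$, which is monotone and 1-Lipschitz but not strongly monotone, the box $[-1,1]^2$ as feasible set (so that the unique solution is the origin), and a constant stepsize $0.5<1/L$. The function names are hypothetical.

```python
import math

def T(x):
    """Skew operator: monotone and 1-Lipschitz, but not strongly monotone."""
    return (x[1], -x[0])

def proj_box(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^2."""
    return tuple(min(max(xi, lo), hi) for xi in x)

def projection_method(x, alpha, iters):
    """Classical projection method: one operator step, one projection."""
    for _ in range(iters):
        t = T(x)
        x = proj_box((x[0] - alpha * t[0], x[1] - alpha * t[1]))
    return x

def extragradient(x, alpha, iters):
    """Extra-gradient method: auxiliary step to z, then step from x using T(z)."""
    for _ in range(iters):
        t = T(x)
        z = proj_box((x[0] - alpha * t[0], x[1] - alpha * t[1]))
        tz = T(z)
        x = proj_box((x[0] - alpha * tz[0], x[1] - alpha * tz[1]))
    return x

x0 = (1.0, 0.5)
norm_pm = math.hypot(*projection_method(x0, 0.5, 300))  # stays away from 0
norm_eg = math.hypot(*extragradient(x0, 0.5, 300))      # shrinks toward 0
```

On this example the plain projection iterates keep circling near the boundary of the box, while the extra-gradient iterates contract to the solution.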
Observe that the projection method (3) and the extra-gradient method (4) are explicit, i.e., the formula for obtaining $x^{k+1}$ is an explicit one, up to the computation of the orthogonal projection $\Pi_X$. An implicit approach for the solution of monotone variational inequalities consists of a Tykhonov or proximal regularization scheme (see Facchinei and Pang [17], Chapter 12). In these methods, a sequence of regularized variational inequality problems is approximately solved, one at each iteration.
As commented before, a typical case occurs when the feasible set takes the form $X=\bigcap_{i\in\mathcal{I}}X_i$ where all the $X_i$'s are closed and convex. Row action methods and alternate (or cyclic) projection algorithms for convex feasibility problems exploit the computation of projections onto the components iteratively (see Bauschke [3]). In such a case, the order in which the sets are used along the iterations, i.e. the so-called control sequence $\{i_k\}$, must be specified. Several options have been considered in the literature (such as cyclic control, almost cyclical control, most violated constraint control and random control). A negative consequence of the use of approximate projections is the need to use small stepsizes, i.e., satisfying $\sum_k\alpha_k=\infty$ and $\sum_k\alpha_k^2<\infty$, which significantly reduces the efficiency of the method. We thus have a trade-off between easier projection computation and slower convergence. Additionally, the use of approximate projections requires some condition on the feasible set, so that the projections onto the sets $X_i$ are reasonable approximations of the projection onto $X$. For this, some form of error bound, linear regularity or Slater-type conditions on the sets must be assumed (e.g., Assumption 5 in Subsection 3.2 and the comments following it). See Bauschke and Borwein [4], Deutsch and Hundal [16] and Pang [36]. Explicit methods for monotone variational inequalities using approximate projections were studied e.g. in Fukushima [18] and Censor and Gibali [13], imposing rather demanding coercivity assumptions on $T$, in Bello Cruz and Iusem [7] assuming paramonotonicity of $T$, and then in Bello Cruz and Iusem [8] assuming just monotonicity of $T$. Another method of this type, using an Armijo search as in Iusem and Svaiter [22] for determining the stepsizes, and approximate projections with the most violated constraint control, can be found in Bello Cruz and Iusem [6].
Related to row-action and alternate projective methods are the so-called incremental methods, introduced in Kibardin [27] (see also Luo and Tseng [30], Bertsekas [9], Nedić [32] and references therein). These methods are used for the minimization of a large sum of convex functions, e.g. in machine learning applications. In such a context, instead of using the gradient of the sum, the gradient of one of the terms is selected iteratively under different control rules. In Polyak [38], Polyak [39] and Nedić [32], incremental constraint methods with random control rules were proposed for minimizing a convex function over an intersection of a large number of convex sets. The feasible set takes the form
$$X=X_0\cap\Big(\bigcap_{i\in\mathcal{I}}X_i\Big),\tag{5}$$
where $\{X_0\}\cup\{X_i:i\in\mathcal{I}\}$ is a collection of closed and convex subsets of $\mathbb{R}^n$. The hard constraint $X_0$ is assumed to have easily computable projections. The soft constraint $X_i$, for a given $i\in\mathcal{I}$, has the form:
$$X_i=\{x\in\mathbb{R}^n:g_i(x)\le 0\},\tag{6}$$
for some convex function $g_i$ with positive part $g_i^+$ and easily computable subgradients. The method in Nedić [32] is given by:
$$y^k=x^k-\alpha_k\nabla f(x^k),\tag{7}$$
$$x^{k+1}=\Pi_{X_0}\left[y^k-\beta_k\frac{g^+_{\omega_k}(y^k)}{\|d^k\|^2}d^k\right],\tag{8}$$
where $\{\alpha_k\}$ and $\{\beta_k\}$ are positive stepsizes, $d^k\in\partial g^+_{\omega_k}(y^k)\setminus\{0\}$ if $g^+_{\omega_k}(y^k)>0$, and $d^k=d$ for any $d\in\mathbb{R}^n\setminus\{0\}$ if $g^+_{\omega_k}(y^k)=0$. In the method (7)-(8), $\{\omega_k\}$ is a random control sequence taking values in $\mathcal{I}$ and satisfying certain conditions and $f$ is a convex smooth function (the non-smooth case is also analyzed). Together with row-action and alternate projection methods, incremental constraint projection methods can be viewed as the dual version of (standard) incremental methods. More recently, stochastic approximation was incorporated into incremental constraint projection methods for stochastic convex minimization problems in Wang and Bertsekas [44].
1.2 Stochastic approximation methods
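The first SA method for SVI was analyzed in Jiang and Xu [23]. Their method is:
$$x^{k+1}=\Pi_X\left[x^k-\alpha_kF(\xi^k,x^k)\right],\tag{9}$$
where $\Pi_X$ is the Euclidean projection onto $X$, $\{\xi^k\}$ is a sample of $\xi$ and $\{\alpha_k\}$ is a sequence of positive steps. The a.s. convergence is proved assuming $L$-Lipschitz continuity of $T$, strong monotonicity or strict monotonicity of $T$, stepsizes satisfying $\sum_k\alpha_k=\infty$ and $\sum_k\alpha_k^2<\infty$ (with $0<\alpha_k<2\mu/L^2$ in the case where $T$ is $\mu$-strongly monotone) and an unbiased oracle with uniform variance, i.e., there exists $\sigma>0$ such that for all $x\in X$,
$$\mathbb{E}\left[\|F(\xi,x)-T(x)\|^2\right]\le\sigma^2.\tag{10}$$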
After the above-mentioned work, recent research on SA methods for SVI has been developed in Juditsky et al. [24], Yousefian et al. [45, 46, 47], Koshal et al. [29], Chen et al. [14], Kannan and Shanbhag [25]. Two of the main concerns in these papers were the extension of the SA approach to the general monotone case and the derivation of (optimal) convergence rate and complexity results with respect to known metrics associated to the VI problem. In order to analyze the monotone case, SA methodologies based on the extragradient method of Korpelevich [28], the mirror-prox algorithm of Nemirovski [35] and iterative Tykhonov and proximal regularization procedures (see Kannan and Shanbhag [26]) were used in these works. Other objectives were the use of incremental constraint projections in the case of difficulties accessing the feasible set in Wang and Bertsekas [43], the convergence analysis in the absence of the Lipschitz constant in Yousefian et al. [45, 46, 47], and the distributed solution of Cartesian variational inequalities in Yousefian et al. [46], Koshal et al. [29].
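A projected SA iteration of the type (9) can be sketched as follows. This is a toy illustration under stated assumptions, not the setting of any cited paper: the oracle is $F(\xi,x)=x-c+\xi$ with zero-mean uniform noise, so that $T(x)=x-c$ is strongly monotone, the feasible set is the box $[0,1]^2$, and the stepsizes are $\alpha_k=1/(k+1)$. The names `oracle`, `sa_projection` and the vector `C` are hypothetical.

```python
import random

def proj_box(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(max(xi, lo), hi) for xi in x]

def oracle(x, c, rng):
    """Stochastic oracle F(xi, x) = x - c + noise; its mean is T(x) = x - c."""
    return [xi - ci + rng.uniform(-1.0, 1.0) for xi, ci in zip(x, c)]

def sa_projection(x0, c, iters=20000, seed=0):
    """Projected stochastic approximation with diminishing stepsizes."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(iters):
        alpha = 1.0 / (k + 1)  # sum of alpha_k diverges, sum of squares converges
        f = oracle(x, c, rng)
        x = proj_box([xi - alpha * fi for xi, fi in zip(x, f)])
    return x

C = [2.0, 0.3]
x_sa = sa_projection([0.0, 0.0], C)  # should approach the projection of C onto the box
```

For this operator the solution of the variational inequality is the projection of `C` onto the box, namely $(1, 0.3)$, which the iterates approach despite never evaluating $T$ exactly.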
We finally make some comments on two recent methods upon which we make substantial improvements.
In Wang and Bertsekas [43], method (9) is improved by incorporating an incremental projection scheme, instead of exact projections. They take $X=\bigcap_{i\in\mathcal{I}}X_i$, where $\mathcal{I}$ is a finite index set, and use a random control sequence $\{\omega_k\}$, where both the random map and the control sequence are jointly sampled, giving rise to the following algorithm:
$$z^k=x^k-\alpha_kF(\xi^k,x^k),\qquad x^{k+1}=z^k-\beta_k\left(z^k-\Pi_{X_{\omega_k}}(z^k)\right),\tag{11}$$
where $\{\alpha_k\}$ and $\{\beta_k\}$ are positive stepsizes and $\{\xi^k\}$ are samples. When $\beta_k\equiv 1$, the method is the version of method (9) with incremental constraint projections. For convergence, the operator $T$ is assumed to be strongly monotone and Lipschitz-continuous, and knowledge of the strong monotonicity and Lipschitz moduli is required for computing the stepsizes. In this setting, method (11) improves upon method (7)-(8) when $X_0=\mathbb{R}^n$, $\mathcal{I}$ is finite and the projection onto each $X_i$ is easy.
Regularized iterative Tykhonov and proximal point methods for monotone stochastic variational inequalities were introduced in Koshal et al. [29]. In such methods, instead of solving a sequence of regularized variational inequality problems, the regularization parameter is updated in each iteration and a single projection step associated with the regularized problem is taken. This is desirable since (differently from the deterministic case) termination criteria are generally hard to meet in the stochastic setting. The algorithm proposed allows for a Cartesian structure on the variational inequality, so as to encompass the distributed solution of Cartesian SVIs. Namely, the feasible set has the form $X=X^1\times\cdots\times X^m$ where each Cartesian component $X^j\subset\mathbb{R}^{n_j}$ is a closed and convex set, and the random operator has components $F=(F_1,\ldots,F_m)$ with $F_j:\Xi\times\mathbb{R}^n\to\mathbb{R}^{n_j}$ for $j=1,\ldots,m$ and $n=\sum_{j=1}^mn_j$. The algorithm in Koshal et al. [29] is described as follows. Given the $k$-th iterate $x^k$ with components $x^k_j$, for $j=1,\ldots,m$, the next iterate is given by the distributed projection computations: for $j=1,\ldots,m$,
$$x^{k+1}_j=\Pi_{X^j}\left[x^k_j-\alpha_{k,j}\left(F_j(\xi^k_j,x^k)+\epsilon_{k,j}x^k_j\right)\right],\tag{12}$$
where $\{\alpha_{k,j}\}$ are the stepsize sequences, $\{\epsilon_{k,j}\}$ are the regularization parameter sequences and $\{\xi^k_j\}$ are the samples. This method is shown to converge under monotonicity and Lipschitz-continuity of $T$ and a partial coordination between the stepsize and regularization parameter sequences (see Assumption 10). The iterative proximal point method follows a similar pattern but, differently from the Tykhonov method, it requires strict monotonicity, which in particular implies uniqueness of solutions. It should be mentioned that two important classes of problems which can be formulated as stochastic Cartesian variational inequalities are the stochastic Nash equilibria and the stochastic multi-user optimization problem; see Koshal et al. [29] for a precise definition. In these problems, the $j$-th agent has access only to its constraint set $X^j$ and to $F_j$ (which depends on the other agents' decisions) so that a distributed solution of the SVI is required. Moreover, it is convenient to allow agents to update their stepsizes and regularization sequences independently, subject just to a limited coordination.
1.3 Proposed methods and contributions
In many stochastic approximation methods, the stochastic error is assumed to be bounded, demanding the use of small stepsizes with a slow performance. In this case, the use of easily computable approximate projections, instead of exact ones, can significantly improve the performance of the algorithm. Additionally, in many cases the constraint set is known, but it contains a very large number of constraints, or is not known a priori, but is rather learned over time through random samples of its constraints. An important feature of incremental constraint projection methods is that they process sample operators and sample constraints sequentially. This incremental structure is well suited for a variety of applications involving large data sets, online optimization and distributed learning. For problems that require online learning, incremental projection methods of the type (7)-(8) or (11) are practically the only option that can be used without the knowledge of all the constraints.
In view of these considerations, we wish to devise methods which incorporate incremental constraint projections with stochastic approximation of the operator. There has been only one previous work on incremental projections for SVIs, namely Wang and Bertsekas [43]. In that work, strong monotonicity of the operator and knowledge of the strong monotonicity and Lipschitz moduli were assumed. These are very demanding assumptions in practice and theory. Our first objective is to weaken this property to plain monotonicity without requiring knowledge of the Lipschitz constant. Our second objective is to use incremental constraint projections in distributed methods for multi-agent optimization and equilibrium problems arising in networks. Such joint analysis seems to be new (to the best of our knowledge, all previous works in distributed methods for such problems use exact projections). This objective is a non-trivial generalization of previously known distributed methods since, besides preserving the parallel computations of projections and the use of asynchronous agents' parameters of such methods, we wish to allow each user to project inexactly over its decision set in a random fashion and without additional coordination.
Assuming the structures (5)-(6), in the centralized case ($m=1$), we propose the following incremental constraint projection method:
$$y^k=\Pi_{X_0}\left[x^k-\alpha_k\left(F(\xi^k,x^k)+\epsilon_kx^k\right)\right],\tag{13}$$
$$x^{k+1}=\Pi_{X_0}\left[y^k-\beta_k\frac{g^+_{\omega_k}(y^k)}{\|d^k\|^2}d^k\right],\tag{14}$$
where $\{\alpha_k\}$ and $\{\beta_k\}$ are stepsize sequences, $\{\epsilon_k\}$ is the regularization parameter sequence, $\{\xi^k\}$ is the sample sequence, $\{\omega_k\}$ is the random control, and $d^k\in\partial g^+_{\omega_k}(y^k)\setminus\{0\}$ if $g^+_{\omega_k}(y^k)>0$ and $d^k=d$ for any $d\in\mathbb{R}^n\setminus\{0\}$ otherwise. We remark that the projection onto $X_0$ in (13) is dispensable if $\mathrm{dom}\,g_i=\mathbb{R}^n$ and $\{\partial g_i^+:i\in\mathcal{I}\}$ is uniformly bounded on $\mathbb{R}^n$, a condition satisfied, e.g., if the soft constraints have easily computable projections, as commented below (see Remark 1 in Subsection 2.1). The above incremental algorithm advances in such a way that the "operator step" and the "feasibility step" are updated in separate stages. In the first stage, given the current iterate $x^k$, the method advances in the direction of a sample $-F(\xi^k,x^k)$ of the random operator, producing an auxiliary iterate $y^k$. In this step, the hard constraint set $X_0$ is considered while the soft constraints are "ignored". In the second stage, a soft constraint $X_{\omega_k}$ is randomly chosen with $\omega_k\in\mathcal{I}$, and the method advances in the direction opposite to a subgradient of $g^+_{\omega_k}$ at the point $y^k$, producing the next iterate $x^{k+1}$. Thus, the method exploits simultaneously the stochastic approximation of the random operator (in the first stage) and a randomization of the incremental selection of constraint projections (in the second stage). In Section 3, this method is analyzed with no regularization, i.e., $\epsilon_k\equiv 0$, assuming that the monotone operator satisfies the weak sharpness property (see Section 2.3), while in Section 4 we consider the same method with positive regularization parameters, requiring just monotonicity of the operator.
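We make some remarks to illustrate that the mentioned framework is very general. If, for $i\in\mathcal{I}$, the Euclidean projection onto $X_i$ is easy, then we can always construct a function $g_i$ with "easy" subgradients. Indeed, defining the function $g_i(x):=d(x,X_i)$ for $x\in\mathbb{R}^n$, then $g_i$ is convex, nonnegative and finite valued over $\mathbb{R}^n$, and for any $x\notin X_i$,
$$\frac{x-\Pi_{X_i}(x)}{\|x-\Pi_{X_i}(x)\|}\in\partial g_i(x)$$
provides a subgradient which is easy to evaluate. Moreover, $\|d\|\le 1$ for all $d\in\partial g_i(x)$ and $x\in\mathbb{R}^n$. In this case, using the above directions as subgradients of $g^+_{\omega_k}$ at $y^k$, method (13)-(14) can be rewritten as
$$y^k=\Pi_{X_0}\left[x^k-\alpha_k\left(F(\xi^k,x^k)+\epsilon_kx^k\right)\right],\qquad x^{k+1}=\Pi_{X_0}\left[y^k-\beta_k\left(y^k-\Pi_{X_{\omega_k}}(y^k)\right)\right].$$
If, additionally, $X_0=\mathbb{R}^n$ and $\beta_k\equiv 1$ then the method takes the more basic form
$$x^{k+1}=\Pi_{X_{\omega_k}}\left[x^k-\alpha_k\left(F(\xi^k,x^k)+\epsilon_kx^k\right)\right].$$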
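The basic form of the method, one sampled operator step followed by a projection onto a single randomly chosen component set, can be sketched as follows. Everything specific here is an assumption made for illustration: the update is taken to be $x^{k+1}=\Pi_{X_{\omega_k}}[x^k-\alpha_k(F(\xi^k,x^k)+\epsilon_kx^k)]$, the components are the four halfspaces of the unit square, the oracle is a noisy constant operator $T(x)=(1,1)$ (a linear objective, whose solution on the square is the origin), the stepsizes are $\alpha_k=1/(k+1)$, and no regularization is used. The function names are hypothetical.

```python
import random

def project_halfspace(x, a, b):
    """Euclidean projection of x onto the halfspace {z : <a, z> <= b}."""
    violation = sum(ai * xi for ai, xi in zip(a, x)) - b
    if violation <= 0.0:
        return list(x)
    norm2 = sum(ai * ai for ai in a)
    return [xi - violation * ai / norm2 for ai, xi in zip(a, x)]

def noisy_linear_oracle(x, rng):
    """F(xi, x) = c + noise with c = (1, 1); the mean operator T(x) = c is monotone."""
    return [1.0 + rng.uniform(-1.0, 1.0), 1.0 + rng.uniform(-1.0, 1.0)]

def incremental_projection_sa(x0, oracle, halfspaces, iters,
                              seed=0, eps=lambda k: 0.0):
    """Sampled operator step, then projection onto ONE random component set."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(iters):
        alpha = 1.0 / (k + 1)          # assumed stepsize rule
        f = oracle(x, rng)             # sample of the random operator
        y = [xi - alpha * (fi + eps(k) * xi) for xi, fi in zip(x, f)]
        a, b = rng.choice(halfspaces)  # random control picks a constraint
        x = project_halfspace(y, a, b)
    return x

# Unit square [0, 1]^2 as an intersection of four halfspaces.
SQUARE = [((1.0, 0.0), 1.0), ((-1.0, 0.0), 0.0),
          ((0.0, 1.0), 1.0), ((0.0, -1.0), 0.0)]
x_ip = incremental_projection_sa([3.0, 2.0], noisy_linear_oracle, SQUARE, 50000)
```

The iterates are generally infeasible at intermediate steps, since only one constraint is enforced per iteration, but both the infeasibility and the distance to the solution shrink as the stepsizes decrease.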
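In Section 4, we analyse a distributed variant. In this setting, the feasible set has the form $X=X^1\times\cdots\times X^m$ where each Cartesian component $X^j\subset\mathbb{R}^{n_j}$ is a closed and convex set, with $F=(F_1,\ldots,F_m)$, $F_j:\Xi\times\mathbb{R}^n\to\mathbb{R}^{n_j}$ for $j=1,\ldots,m$ and $n=\sum_{j=1}^mn_j$. Moreover, we assume each Cartesian component has the constraint form
$$X^j=X^j_0\cap\Big(\bigcap_{i\in\mathcal{I}_j}X^j_i\Big),$$
where $\{X^j_0\}\cup\{X^j_i:i\in\mathcal{I}_j\}$ is a collection of closed and convex subsets of $\mathbb{R}^{n_j}$. Also, for every $i\in\mathcal{I}_j$, we assume $X^j_i$ is representable in $\mathbb{R}^{n_j}$ as
$$X^j_i=\{x\in\mathbb{R}^{n_j}:g^j_i(x)\le 0\},$$
for some convex function $g^j_i$. We thus propose the following distributed method: for each $j=1,\ldots,m$,
$$y^k_j=\Pi_{X^j_0}\left[x^k_j-\alpha_{k,j}\left(F_j(\xi^k_j,x^k)+\epsilon_{k,j}x^k_j\right)\right],\tag{17}$$
$$x^{k+1}_j=\Pi_{X^j_0}\left[y^k_j-\beta_{k,j}\frac{(g^j_{\omega_{k,j}})^+(y^k_j)}{\|d^k_j\|^2}d^k_j\right],\tag{18}$$
where, for every agent $j$, $\{\alpha_{k,j}\}$ and $\{\beta_{k,j}\}$ are stepsize sequences, $\{\epsilon_{k,j}\}$ is the regularization parameter sequence, $\{\xi^k_j\}$ is the sample sequence, $\{\omega_{k,j}\}$ is the random control and $d^k_j\in\partial(g^j_{\omega_{k,j}})^+(y^k_j)\setminus\{0\}$ if $(g^j_{\omega_{k,j}})^+(y^k_j)>0$, and $d^k_j=d$ for any $d\in\mathbb{R}^{n_j}\setminus\{0\}$ otherwise. Method (13)-(14) is the special case of (17)-(18) with $m=1$.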
We mention the following contributions of methods (13)-(14) and (17)-(18):
- (i)
Incremental constraint projection methods for plain monotone SVIs: In Wang and Bertsekas [43], incremental constraint projection methods for SVIs were proposed assuming strong monotonicity with knowledge of the strong monotonicity and Lipschitz moduli. We propose a method with incremental constraint projections for SVIs requiring just monotonicity with no knowledge of the Lipschitz constant, making our method much more general and applicable. Using explicit stepsizes, we establish almost sure asymptotic convergence, a feasibility rate of $O(1/k)$ in terms of the mean squared distance to the feasible set and, in the case of a compact set, a near-optimal solvability convergence rate of $O\left(k^{\delta}\ln k/\sqrt{k}\right)$ in terms of the mean dual gap function of the SVI for arbitrarily small $\delta>0$.
- (ii)
Incremental constraint projections in distributed methods: Distributed methods for SVIs have recently attained importance in the framework of optimization or equilibrium problems in networks. In this context, one important goal is to allow distributed computation of projections, allow agents to update their parameters independently and drop the strong or strict monotonicity property without indirect regularization, which is hard to cope with in the stochastic setting. The work in Koshal et al. [29] addresses these issues but using exact projections, and to the best of our knowledge, all previous works in distributed methods, even for convex optimization, seem to project exactly. Our main contribution in this context is to include incremental projections in distributed methods for SVI (and in particular for stochastic optimization). In this context, we allow each agent to project randomly onto simpler components of its own decision set without information about the other agents' decision sets. Importantly, we preserve all the properties in Koshal et al. [29] just mentioned. The use of incremental projections allows easier computation of projections or flexibility when the constraints are learned via an online procedure. In order to achieve such a contribution, we deal with a more refined convergence analysis and a new partial coordination assumption, not needed in the case of synchronous stepsizes or exact projections:
[TABLE]
where , and . Using explicit asynchronous stepsizes and regularization sequences, we establish a.s. asymptotic convergence, a feasibility rate of $O(1/k)$ in terms of the mean squared distance to the feasible set and, in the case of a compact feasible set, a near-optimal solvability convergence rate of $O\left(k^{\delta}\ln k/\sqrt{k}\right)$ in terms of the mean dual gap function of the SVI for arbitrarily small $\delta>0$. The partial coordination (19) appears in the rate statements as a decaying error related to the use of asynchronous stepsizes and asynchronous inexact random projections. To the best of our knowledge, even for the case of exact projections no convergence rates have been reported for iterative distributed methods for SVIs.
- (iii)
Weak sharpness property and incremental projections: The weak sharpness property for VIs was proposed in [31]. It has been used as a sufficient condition for finite convergence of algorithms for optimization and VI problems in numerous works, e.g. [31, 10]. To the best of our knowledge, the use of the weak sharpness property as a suitable property for incremental projection methods, as analyzed in this work, has not been addressed before, even for VIs or optimization problems in the deterministic setting. We use an equivalent form of weak sharpness suitable for incremental projections. The proof of such equivalence seems to be new. Using explicit stepsizes without knowledge of the sharp-modulus, we prove a.s. asymptotic convergence, a feasibility rate of $O(1/k)$ in terms of the mean squared distance to the feasible set and a solvability rate of $O(1/\sqrt{k})$ (up to first order logarithmic terms) in terms of the mean distance to the solution set, for bounded or unbounded feasible sets. We also prove that after a finite number of iterations, any solution of a stochastic optimization problem with linear objective and the same feasible set as the SVI is a solution of the original SVI. We note that the weak sharpness property differs from strong monotonicity, allowing nonunique solutions. In that respect such analysis complements item (i) above.
The paper is organized as follows: Section 2 includes preliminary results such as tools from the projection operator and probability, as well as required preliminaries on the weak sharpness property. Section 3 analyzes the method for weak sharp monotone operators. Subsection 3.4 presents the corresponding complexity analysis. Section 4 deals with the regularized version for general monotone operators. Subsection 4.6 presents the corresponding complexity analysis. We list the assumptions in each section, along with the algorithm statements and their convergence analysis.
2 Preliminaries
2.1 Projection operator and notation
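For $x,y\in\mathbb{R}^n$, we denote by $\langle x,y\rangle$ the standard inner product and by $\|x\|=\sqrt{\langle x,x\rangle}$ the corresponding Euclidean norm. We shall denote by $d(\cdot,C)$ the distance function to a general set $C$, namely, $d(x,C)=\inf\{\|x-y\|:y\in C\}$. For $X$ as in Definition 1 we denote $d(x):=d(x,X)$. By $\mathrm{cl}\,C$ and $\mathcal{D}(C)$ we denote the closure and the diameter of the set $C$, respectively. For a closed and convex set $C$, we denote by $\Pi_C$ the orthogonal projection onto $C$. For a function $g:\mathbb{R}^n\to\mathbb{R}$ we denote by $g^+$ its positive part, defined by $g^+(x)=\max\{g(x),0\}$ for $x\in\mathbb{R}^n$. If $g$ is convex, we denote by $\partial g$ its subdifferential and by $\mathrm{dom}\,g$ its domain.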
The following properties of the projection operator are well known; see e.g. Facchinei and Pang [17] and Auslender and Teboulle [1].
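Lemma 1.
Take a closed and convex set $C\subset\mathbb{R}^n$. Then
- i)
For all $x\in\mathbb{R}^n$ and $y\in C$, $\langle x-\Pi_C(x),y-\Pi_C(x)\rangle\le 0$.
- ii)
For all $x,y\in\mathbb{R}^n$, $\|\Pi_C(x)-\Pi_C(y)\|\le\|x-y\|$.
- iii)
Let $z\in\mathbb{R}^n$, $x\in C$ with $y:=\Pi_C[x-z]$. Then for all $\bar{x}\in C$,
$$2\langle z,y-\bar{x}\rangle\le\|x-\bar{x}\|^2-\|y-\bar{x}\|^2-\|y-x\|^2.$$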
The following lemma will be used in the analysis of the methods of Sections 3 and 4. It is proved in Nedić [32] and Polyak [38], but in a slightly different form, suitable for convex optimization problems. The changes required for the case of monotone variational inequalities are straightforward.
Lemma 2.
Consider a closed and convex , and let be a convex function with . Suppose that there exists such that for all and all . Take , , , and define as
[TABLE]
where is such that if . Then for any such that , and any , it holds that
[TABLE]
Remark 1.
We remark that if and the subgradients of are uniformly bounded over , then the result of Lemma 2 holds with given as instead of .
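The abbreviation "a.s." means "almost surely" and the abbreviation "i.i.d." means "independent and identically distributed". Given sequences $\{x_k\}$ and $\{y_k\}$, the notation $x_k=O(y_k)$ or $\|x_k\|\lesssim\|y_k\|$ means that there exists $C>0$ such that $\|x_k\|\le C\|y_k\|$ for all $k$. The notation $\|x_k\|\sim\|y_k\|$ means $\|x_k\|\lesssim\|y_k\|$ and $\|y_k\|\lesssim\|x_k\|$. Given a $\sigma$-algebra $\mathcal{F}$ and a random variable $\xi$, we denote by $\mathbb{E}[\xi]$ and $\mathbb{E}[\xi|\mathcal{F}]$ the expectation and conditional expectation, respectively. Also, we write $\xi\in\mathcal{F}$ for "$\xi$ is $\mathcal{F}$-measurable". $\sigma(\xi_1,\ldots,\xi_k)$ indicates the $\sigma$-algebra generated by the random variables $\xi_1,\ldots,\xi_k$. $\mathbb{N}_0$ denotes the set of natural numbers including zero. For $m\in\mathbb{N}$, we use the notation $[m]=\{1,\ldots,m\}$. For $r\in\mathbb{R}$, $\lceil r\rceil$ denotes the smallest integer greater than $r$. We denote by $\mathbb{R}^m_{>0}$ the interior of the nonnegative orthant $\mathbb{R}^m_+$.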
2.2 Probabilistic tools
As in other stochastic approximation methods, a fundamental tool to be used is the following Convergence Theorem of Robbins and Siegmund [41], which can be seen as the stochastic version of the properties of quasi-Fejér convergent sequences.
Theorem 1.
Let $\{y_k\},\{u_k\},\{a_k\},\{b_k\}$ be sequences of nonnegative random variables, adapted to the filtration $\{\mathcal{F}_k\}$, such that a.s. $\sum_ka_k<\infty$, $\sum_kb_k<\infty$ and for all $k\in\mathbb{N}$, $\mathbb{E}\big[y_{k+1}\big|\mathcal{F}_k\big]\le(1+a_k)y_k-u_k+b_k$. Then a.s. $\{y_k\}$ converges and $\sum_ku_k<\infty$.
We will also use the following result, whose proof can be found in Lemma 10 of Polyak [37].
Theorem 2.
Let $\{y_k\},\{a_k\},\{b_k\}$ be sequences of nonnegative random variables, adapted to the filtration $\{\mathcal{F}_k\}$, such that a.s. $a_k\in[0,1]$, $\sum_ka_k=\infty$, $\sum_kb_k<\infty$, and for all $k\in\mathbb{N}$, $\mathbb{E}\big[y_{k+1}\big|\mathcal{F}_k\big]\le(1-a_k)y_k+b_k$. Then a.s. $\{y_k\}$ converges to zero.
2.3 Weak sharpness
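We briefly discuss the weak sharpness property of variational inequalities. For $X\subset\mathbb{R}^n$ and $x\in X$, $\mathbb{N}_X(x)$ denotes the normal cone of $X$ at $x$, given by
$$\mathbb{N}_X(x)=\{v\in\mathbb{R}^n:\langle v,y-x\rangle\le 0,\ \forall y\in X\}.$$
The tangent cone of $X$ at $x\in X$ is defined as
$$\mathbb{T}_X(x)=\{d\in\mathbb{R}^n:\exists\,t_k\downarrow 0,\ \exists\,d^k\to d\ \text{such that}\ x+t_kd^k\in X\ \text{for all}\ k\}.$$
For a closed and convex set $X$, the tangent cone at a point $x\in X$ has the following alternative representations (see Rockafellar and Wets [42], Proposition 6.9 and Corollary 6.30):
$$\mathbb{T}_X(x)=\mathrm{cl}\{\alpha(y-x):\alpha>0,\ y\in X\}=[\mathbb{N}_X(x)]^\circ,$$
where for a given set $Y\subset\mathbb{R}^n$, the polar set $Y^\circ$ is defined as $Y^\circ=\{v\in\mathbb{R}^n:\langle v,y\rangle\le 0,\ \forall y\in Y\}$.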
In Burke and Ferris [10], the notion of weak sharp minima for the problem with solution set was introduced: there exists such that
[TABLE]
for all , where is the minimum value of at . Relation (22) means that gives an error bound on the solution set . In Burke and Ferris [10], it is proved that if is a closed, proper, and differentiable convex function and if the sets and are nonempty, closed, and convex, then (22) is equivalent to the following geometric condition: for all ,
[TABLE]
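To make the error-bound reading of weak sharp minima concrete, the sketch below compares two one-dimensional functions with solution set $\{0\}$. It assumes the standard form of the Burke and Ferris condition, $f(x)-f^*\geq\rho\,\mathrm{d}(x,X^*)$ for some $\rho>0$; the functions and the helper name `sharpness_ratio` are illustrative choices, not part of the source.

```python
def sharpness_ratio(f, f_min, dist_to_solutions, points):
    """Smallest observed value of (f(x) - f*) / d(x, X*) over the test points.
    A weak sharp minimum needs this ratio to stay bounded away from zero."""
    return min((f(x) - f_min) / dist_to_solutions(x) for x in points)


# Test points approaching the solution set {0}.
points = [10.0 ** (-j) for j in range(1, 7)]

# f(x) = |x|: the ratio is identically 1, so the bound holds with rho = 1.
ratio_abs = sharpness_ratio(abs, 0.0, abs, points)

# f(x) = x^2: the ratio equals |x| and vanishes near the solution,
# so no positive rho works and the minimum is not weak sharp.
ratio_square = sharpness_ratio(lambda x: x * x, 0.0, abs, points)
```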
In optimization problems, the objective function can be used for determining regularity of solutions. In variational inequalities one can use for that purpose the above geometric definition or exploit the use of gap functions associated to the VI. The dual gap function is defined as
[TABLE]
In the sequel, we denote by the unit ball in and by the solution set of VI. In order to define a meaningful notion of weak sharpness for VIs, the following statements were considered in Marcotte and Zhu [31]:
- (i)
There exists , such that for all ,
[TABLE]
- (ii)
There exists , such that for all ,
[TABLE]
- (iii)
For all ,
[TABLE]
- (iv)
There exist such that for all ,
[TABLE]
Statement (iii) is the definition of a weak sharp VI given in Marcotte and Zhu [31]. In Theorem 4.1 of Marcotte and Zhu [31], it was proved that (i)-(ii) are equivalent, and that (i)-(iv) are equivalent when is compact and is paramonotone (also known as monotone+), i.e., is monotone and for all (see Iusem [21] for other properties of paramonotone operators).
Relation (28) means that the gap function provides an error bound on the solution set . Paramonotonicity implies that is constant on the solution set . Important classes of paramonotone operators are, for example, co-coercive, symmetric monotone and strictly monotone composite operators (see Facchinei and Pang [17], Chapter 2).
Recently, the following assumption was introduced in Yousefian et al. [47]: there exists such that for all and all ,
[TABLE]
Clearly, (29) implies (28). We show next that (29) implies (26) and the converse statement holds when is constant on . Thus, when is constant on , (25), (26) and (29) are equivalent, and when is paramonotone and is compact, conditions (25)-(29) are all equivalent. Hence, the following proposition, which appears to be new and is proved in the Appendix, gives a precise relation between property (29) with the previous notions of weak sharpness (25)-(28) presented in Marcotte and Zhu [31]. Property (29) is well suited for the incremental constraint projection-type methods considered here.
Proposition 1.
Let be a continuous monotone operator and a closed and convex set. The following holds:
- i)
Condition (29) implies (26).
- ii)
If is constant on , then (26) implies (29).
Finally, we will use the following result, Theorem 4.2 of Marcotte and Zhu [31]:
Theorem 3.
If is continuous and there exists such that , then .
As a consequence of Theorem 3 under weak sharpness and uniform continuity of , any algorithm which generates a sequence such that has the property that after a finite number of iterations , any solution of the auxiliary program with a linear objective, is a solution of the original variational inequality (see Theorem 5.1 in Marcotte and Zhu [31]). When is a polyhedron, this result can be interpreted as a finite convergence property of algorithms for VI with the weak sharpness property, since a linear program is finitely solvable. Other algorithmic implications of weak sharpness are developed in Marcotte and Zhu [31].
3 An incremental projection method under weak sharpness
In this section we assume that the feasible set has the form
[TABLE]
where is a collection of closed and convex subsets of . We assume that the evaluation of the projection onto is computationally easy and that for all , is representable as
[TABLE]
for some convex function with . Also we assume that, for every , subgradients of at points are easily computable and that is uniformly bounded over , that is, there exists such that
[TABLE]
3.1 Statement of the algorithm
Next we formally state the algorithm.
Algorithm 1 (Incremental constraint projection method).
1. Initialization: Choose the initial iterate , the stepsizes and , the random controls and the operator samples .
2. Iterative step: Given , define:
[TABLE]
where if ; if .
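The two-stage structure of the iterative step can be sketched on a toy problem. This is a minimal sketch under stated assumptions, not the authors' implementation: the display defining the iteration is not reproduced above, so the code assumes a stochastic approximation step followed by a Polyak-type subgradient step on the positive part of one randomly chosen constraint function, with the easy set taken as the whole space. The test problem (noisy operator $x-c$, three halfspaces), the stepsize $1/(k+1)$, the relaxation parameter and all function names are hypothetical.

```python
import random


def incremental_constraint_projection(sample_operator, halfspaces, x0,
                                      num_iters, seed=0):
    """Two-stage iteration: SA step, then a step toward one sampled constraint."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(num_iters):
        alpha = 1.0 / (k + 1)  # assumed stepsize: divergent sum, summable squares
        beta = 1.0             # assumed relaxation parameter in (0, 2)
        # Stage 1: move against a sampled operator value F(v^k, x^k).
        f = sample_operator(x, rng)
        y = [xi - alpha * fi for xi, fi in zip(x, f)]
        # Stage 2: sample one halfspace {z : <a, z> <= b} uniformly and take a
        # subgradient step on the positive part of g(z) = <a, z> - b.
        a, b = rng.choice(halfspaces)
        violation = max(sum(ai * yi for ai, yi in zip(a, y)) - b, 0.0)
        scale = beta * violation / sum(ai * ai for ai in a)
        x = [yi - scale * ai for yi, ai in zip(y, a)]
    return x


# Toy data (hypothetical): mean operator T(x) = x - c, observed with noise.
CENTER = (2.0, 2.0)
HALFSPACES = [((1.0, 0.0), 1.0), ((0.0, 1.0), 1.0), ((1.0, 1.0), 1.5)]


def noisy_operator(x, rng):
    return [xi - ci + rng.gauss(0.0, 0.1) for xi, ci in zip(x, CENTER)]


x_final = incremental_constraint_projection(
    noisy_operator, HALFSPACES, [0.0, 0.0], 20000)
```

For this toy problem the solution is the projection of $c$ onto the intersection of the three halfspaces, namely $(0.75, 0.75)$, and the final iterate should be close to it. Only one constraint is touched per iteration, which is the point of the incremental scheme when the number of constraints is large.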
3.2 Discussion of the assumptions
In the sequel we consider the natural filtration
[TABLE]
Next we present the assumptions necessary for our convergence analysis.
Assumption 1 (Consistency).
The solution set of VI is nonempty.
Assumption 2 (Monotonicity).
The mean operator in (2) satisfies: for all ,
[TABLE]
Assumption 3 (Lipschitz-continuity or boundedness).
We suppose is continuous and at least one of the following assumptions holds:
- (i)
There exists measurable with finite second moment, such that a.s. for all ,
[TABLE]
We denote .
- (ii)
There exists such that
[TABLE]
Item (i) implies in particular that is -Lipschitz continuous. Both items (i) and (ii) are standard in stochastic optimization. Let
[TABLE]
denote the variance of for . Both item (ii) and (10) imply that the variance function is bounded above uniformly over . Item (i) is a weaker assumption since it only requires the map to be finite at every point in (allowing to be unbounded). Except for Wang and Bertsekas [43] in the strongly monotone case, conditions in item (ii) or in (10) were required in all the previous literature on SA methods for SVI or stochastic optimization. Under Assumption 3(i), we do not require (10).
Assumption 4 (IID sampling).
The sequence is an independent identically distributed sample sequence of .
The above assumption implies in particular that a.s. for all and all , $\mathbb{E}[F(v^{k},x)\mid\mathcal{F}_{k}]=T(x)$. We now state the assumptions concerning the incremental projections.
Assumption 5 (Constraint sampling and regularity).
There exists such that a.s. for all and all ,
[TABLE]
Assumption 5 is very general and it was assumed in Nedić [32]. For completeness we present next a lemma showing Assumption 5 holds in the relevant case in which the feasible set satisfies a standard metric regularity property, the number of constraints is finite (and possibly very large) and an i.i.d. uniform sampling of the constraints is chosen.
Lemma 3 (Sufficient condition for Assumption 5).
Suppose and are independent sequences, and the following hold:
- (i)
The sequence is an i.i.d. sample of a random variable taking values on such that for some ,
[TABLE]
- (ii)
The set is metric regular: there is such that for all ,
[TABLE]
Then Assumption 5 holds with .
Proof.
Since and are independent and the ’s are i.i.d., we have that for all , is independent of . Hence for all and ,
[TABLE]
using the fact that has the same distribution as in the second equality, Lemma 3(i) in the first inequality and Lemma 3(ii) in the last inequality. ∎
Item (i) above is satisfied when is uniform over , i.e., for all . As an example, item (ii) in Lemma 3 is satisfied for any compact convex set under a Slater condition, as proved by Robinson (see Pang [36]). A particular case of item (ii) occurs when for some and all ,
[TABLE]
In this case for and the method (33)-(34) may be rewritten as (15)-(16) assuming easy projections onto the soft constraints. Condition (36) is called linear regularity; see Bauschke and Borwein [4], Deutsch and Hundal [16]. As proved by Hoffman, (36) is satisfied for any polyhedron (see Pang [36]).
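The role of uniform sampling in Lemma 3 rests on an elementary fact: the average of finitely many nonnegative numbers is at least their maximum divided by their count. The sketch below checks this for the squared positive parts of three halfspace constraints. The constraint data, the test point and the function name are hypothetical, and the precise constants in Lemma 3 are not reproduced here.

```python
def squared_violations(halfspaces, x):
    """(g_i^+(x))^2 for each halfspace constraint g_i(x) = <a_i, x> - b_i <= 0."""
    return [max(sum(ai * xi for ai, xi in zip(a, x)) - b, 0.0) ** 2
            for a, b in halfspaces]


# Hypothetical data: three halfspaces in the plane and an infeasible point.
HALFSPACES = [((1.0, 0.0), 1.0), ((0.0, 1.0), 1.0), ((1.0, 1.0), 1.5)]
values = squared_violations(HALFSPACES, (2.0, 0.5))

# Expected squared violation when the constraint index is sampled uniformly.
mean_under_uniform_sampling = sum(values) / len(values)
# Worst-case squared violation, the quantity a regularity bound controls.
worst_violation = max(values)
```

Here `mean_under_uniform_sampling` is at least `worst_violation / 3`, so a bound on the distance to the feasible set in terms of the worst violation transfers, up to a factor equal to the number of constraints, to a bound in terms of the sampled violation.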
Assumption 6 (Small stepsizes).
For all , , , and
[TABLE]
We remark here that the use of small stepsizes is forced by two factors: the use of approximate projections instead of exact ones, and the stochastic approximation. Indeed, even with exact projections, the method (33)-(34) still requires small stepsizes in order to guarantee asymptotic convergence.
Finally we state the weak-sharpness property assumed only in this section.
Assumption 7 (Weak sharpness).
There exists , such that for all and all ,
[TABLE]
3.3 Convergence analysis
We need the following lemma whose proof is immediate.
Lemma 4.
Suppose that Assumptions 3(i)-4 hold. Define the function as
[TABLE]
for any . Then, almost surely, for all , ,
[TABLE]
We now prove an iterative relation to be used in the convergence analysis. We mention that (40) is sufficient for the convergence analysis and includes the case of unbounded and . If the operator is bounded or is compact, then (42) allows an improvement of the convergence rate given in Section 3.4. In the following we define for all , and ,
[TABLE]
Lemma 5 (Recursive relations).
Suppose that Assumptions 1-7 hold.
If Assumption 3(i) holds, then for all , and ,
[TABLE]
and
[TABLE]
If Assumption 3(ii) holds, then for all , and ,
[TABLE]
and
[TABLE]
Proof.
Take , and . We claim that
[TABLE]
[TABLE]
Indeed, by the definition of the method (33)-(34), we can invoke Lemma 2 with , , , , , , , and , obtaining (44).
We now take the conditional expectation with respect to in (44) obtaining,
[TABLE]
using and Assumption 4 in the first inequality, and Assumption 5 in the second inequality.
Next, we will bound the second term in the right hand side of (45). We write
[TABLE]
By monotonicity of (Assumption 2), the first term in the right hand side of (46) satisfies
[TABLE]
Regarding the second term in the right hand side of (46), the weak sharpness property (Assumption 7) and the fact that imply
[TABLE]
We now observe that so that
[TABLE]
[TABLE]
Concerning the third term in the right hand side of (46), we have
[TABLE]
using Cauchy-Schwarz inequality in the first inequality, and the definition of in Lemma 4 in the second inequality. Combining (47), (50) and (51) with (46), we finally get
[TABLE]
[TABLE]
From Lemma 4 and the fact that , we obtain
[TABLE]
Now we rearrange the last two terms in the right hand side of (53), using the fact that for any . With , and we get
[TABLE]
Putting together relations (53)-(55) and rearranging terms, we finally get (40), as requested.
Alternatively, we can replace (55) by the bound
[TABLE]
using the fact that with , and . Putting together relations (53)-(54) and (56) and rearranging terms, we get (41), as requested.
Suppose now that Assumption 3(ii) holds. In this case, the inequalities in (51) can be replaced by
[TABLE]
using Assumption 3(ii) and the fact that $\|T(x^{*})\|^{2}\leq\mathbb{E}[\|F(v^{k},x^{*})\|^{2}\mid\mathcal{F}_{k}]\leq 2C_{F}^{2}$, which follows from Jensen’s inequality, in the last inequality. Hence, combining (47), (50) and (57) we get, instead of (52),
[TABLE]
Using Assumption 3(ii) and (58) in (45) we get
[TABLE]
In view of Assumption 1, we define . Note that because is continuous and . From (59) we get
[TABLE]
using , and (59) in the second inequality. We rearrange now the last two terms in the right hand side of (60) (in a way similar to (55) or (56)), and obtain (42) or (43). ∎
Theorem 4 (Asymptotic convergence).
Under Assumptions 1-7, method (33)-(34) generates a sequence which a.s. is bounded and In particular, a.s. all cluster points of belong to .
Proof.
We begin by imposing Assumption 3(i). Choose some (Assumption 1) and . By Assumption 6 and the definitions given in Lemma 5, we have , and , since , for for all . Hence, we can invoke (40) in Theorem 1 in order to conclude that, a.s., converges and, in particular, is bounded.
In view of Assumption 1, we can define . We have because and is continuous. Since (40) in Lemma 5 holds for any and , we conclude that for all ,
[TABLE]
using relation (40) and in the second inequality.
We observe that the function defined in Lemma 4 is locally bounded because is continuous. Using this fact, the continuity of , the a.s.-boundedness of and , we conclude that and are a.s.-bounded. From the a.s.-boundedness of and and the conditions , and for all , which hold by Assumption 6, we conclude from Theorem 1 and (61) that a.s. converges, and
[TABLE]
By Assumption 6, we also have that , so that the above relation implies a.s. In particular, the sequence has a subsequence that converges to zero almost surely. Since a.s. converges, we conclude that the whole sequence a.s. converges to zero. The proof under Assumption 3(ii) is similar, using (42). ∎
3.4 Convergence rate analysis
In this subsection we present convergence rate results for the method (33)-(34) under the weak sharpness property (37). The solvability metric will be while the feasibility metric will be . We define, for ,
[TABLE]
[TABLE]
where is the ergodic average of the iterates and is the window-based ergodic average of the iterates when the stepsizes are used to compute the weights. The solvability metric will be given in terms of or . The definitions of and are analogous, but using for computing the weights. The feasibility metric will be given in terms of such ergodic averages.
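The ergodic averages used as solvability and feasibility metrics can be sketched as follows. Since the defining displays are not reproduced above, this sketch assumes the usual form in which each iterate is weighted by its stepsize and the window-based version simply restricts the sum to the most recent indices; the function name and the sample data are hypothetical.

```python
def weighted_ergodic_average(iterates, weights, start=0):
    """Average of iterates[start:] with weights proportional to weights[start:].
    With start=0 this is the full ergodic average; a positive start gives a
    window-based average over the most recent iterates."""
    window_iterates = iterates[start:]
    window_weights = weights[start:]
    total = sum(window_weights)
    dim = len(window_iterates[0])
    return [sum(w * x[j] for w, x in zip(window_weights, window_iterates)) / total
            for j in range(dim)]


# Hypothetical one-dimensional iterates and stepsizes used as weights.
iterates = [[0.0], [1.0], [2.0]]
stepsizes = [1.0, 1.0, 2.0]
full_average = weighted_ergodic_average(iterates, stepsizes)
window_average = weighted_ergodic_average(iterates, stepsizes, start=1)
```

The window-based average discards early iterates, which are typically far from the solution set, at the cost of averaging over fewer samples.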
In order to obtain convergence rates for the case of an unbounded feasible set or unbounded constraint sets , we shall need the following proposition, which ensures that the sequence is bounded in . A typical situation is the case in which is a polyhedron, i.e. and the selected constraints are halfspaces, which have easily computable projections but are unbounded sets. If the uniform bound of Assumption 3(ii) holds, then sharper bounds are given in (68). We shall define for ,
[TABLE]
and for ,
[TABLE]
Proposition 2 (Boundedness in ).
Suppose that Assumptions 1-7 hold.
Under Assumption 3(i), choose , and such that
[TABLE]
Then for all ,
[TABLE]
If Assumption 3(ii) holds, then for all ,
[TABLE]
Proof.
We first prove (67) under Assumption 3(i). Recall the definitions of and in (39). By Assumption 6, we can choose and such that (66) holds. Observe that , because , so that
[TABLE]
Fix and . Define
[TABLE]
For any , we take the total expectation and sum (40) from to , obtaining
[TABLE]
Given an arbitrary , define
[TABLE]
Suppose first that for all . Then by (66), (69) and (70) we get
[TABLE]
using the fact that in the definition of , and the definitions of , and . Hence
[TABLE]
using the fact that . Since is arbitrary, it follows that
[TABLE]
using again the fact that . In view of (70)-(71), we have a contradiction with the assumption that for any . Hence, there exists some such that , so that the set in the right hand side of (70) is empty. In this case we have . If , then (67) holds trivially, since . Otherwise, . From (66), (69), and the definitions of , , and , we have for all ,
[TABLE]
implying that so that
[TABLE]
using again . From (72) and the definitions of , and , we conclude that (67) holds.
We now prove (68) under Assumption 3(ii). As before, we define
[TABLE]
Taking total expectation in (42) and summing from [math] to , we get
[TABLE]
for all , using the fact that and the definitions of , , , and . We conclude from (73), the definitions of and , and the monotonicity of the sequences that (68) holds. ∎
Next we will give convergence rate results for the original sequence and for the ergodic average sequences. We consider separately the case of unbounded operators (Assumption 3(i)) and the case of bounded ones (Assumption 3(ii)), because in the latter case sharper rates are possible. In the remainder of this subsection, we refer the reader to definitions (38), (39), (62)-(63) and (64)-(65).
Theorem 5 (Solvability and feasibility rates of convergence: unbounded case).
Suppose that Assumptions 1-7 and Assumption 3(i) hold. Choose , and such that
[TABLE]
Define for ,
[TABLE]
[TABLE]
Then a.s.-converges to [math] and the following holds:
- a)
For any , there exists , such that for all ,
- b)
For all and all ,
- c)
For any , there exists , such that for all ,
- d)
For all and all ,
Proof.
Fix , and as in (74). This is possible because converges to [math] as by Assumption 6. We now invoke Lemma 5. We take the total expectation in (40) and sum from to , obtaining, for every ,
[TABLE]
using and the definitions of , , , , and in the last inequality.
We now invoke Proposition 2. Setting , (66) can be rewritten as (74). From (67) and , we get, for all ,
[TABLE]
using the definitions of , and .
We prove now item (a). For every , define
[TABLE]
From the definition of we have, for every ,
[TABLE]
We claim that is finite. Indeed, if , then (77), (78) and (80) hold for and all . Hence, letting and using that and , which hold by Assumption 6, we obtain , which contradicts Assumption 6. Hence, the set in the right hand side of (79) is nonempty, which implies . Setting and in (77), (78) and (80), we get for all ,
[TABLE]
using the definition of . We thus obtain item (a).
We now prove item (b). In view of the convexity of the function , and the linearity and monotonicity of the expected value, we have
[TABLE]
Set , divide (77) by and use (81), the definition of together with (78), in order to bound , and obtain item (b) as a consequence.
The proofs of items (c) and (d) follow the proofline above, using (41) instead of (40). ∎
Corollary 1 (Solvability and feasibility rates with robust stepsizes: unbounded case).
Assume that the hypotheses of Theorem 5 hold. Given and , define as: and for ,
[TABLE]
and choose , and . Take as the minimum natural number such that
[TABLE]
Define
[TABLE]
[TABLE]
Then a.s.-converges to [math] and the following holds:
- a)
For every , there exists such that
[TABLE]
- b)
For all ,
[TABLE]
- c)
For every , there exists such that
[TABLE]
- d)
For all ,
[TABLE]
Proof.
We estimate in (76). Since
[TABLE]
we conclude from (74) that it is enough to choose the minimum such that
[TABLE]
that is to say, the minimum such that (83) holds.
Let . We first estimate the sum of the stepsize sequence. For any we have
[TABLE]
using the fact that the minimum stepsize between and is . The sum of the squares of the stepsizes sequence can be estimated as
[TABLE]
We assume without loss of generality that we have in (79). Item (a) follows from (84) with and , (85), Theorem 5(a) and the definitions of , and .
Similarly, item (b) follows from (84)-(85) with , Theorem 5(b) and the definitions of and and the facts that and .
The proof of items (c) and (d) follows a similar proofline, using Theorem 5(c)-(d) and the fact that . ∎
Next we give convergence rates for the bounded case. For simplicity we just state the rates for the ergodic averages, but we note that similar rates can be derived for as in Theorem 5 and Corollary 1.
Theorem 6 (Solvability and feasibility rates: bounded case).
Suppose that Assumptions 1-7 and Assumption 3(ii) hold. Choose . Define for in ,
[TABLE]
Then, a.s.-converges to zero and
- a)
For all ,
- b)
If is compact, then for all with ,
- c)
for all ,
- d)
If is compact, then for all with ,
Proof.
Fix . We will invoke Lemma 5. We take the total expectation in (42) and sum from to , obtaining
[TABLE]
using the fact that and the definitions of , , , , and in the last inequality. From (86) on, the proofs of items (a)-(b) are similar to the proof of Theorem 5. We omit the details, but make the following remarks: unlike the proofs of items (a)-(b) in Theorem 5, the proofs of items (a)-(b) of Theorem 6 do not require Proposition 2. In the proof of item (b), we use the bound in (86). The proofs of items (c)-(d) follow a similar proofline, using (43). ∎
Corollary 2 (Solvability and feasibility rates with robust stepsizes: bounded case).
Assume that the hypotheses of Theorem 6 hold. Given and , define as: and for ,
[TABLE]
and choose , . Define
[TABLE]
Then a.s.-converges to [math] and
- a)
for all ,
[TABLE]
- b)
if is compact, then given , for all , it holds that
[TABLE]
- c)
For all ,
[TABLE]
Proof.
Item (a) follows from (84)-(85) with , Theorem 6(a), the definition of , and the facts that and .
The proof of item (c) follows a similar proofline, using Theorem 6(c) and .
We now prove item (b). Let , and set . We have and . We estimate
[TABLE]
[TABLE]
[TABLE]
using the inequality in the second inequality of (89) and in the second inequality of (90). Item (b) follows from (89)-(90), Theorem 6(b), the definition of and and the fact that . ∎
Remark 2.
Corollary 2(b) implies that, if is compact, then has a better performance than and when stepsizes as in (87) are used. Indeed, in Corollary 2(c), can be arbitrarily small, without affecting the constant in the convergence rate, and the “stochastic error” decays to zero. For unbounded operators, (83) in Corollary 1 suggests the use of and so that does not become too large. As an example, if , , , and , we have . For simplicity we do not state the analogous result for .
In Corollaries 1-2, stepsizes of are small enough to guarantee asymptotic a.s.-convergence and large enough as to ensure a rate of . If asymptotic a.s.-convergence of the whole sequence is not the main concern, we show next that one may use larger stepsizes of for ensuring convergence in (hence convergence in probability and a.s.-convergence of a subsequence) with a convergence rate of . When a constant stepsize is used in method (33)-(34), we can also give an error bound on the performance proportional to . Precisely, for fixed , we have and . Such error bounds rigorously justify the practical use of constant stepsizes in incremental methods for machine learning, where only an inexact solution is required.
Corollary 3 (Solvability and feasibility rates for larger stepsizes: bounded case).
Assume that the hypotheses of Theorem 6 hold. Recall the definition of in Corollary 2. Choose , and .
- a)
If we choose a constant stepsize , then for all ,
[TABLE]
[TABLE]
- b)
If the total number of iterations is given a priori and for all , then
[TABLE]
[TABLE]
- c)
If is compact and we choose and for , then, given , for all ,
[TABLE]
[TABLE]
Proof.
Item (a) follows from Theorem 6(a) and (c) and the definitions of , , , , and . Item (b) follows from item (a). We prove now item (c). Take , and set . We have and . We estimate
[TABLE]
using the fact that the minimum stepsize between and is . We also estimate
[TABLE]
[TABLE]
[TABLE]
using and . Item (c) follows from (93)-(94), Theorem 6(b) and (d), the definitions of and and the fact that . ∎
We make a remark concerning the robustness of the stepsize sequence in Corollaries 1, 2 and 3 in the spirit of Nemirovski et al. [34]. The stepsizes presented above are robust in the sense that the knowledge of is not required and does not interrupt the advance of the method. Also, a scaling of in the stepsize implies a scaling in the convergence rate which is linear in or . Note that these properties hold true in the case of an unbounded operator with approximate projections.
We close this section by showing that, in the case of stochastic approximation, the weak sharpness property implies that after a finite number of iterations an auxiliary stochastic program with linear objective solves the original variational inequality. This recovers a similar property satisfied in the deterministic setting (see Marcotte and Zhu [31], Theorem 5.1). We estimate the minimum number of iterations in terms of the condition number , the variance and the distance of to the solution set, when is -Lipschitz continuous.
We emphasize that the auxiliary problem is still stochastic, and hence, even when is a polyhedron, we cannot conclude that a finite number of steps of a linear programming algorithm will be enough for finding a solution. It is not clear that switching to an SAA method for stochastic LPs will be computationally more efficient than continuing with our algorithm. This issue requires extensive computational experimentation, which we intend to perform in future work. Thus, for the time being we regard the next corollary as a possibly interesting theoretical property of weak-sharp SVIs, i.e., an extension to the stochastic setting of Theorem 4.2 of [31].
Corollary 4 (A stochastic optimization problem).
Suppose that is -Hölder continuous with and
1. the assumptions of Corollary 1 hold with (unbounded case), or
2. the assumptions of Corollary 2 hold (bounded case).
Then, there exists , such that for all with we have
[TABLE]
Moreover, under condition 1,
[TABLE]
while, under condition 2,
[TABLE]
Proof.
Call . By the choice of , the definition of and Corollaries 1(b) and 2(a), we have
[TABLE]
From the Hölder continuity of ,
[TABLE]
using Jensen’s inequality in the first inequality, Hölder’s inequality in the third inequality and (95) in the last inequality.
From Proposition 1, Assumption 7 and the equivalence between (25) and (26), we get that the Euclidean ball of center and radius is contained in By the convexity of the ball and Jensen’s inequality, we have
[TABLE]
From (96) and (97) we get that $-\mathbb{E}[T(\widehat{x}^{k})]\in\operatorname{int}\big(\bigcap_{x\in X^{*}}[\mathbb{T}_{X}(x)\cap\mathbb{N}_{X^{*}}(x)]^{\circ}\big)$. Hence we conclude from Theorem 3 that
[TABLE]
Finally, we observe that $\mathbb{E}[T(\widehat{x}^{k})]=\mathbb{E}\big[\mathbb{E}[F(v,\widehat{x}^{k})\mid\mathcal{F}_{k}]\big]=\mathbb{E}[F(v,\widehat{x}^{k})]$, using Assumption 4, and the property . The result follows from and (98). ∎
4 An incremental projection method with regularization for Cartesian SVI
In this section we shall study incremental projections, dropping the weak sharpness property of Section 3 and assuming only monotonicity of the operator. Additionally, we analyze the distributed version of the method, which includes the centralized case () in particular. For the sake of clarity, we present next the Cartesian and constraint structures in such framework.
4.1 Cartesian structure
We assume in this section that the stochastic variational inequality (1)-(2) has a Cartesian structure. We consider the decomposition with and furnish this Cartesian space with the standard inner product for and . We suppose that the feasible set has the form where each component is a closed and convex set for . We emphasize that the orthogonal projection under a Cartesian structure is simple: for and with and , we have
We assume the random variable takes the form , where corresponds to the randomness of agent , the random operator has the form with for . From (2), the mean operator has the form with for . Such framework includes stochastic multi-agent optimization and stochastic Nash equilibrium problems as special cases.
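The remark in Section 4.1 that projections are simple under a Cartesian structure amounts to projecting each block onto its own component set. A minimal sketch, assuming two hypothetical component sets (a Euclidean ball and a box) and illustrative function names:

```python
import math


def project_ball(block, radius):
    """Euclidean projection onto the ball of the given radius centered at 0."""
    norm = math.sqrt(sum(t * t for t in block))
    if norm <= radius:
        return list(block)
    return [radius * t / norm for t in block]


def project_box(block, lower, upper):
    """Euclidean projection onto the box [lower, upper]^n (coordinatewise clip)."""
    return [min(max(t, lower), upper) for t in block]


def project_product(blocks, projections):
    """Projection onto a Cartesian product: project each block separately."""
    return [proj(block) for block, proj in zip(blocks, projections)]


# Two agents: the first constrained to the unit ball, the second to [0, 1]^2.
projected = project_product(
    [[3.0, 4.0], [2.0, -0.5]],
    [lambda b: project_ball(b, 1.0), lambda b: project_box(b, 0.0, 1.0)],
)
```

Because the squared distance to a product set is the sum of the blockwise squared distances, each agent can compute its own part without knowing the other agents' constraints.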
4.2 Constraint structure
In order to exploit the use of incremental projections (as in Section 3) in the Cartesian framework, we assume from now on that for , each Cartesian component of has the following form:
[TABLE]
where is a collection of closed and convex subsets of . Given , we assume that the projection operator onto is computationally easy to evaluate, and that for every , is representable in as
[TABLE]
for some convex function with domain . We denote the positive part of as for . We also assume that, for every , the subgradients of at points are easily computable and that is uniformly bounded over , i.e., there exists such that
[TABLE]
for all , all and all .
4.3 Statement of the algorithm
For problems endowed with the Cartesian structure and the constraint structure of Sections 4.1 and 4.2, our method advances in a distributed fashion for each Cartesian component , as in the incremental projection method (33)-(34) with an additional Tykhonov regularization (in order to cope with the plainly monotone case). Precisely, fix the Cartesian component . In a first stage, given the current iterate , the method advances in the direction with stepsize , after taking the sample and choosing the regularization parameter , producing an auxiliary iterate . In the second stage, a soft constraint is randomly chosen with the random control , and the method advances in the direction opposite to a subgradient of at the point with a stepsize , producing the next iterate . The iterates are collected in and the method continues. Formally, the method takes the form:
Algorithm 2 (Regularized incremental projection method: distributed case).
1. Initialization: Choose the initial iterate , the stepsize sequences and , the regularization sequence , the random control sequence and the operator sample sequence .
2. Iterative step: Given , define, for each ,
[TABLE]
where if , and for any if .
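The distributed iteration described in Section 4.3 can be sketched for a two-agent toy game. This is a sketch under stated assumptions rather than the authors' method verbatim: the display (102)-(103) is not reproduced above, so the code assumes a regularized stochastic approximation step for each agent followed by a subgradient step toward one randomly chosen constraint of that agent. The bilinear game, the noise level, the exponents 0.7 and 0.2, the agent-specific offsets and all names are hypothetical.

```python
import random


def regularized_incremental_projection(sample_operator, agent_halfspaces, x0,
                                       num_iters, seed=0):
    """Each agent j updates its own scalar decision x[j] in two stages."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(num_iters):
        f = sample_operator(x, rng)  # sampled joint operator F(v^k, x^k)
        x_next = []
        for j, halfspaces in enumerate(agent_halfspaces):
            # Assumed agent-specific sequences (the offsets j + 1 are hypothetical).
            alpha = (k + j + 1) ** -0.7  # stepsize
            eps = (k + j + 1) ** -0.2    # Tykhonov regularization parameter
            beta = 1.0                   # assumed relaxation parameter in (0, 2)
            # Stage 1: regularized stochastic approximation step for agent j.
            y = x[j] - alpha * (f[j] + eps * x[j])
            # Stage 2: step toward one randomly chosen constraint a * z <= b.
            a, b = rng.choice(halfspaces)
            violation = max(a * y - b, 0.0)
            x_next.append(y - beta * violation * a / (a * a))
        x = x_next
    return x


def noisy_bilinear_game(x, rng):
    # Mean operator T(x1, x2) = (x2, -x1): monotone but not strongly monotone.
    return [x[1] + rng.gauss(0.0, 0.1), -x[0] + rng.gauss(0.0, 0.1)]


INTERVAL = [(1.0, 1.0), (-1.0, 1.0)]  # z <= 1 and -z <= 1, i.e. the set [-1, 1]
x_final = regularized_incremental_projection(
    noisy_bilinear_game, [INTERVAL, INTERVAL], [1.0, 1.0], 20000)
```

The unique solution of this toy game is the origin, and the final iterate should be close to it. Without the regularization term the mean dynamics of this bilinear example merely rotate and do not contract, which is why the plainly monotone case needs the extra term.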
The first stage (102) of the iterative step can be written compactly as
[TABLE]
where
[TABLE]
with and denotes the block-diagonal matrix in defined as
[TABLE]
with , and denoting the identity matrix for each .
4.4 Discussion of the assumptions
We consider the natural filtration
[TABLE]
Assumption 8.
We require Assumptions 1-4 and Assumption 3(i).
In this section we avoid the weak sharpness property assumed in Section 3. We now state the assumptions concerning the approximate projections which accommodate the Cartesian structure. In simple terms, we require each Cartesian component given by (99) to satisfy Assumption 5 of Section 3. This is formally stated in Assumption 9. Also, the agents’ stepsizes and regularization sequences require a partial coordination specified in Assumption 10.
Assumption 9 (Constraint sampling and regularity).
For each , there exists , such that a.s. for all and all ,
[TABLE]
We observe that Assumption 9 requires a sampling coordination between the control sequences for , since the filtration accumulates the history of the control sequence of every Cartesian component. The next lemma shows that this assumption is immediately satisfied if each agent has a metric regular decision set and the constraint sampling is independent between agents and uniform i.i.d. for each agent.
Lemma 6 (Sufficient condition for Assumption 9).
Suppose that and are independent sequences, are independent for each and the following conditions hold: for each , and
- (i)
The sequence is an i.i.d. sample of a random variable taking values on such that for some ,
[TABLE]
- (ii)
The set is metric regular: there is such that for all ,
[TABLE]
Then Assumption 9 holds with for .
Proof.
Since and are independent, is independent and are independent for each , it follows that for all and , is independent of . The remainder of the proof follows the proof line of Lemma 3. ∎
Assumption 10 (Partial coordination of stepsizes and regularization sequences).
For , consider the stepsize sequences and and the regularization sequence in Algorithm (102)-(103). Without loss of generality, for we add the term to the regularization sequence. We use the notation , for , , and . We then assume that and
- (i)
For each , is a decreasing positive sequence converging to zero.
- (ii)
* and *
- (iii)
.
- (iv)
* and*
[TABLE]
- (v)
**
Assumption 10 contains usual conditions on the regularization parameters of Tykhonov algorithms and on the stepsize for SA algorithms, with certain coordination across stepsizes and regularization parameters. Assumption 10 includes Assumption 2 in [29] with the addition of (105), due to the use of approximate projections (in addition to asynchronous stepsizes). (Footnote 1: We observe that this condition is trivially satisfied with synchronous stepsizes, i.e., for all .) Next we show that Assumption 10 is satisfied by explicit stepsizes and regularization parameters.
Corollary 5 (Asynchronous stepsizes and regularization parameters).
Take and real numbers , . The following stepsizes and regularization parameters satisfy Assumption 10: for any and , take , , and
[TABLE]
Proof.
Except for condition (105), all other conditions in Assumption 10 are proved in Lemma 4 of [29]. We proceed with the proof of (105). Set , for and , . The claim is proved by showing that
[TABLE]
[TABLE]
∎
4.5 Convergence analysis
We present next our convergence result for method (102)-(103). We shall need two lemmas.
Lemma 7 (Eventual strong-monotonicity).
Consider Assumption 8. Define and Then for all and ,
Proof.
We consider the decomposition
[TABLE]
Concerning the second term in the right hand side of (106), if is the diagonal matrix with entries , then
[TABLE]
The first term in the right hand side of (106) is equal to
[TABLE]
The first term in the right hand side of (108) is nonnegative by monotonicity of . For the second term in the right hand side of (108), we have
[TABLE]
using the Cauchy-Schwarz inequality in the first inequality, Hölder’s inequality in the third one and Lipschitz continuity of in the last one. The result follows from (106)-(109). ∎
We will use the following result, proved in Lemma 3 of Koshal et al. [29]:
Lemma 8 (Properties of the Tykhonov sequence).
Assume that is convex and closed, that the operator is continuous and monotone over and that Assumption 1 holds. Assume also that the positive sequences for decrease to [math] and satisfy , with and . Denote by the solution of VI. Then
- (i) is bounded and all cluster points of belong to .
- (ii) The following inequality holds for all :
[TABLE]
where
[TABLE]
- (iii) If then converges to the least-norm solution in .
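To make the least-norm property in item (iii) concrete, here is a small sketch on a hypothetical unconstrained example that is not taken from the paper. The operator is F(x) = Ax - b with the singular positive semidefinite matrix A = [[1, 2], [2, 4]] and b = (5, 10), so F is monotone, the solution set is the line x1 + 2 x2 = 5, and its least-norm element is (1, 2). With a single regularization parameter `eps`, the Tykhonov point solves (A + eps I) x = b.

```python
def tykhonov_solution(eps):
    """Solve (A + eps*I) x = b by Cramer's rule for the 2x2 toy data."""
    a11, a12, a21, a22 = 1.0 + eps, 2.0, 2.0, 4.0 + eps
    b1, b2 = 5.0, 10.0
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det)

least_norm = (1.0, 2.0)  # least-norm point on the line x1 + 2*x2 = 5
eps_values = [1.0, 1e-1, 1e-2, 1e-3, 1e-4]
errors = []
for eps in eps_values:
    x = tykhonov_solution(eps)
    errors.append(max(abs(x[0] - least_norm[0]), abs(x[1] - least_norm[1])))

for eps, err in zip(eps_values, errors):
    print(f"eps = {eps:g}, distance to least-norm solution = {err:.2e}")
```

In this example the Tykhonov point equals (5, 10) / (5 + eps) in closed form, so the error shrinks proportionally to `eps` as the regularization vanishes.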
Recalling (38), we define
[TABLE]
which is a finite quantity, because is bounded and is a locally bounded function. We also define the following constants for given :
[TABLE]
Next we prove the asymptotic convergence of method (102)-(103).
Theorem 7 (Asymptotic convergence).
If Assumptions 8-10 hold, then the method (102)-(103) generates a sequence such that:
- (i) if , then almost surely is bounded and all cluster points of belong to the solution set ,
- (ii) if , then almost surely converges to the least-norm solution in .
Proof.
In the sequel we denote by the Tykhonov sequence of Lemma 8. Let . We claim that for all , , ,
[TABLE]
[TABLE]
Indeed, in view of (102)-(103) and , we can invoke Lemma 2 with , , , , , , , and obtaining (113).
We define for ,
[TABLE]
with . We use the definitions in (112) and sum the inequalities in (113) with between and , getting
[TABLE]
[TABLE]
Concerning the second term in the right hand side of (115), we have
[TABLE]
using the definitions in (104) and (114).
We now analyze the third term in the right hand side of (115). The triangle inequality and the inequality imply that
[TABLE]
Summing the inequalities in (117) with between and , we get from Assumption 3(i),
[TABLE]
Now we combine (115)-(118), in order to obtain
[TABLE]
[TABLE]
where and are defined as in (112).
The sum in the second term of the right hand side of (119) is equal to
[TABLE]
[TABLE]
Recalling the definition of in Assumption 10, it follows from Lemma 7 that the first term in the right hand side of (120) satisfies
[TABLE]
The second term in the right hand side of (120) is equal to
[TABLE]
[TABLE]
The first term in the right hand side of (122) equals
[TABLE]
Regarding the second term in the right hand side of (122), we use , so that for each we have
[TABLE]
using the Cauchy-Schwarz inequality in the first inequality, Lemma 1(ii) for in the second one, the fact that in the third one, and the relation in the fourth one. Putting together (122)-(124), we finally get that the second term in the right hand side of (120) is bounded by
[TABLE]
For the third term in the right hand side of (120), we have
[TABLE]
Combining (121), (125) and (126) with (120), we obtain
[TABLE]
[TABLE]
[TABLE]
We use (127) in (119) and finally get the following recursive relation: for all and ,
[TABLE]
[TABLE]
[TABLE]
where for we define:
[TABLE]
In the sequel we specify and take in (129), getting
[TABLE]
[TABLE]
[TABLE]
For deriving (130), we use the facts that , and , which hold because is independent of and identically distributed to , and
[TABLE]
in view of the fact that .
Using the definition of in (112), we get from Assumption 9 and the fact that :
[TABLE]
By (131), the last term in the right hand side of (130) is bounded by
[TABLE]
using the fact that with , and .
Since solves VI, we have
[TABLE]
Next we relate to , using the properties of (Lemma 8). We have
[TABLE]
[TABLE]
Using the relation for any , the last term in the rightmost expression in (134) can be estimated as
[TABLE]
[TABLE]
Combining (130), (131)-(132), (133) and (136) we get
[TABLE]
[TABLE]
[TABLE]
We now estimate the coefficient in (137). In view of (129), we have
[TABLE]
Assumption 10(ii) and guarantee that
[TABLE]
Since is arbitrary, we can ensure the existence of such that
[TABLE]
for all sufficiently large . Next we show that for large . Indeed, from (139) and we have that for large enough , so that we obtain from (138),
[TABLE]
Finally, by Assumption 10(ii), so that (140) implies that for sufficiently large . Using this fact and (139) we get the following estimate:
[TABLE]
using (139) in the last inequality.
Combining (137), (141) and , we obtain
[TABLE]
for all sufficiently large , with and
[TABLE]
[TABLE]
From (141) and , we conclude that , while from Assumption 10(iii) we have that . From Assumption 10(iv) and (143), we also get that . Finally, using the definitions of and , we obtain from (143):
[TABLE]
for some positive constants , , and . Therefore, we get from Assumption 10(ii) and (v). These conditions, Theorem 2 and (142) imply that almost surely. The result follows from this fact and Lemma 8. ∎
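The flavor of the scheme analyzed above can be illustrated with a deliberately simplified, single-agent sketch: one stochastic operator sample, one Tykhonov-regularized step, and one projection onto a randomly chosen constraint per iteration. This is not method (102)-(103). The toy operator F(x) = x - c, the three halfspaces, the exact halfspace projections, and the exponents in `alpha` and `eps` are all assumptions chosen for illustration, and no claim is made that they satisfy Assumption 10.

```python
import random

def project_halfspace(y, a, b):
    """Euclidean projection of y onto the halfspace {x : <a, x> <= b}."""
    violation = sum(ai * yi for ai, yi in zip(a, y)) - b
    if violation <= 0.0:
        return list(y)
    scale = violation / sum(ai * ai for ai in a)
    return [yi - scale * ai for ai, yi in zip(a, y)]

def run(num_iters=20000, seed=0):
    rng = random.Random(seed)
    c = [2.0, 2.0]  # F(x) = x - c, observed with additive Gaussian noise
    # Feasible set: intersection of three halfspaces (toy data).
    halfspaces = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 1.5)]
    x = [0.0, 0.0]
    for k in range(1, num_iters + 1):
        alpha = k ** -0.75  # illustrative stepsize
        eps = k ** -0.25    # illustrative Tykhonov regularization parameter
        sample = [xi - ci + rng.gauss(0.0, 0.1) for xi, ci in zip(x, c)]
        # Regularized stochastic step: x - alpha * (F(x, noise) + eps * x).
        y = [xi - alpha * (si + eps * xi) for xi, si in zip(x, sample)]
        # Project onto ONE randomly sampled constraint, not the intersection.
        a, b = rng.choice(halfspaces)
        x = project_halfspace(y, a, b)
    return x

x_final = run()
x_star = [0.75, 0.75]  # projection of c onto the feasible set
error = max(abs(u - v) for u, v in zip(x_final, x_star))
print("final iterate:", x_final, "max-norm error:", error)
```

For this toy problem the solution of the variational inequality is the projection of `c` onto the feasible set, namely (0.75, 0.75), and the final iterate lands close to it even though no iteration ever projects onto the full intersection.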
4.6 Convergence rate analysis
Next we give feasibility and solvability convergence rates. The feasibility rate will be given in terms of the metric evaluated at
[TABLE]
i.e., the ergodic average of the iterates with weights . Assuming that is compact (but allowing the hard constraint to be unbounded), the solvability convergence rate will be given in terms of the dual gap function , defined in (24), evaluated at
[TABLE]
which is the ergodic average of the feasible projections of the iterates with weights . We shall use the notation for .
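A weighted ergodic average of this kind can be maintained in a single pass without storing the whole trajectory. The sketch below shows the running update for generic positive weights. The function name, the sample points and the weights are hypothetical, since the specific weights used in the paper are not reproduced here.

```python
def ergodic_average(points, weights):
    """Weighted average sum_i w_i x_i / sum_i w_i, computed incrementally."""
    total = 0.0
    avg = [0.0] * len(points[0])
    for x, w in zip(points, weights):
        total += w
        # Move the running average toward x by the fraction w / total.
        avg = [a + (w / total) * (xi - a) for a, xi in zip(avg, x)]
    return avg

iterates = [[0.0, 0.0], [2.0, 4.0], [4.0, 2.0]]  # toy iterates
weights = [1.0, 1.0, 2.0]                         # toy positive weights
average = ergodic_average(iterates, weights)
print(average)
```

The same routine applies to the average of the projected iterates: only the input sequence changes.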
In the remainder of this subsection we recall definitions (35), (38), (110)-(112) and the ones given in Assumption 10. We first present the feasibility rate. In order to facilitate the presentation, we define some constants. Given and we set
[TABLE]
Theorem 8 (Feasibility rate).
Suppose Assumptions 8-10 hold. Then given , for all ,
[TABLE]
Proof.
We recall relation (130) in the proof of Theorem 7. Instead of using (132), we bound the left hand side of (132) by
[TABLE]
using the facts that with , and .
We combine (130), (131), (133) and (136) with (147), take total expectation and sum from [math] to in order to get
[TABLE]
In view of the convexity of and the linearity of the expectation operator, we have
[TABLE]
Relations (148)-(149) prove the required claim. ∎
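The convexity step used in this proof is an instance of Jensen's inequality. As a hedged illustration, assume that the convex function in question is the squared distance to the feasible set and that the weights are deterministic. The notation below ($\alpha_i$ for the weights, $I$ for the averaging window, $X$ for the feasible set) is introduced here for illustration and may differ from the paper's.

```latex
% Assumed notation: deterministic weights \alpha_i > 0 for i \in I,
% S = \sum_{i \in I} \alpha_i, and \bar{x} = S^{-1} \sum_{i \in I} \alpha_i x^i.
\[
\mathrm{d}(\bar{x}, X)^2
  \;\le\; \sum_{i \in I} \frac{\alpha_i}{S}\, \mathrm{d}(x^i, X)^2
\quad\Longrightarrow\quad
\mathbb{E}\big[\mathrm{d}(\bar{x}, X)^2\big]
  \;\le\; \frac{1}{S} \sum_{i \in I} \alpha_i\, \mathbb{E}\big[\mathrm{d}(x^i, X)^2\big].
\]
```

The first inequality holds because the squared distance to a convex set is a convex function, and the implication uses only the linearity of the expectation. A bound on the weighted sum of expected squared distances therefore transfers directly to the ergodic average.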
Next we present the solvability rate assuming that is compact. We will need the following definitions: for and ,
[TABLE]
[TABLE]
[TABLE]
We start with an intermediate lemma.
Lemma 9 (Feasibility error control).
For any and ,
[TABLE]
Proof.
For , define
[TABLE]
We have
[TABLE]
using the fact that in the equality, (131) in the first inequality and the fact that with , and in the second inequality. We then take in (153) and use the fact that in order to obtain
[TABLE]
Proceeding by induction as in (153)-(154), we get
[TABLE]
Taking total expectation in (156) and using the fact that , we prove the claim. ∎
Theorem 9 (Solvability rate).
Suppose that Assumptions 8-10 hold. Then, given , for all ,
[TABLE]
Proof.
We recall relation (128) in the proof of Theorem 7, where is defined in (104). Regarding the second line of (128), we have for any ,
[TABLE]
using the Cauchy-Schwarz inequality and the definitions of and .
We set so that as in (24). Using (158) in (128) and then summing from [math] to , we get for all ,
[TABLE]
where the last line of (128) has been bounded using the definition of .
The total expectation of the term in the first line of (159) is bounded above by
[TABLE]
where in the first line we used Lemma 4, , and for all and , in the second line we used the property and and in the third line we used $\mathbb{E}\left[h_{i,\tau,\mu}(L(v^{i}))\,\big|\,\mathcal{F}_{i}\right]=\mathbb{E}\left[h_{i,\tau,\mu}(L(v^{i}))\right]=h_{i,\tau,\mu}(L)$ (using Assumption 4).
We will now bound the last term in the right hand side of (159). We define
[TABLE]
We define recursively as follows. Take any and set, for ,
[TABLE]
Note that . We write, for all ,
[TABLE]
Note that for all ,
[TABLE]
which follows from and (Assumption 4).
Concerning the first term in the right hand side of (161), we have
[TABLE]
using Lemma 1(iii) with the definition of and with and in the first inequality. Summing (163) from [math] to and then taking total expectation in (161) we get
[TABLE]
using the fact that and (162). Regarding the second term in the right hand side of (164), we have
[TABLE]
using the Lipschitz continuity of and , , and in the first inequality and that and in the second inequality. The third term in the right hand side of (164) is equal to
[TABLE]
using the Cauchy-Schwarz inequality and the fact that , in the first inequality, the Lipschitz continuity of and in the second inequality, and that and in the third inequality.
From the convexity of , we get
[TABLE]
We are now ready to prove the claim. We take total expectation in (159) and combine it with (160) and (164)-(167). In order to complete the proof, we use the obtained relation, combine the expectation of the fifth term
[TABLE]
in the right hand side of (159) with (166) and use Lemma 9 with in order to obtain the final bound
[TABLE]
∎
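The solvability rate is measured with the dual gap function. As a rough illustration, the sketch below assumes the usual definition G(x) = sup over y in X of <F(y), x - y>, which may differ in detail from (24), and approximates the supremum on a finite grid for a hypothetical one-dimensional problem with X = [0, 1] and F(y) = y - 0.25, whose solution is x* = 0.25.

```python
def dual_gap(x, operator, grid):
    """Approximate G(x) = sup_{y in X} <F(y), x - y> over a finite grid of X."""
    return max(operator(y) * (x - y) for y in grid)

def toy_operator(y):
    # Hypothetical monotone operator on X = [0, 1]; the VI solution is 0.25.
    return y - 0.25

grid = [i / 1000 for i in range(1001)]  # discretization of the compact set X
gap_at_solution = dual_gap(0.25, toy_operator, grid)
gap_at_one = dual_gap(1.0, toy_operator, grid)
print("G(0.25) ~", gap_at_solution)
print("G(1.0)  ~", gap_at_one)
```

The gap vanishes at the solution and is positive at the non-solution x = 1, where the maximizer is y = 0.625 and the gap equals 0.375 squared. Compactness of X matters here because it keeps the supremum finite.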
Corollary 6 (Solvability and feasibility rates: asynchronous parameters).
Suppose that Assumptions 8-10 hold. Take stepsizes and regularization parameters as specified in Corollary 5. Then the conclusions of Theorem 7 hold, together with the following feasibility rate:
[TABLE]
If additionally is compact, the following solvability rate holds: for any ,
[TABLE]
Proof.
The stated stepsizes and regularization parameters of Corollary 5 satisfy Assumption 10, so that a.s.-convergence follows from Theorem 7. In the sequel we fix .
We first establish the feasibility rate. We have
[TABLE]
The first inequality in (168) follows from (141), which implies that is negative for all sufficiently large . The remaining inequalities in (168)-(169) follow from Corollary 5 and from the boundedness of (see (140) in Theorem 7). The claimed feasibility rate follows from (168)-(169), Theorem 8 and the fact that .
We now establish the solvability rate. We have
[TABLE]
Also, is negative for sufficiently large (as shown by relation (140)) so . This, (168)-(170) and Theorem 9 prove the claim on the solvability rate. ∎
Appendix
Proof of Proposition 1:
Proof.
Suppose that (29) holds and take . If , then (26) holds trivially. Otherwise, take with . Since , the definition of implies that is a subset of the halfspace . In view of (20) and , there exist sequences , such that , and . We claim that, taking a subsequence if needed,
[TABLE]
for all . Indeed, otherwise we would have
[TABLE]
for large enough . Dividing (172) by and letting we get which entails a contradiction. Hence, (171) holds. From (29), and we get
[TABLE]
using (171) and the fact that in the second inequality. Dividing (173) by and letting , we conclude that (26) holds for .
Now suppose that (26) holds and that is constant on . Take , and let . Since and is closed and convex, we have that , using the first equality in (21). Since is monotone and is closed and convex, is closed and convex (see e.g. Facchinei and Pang [17], Theorem 2.3.5). From this fact, the fact that and Lemma 1(i), we obtain that , using the definition of the polar cone. Thus, . We conclude from (26) that
[TABLE]
Since is constant on , we have
[TABLE]
using the fact that , which holds because and . The desired claim (29) follows from (175) and (174). ∎
- [1] Auslender, A. and Teboulle, M. (2005) Interior projection-like methods for monotone variational inequalities, Mathematical Programming, Ser. A, Vol. 104, pp. 39–68.
- [2] Bach, F. and Moulines, E. (2011) Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning, Advances in Neural Information Processing Systems (NIPS).
- [3] Bauschke, H.H. (2001) Projection algorithms: results and open problems. In: Butnariu, D., Censor, Y., Reich, S. (eds.) Inherently Parallel Algorithms in Feasibility and Optimization and their Applications, Elsevier, Amsterdam, pp. 11–22.
- [4] Bauschke, H.H. and Borwein, J.M. (1996) On projection algorithms for solving convex feasibility problems, SIAM Review, Vol. 38, pp. 367–426.
- [5] Bauschke, H.H., Combettes, P.L., Luke, D.R. (2003) Hybrid projection-reflection method for phase retrieval, Journal of the Optical Society of America, Vol. A 20, pp. 1025–1034.
- [6] Bello Cruz, J.Y. and Iusem, A.N. (2012) An explicit algorithm for monotone variational inequalities, Optimization, Vol. 61, pp. 855–871.
- [7] Bello Cruz, J.Y. and Iusem, A.N. (2010) Convergence of direct methods for paramonotone variational inequalities, Computational Optimization and Applications, Vol. 46, pp. 247–263.
- [8] Bello Cruz, J.Y. and Iusem, A.N. (2015) Full convergence of an approximate projections method for nonsmooth variational inequalities, Mathematics and Computers in Simulation, Vol. 114, pp. 2–13.
