A Fully Stochastic Primal-Dual Algorithm
Pascal Bianchi, Walid Hachem, Adil Salim

TL;DR
This paper introduces a novel stochastic primal-dual algorithm designed for composite optimization problems where functions are given as unknown statistical expectations, with proven convergence to a saddle point.
Contribution
It presents a fully stochastic primal-dual method with convergence guarantees, extending the stochastic Forward Backward algorithm to new composite optimization settings.
Findings
Proven convergence to saddle points under stochastic conditions
Applicable to convex optimization with stochastic linear constraints
Utilizes recent advances in stochastic monotone operator theory
Abstract
A new stochastic primal-dual algorithm for solving a composite optimization problem is proposed. It is assumed that all the functions/operators that enter the optimization problem are given as statistical expectations. These expectations are unknown but revealed across time through i.i.d. realizations. The proposed algorithm is proven to converge to a saddle point of the Lagrangian function. In the framework of the monotone operator theory, the convergence proof relies on recent results on the stochastic Forward Backward algorithm involving random monotone operators. An example of convex optimization under stochastic linear constraints is considered.
A Fully Stochastic Primal-Dual Algorithm
Pascal Bianchi
LTCI, Télécom Paris, IP Paris, 75013, Paris, France.
Walid Hachem
LIGM, CNRS, Univ. Gustave Eiffel, F-77454 Marne-la-Vallée, France
Adil Salim
Visual Computing Center, KAUST, Saudi Arabia.
(27 January 2020)
1 Introduction
Many applications in machine learning, statistics or signal processing require the solution of the following optimization problem. Given two Euclidean spaces and , solve
[TABLE]
where and are lower semicontinuous convex functions such that for every and belongs to the set of linear operators.
Assuming the truth of the qualification condition , where is the domain of a function and is the relative interior of a set, primal-dual methods generate a sequence of primal estimates and a sequence of dual estimates jointly converging to a saddle point of the Lagrangian function , where is the Fenchel conjugate of . There is a rich literature on such algorithms which cannot be exhaustively listed [10, 22, 14].
In this paper, it is assumed that the quantities that enter the minimization problem are unavailable or difficult to compute numerically, and have to be replaced with random quantities. Specifically, let be a probability space, and let and be two convex normal integrands (see below). Assume that and . In addition, let be a measurable function from to (i.e., a random matrix), and assume that . Finally, assume that takes the form , where is a normal convex integrand. In order to solve Problem (1), none of the objects , , and is available. Instead, the observer is given the functions , , , and , along with a sequence of independent and identically distributed (i.i.d.) random variables with the probability distribution . In this paper, a new stochastic primal-dual algorithm based on this data is proposed to solve this problem. The convergence proof for this algorithm relies on the monotone operator theory. The algorithm is built around an instantiation of the stochastic Forward-Backward (FB) algorithm involving random monotone operators that was introduced in [6]. It is proven that the weighted means of the iterates of the algorithm, where the weights are given by the step sizes of the algorithm, converge almost surely to a saddle point of the Lagrangian function.
To our knowledge, the proposed algorithm is the first method that solves Problem (1) in a fully stochastic setting with weak assumptions on the noise. Existing methods typically handle subproblems of Problem (1) in which some quantities used in this problem are assumed to be available or set to zero [16, 20, 21, 23]. In particular, the new algorithm generalizes the stochastic gradient algorithm, the stochastic proximal point algorithm [17, 21, 5], and the stochastic proximal gradient algorithm [1, 8]. A paper closely related to ours is [11], which deals with an FB algorithm with deterministic monotone operators and random additive errors. In this reference, the convergence of the iterates is established under stringent summability conditions on these errors. Random block coordinate iterations combined with the FB algorithm were also considered in [13, 7, 12].
The next section is devoted to rigorously stating the problem and the main result. An application example is also considered. Section 3 is devoted to the proof of our main theorem.
Some notations.
The notation will refer to the Borel -field of . Both the operator norm and the Euclidean vector norm will be denoted as . The distance of a point to a set is denoted as . As mentioned above, we denote as the set of linear operators, identified with matrices, from to . The set of proper, lower semicontinuous convex functions on is . The set of real-valued –summable sequences is .
2 Problem description and main result
We start by recalling some mathematical definitions. Let be a probability space where the -field is -complete, and let be a Euclidean space. A function is said to be a convex normal integrand [19] if is convex, and if the set-valued mapping is closed-valued and measurable in the sense of [19, Chap. 14], where is the epigraph of a function. We shall always assume that for –almost all . Given , denote as the subdifferential of at . For , let be the space of the -measurable functions such that . If , set , otherwise,
[TABLE]
is the set of the so-called –integrable selections of the measurable set-valued function . Denoting as the closure of a set, the so-called selection integral of is the set
[TABLE]
that might be empty. Note that we use the same notation for these set-valued expectations and for the classical single-valued expectations.
We now state our problem. Let be a convex normal integrand, assume that for all , and consider the convex function whose domain is . Let be a convex normal integrand, and let , where the integral is defined as the sum
[TABLE]
and
[TABLE]
and where the convention is used. The function is a lower semicontinuous convex function if for all , which we assume. We shall assume that is proper. In a similar manner, let be a convex normal integrand, assume that belongs to , and let be its Fenchel conjugate (thus, ). Finally, let be an operator-valued measurable function, assume that is -integrable, and let .
Having introduced these functions, our purpose is to find a solution of Problem (1), where the set of such points is assumed to be non empty. To solve this problem, the observer is given the functions , and a sequence of i.i.d. random variables from a probability space to with the probability distribution .
Denote as Moreau’s proximity operator of a function . We also denote as the least norm element of the set , which is known to exist and to be unique [4]. Similarly, will refer to the least norm element of which was introduced above. We shall also denote as a measurable subgradient of at . Specifically, is a measurable function such that for each , , which is known to be non empty thanks to the integrability assumption [18]. A possible choice for is [6, §2.3 and §3.1]. Turning back to Problem (1), our purpose will be to find a saddle point of the Lagrangian . Denoting as the set of these saddle points, an element of is characterized by the inclusions
[TABLE]
Consider a sequence of positive weights . The algorithm proposed here consists in the following iterations applied to the random vector .
[TABLE]
The convergence of Algorithm (4) is stated by the next theorem in terms of weighted averaged estimates
[TABLE]
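The weighted averaging used to state the convergence result can be illustrated with a short sketch. It assumes, as described in the introduction, that the weights are the step sizes themselves; the function name `weighted_average` and the exact index alignment between steps and iterates are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def weighted_average(iterates, steps):
    """Return the running step-size-weighted means of a sequence of iterates.

    iterates: array-like of shape (n, d), one iterate per row.
    steps:    array-like of shape (n,), positive weights (here, the step sizes).
    Row k of the result is sum_{j<=k} steps[j] * iterates[j] / sum_{j<=k} steps[j].
    """
    iterates = np.asarray(iterates, dtype=float)
    steps = np.asarray(steps, dtype=float)
    # Cumulative weighted sums divided by cumulative weights.
    num = np.cumsum(steps[:, None] * iterates, axis=0)
    den = np.cumsum(steps)[:, None]
    return num / den

averages = weighted_average([[0.0], [2.0], [4.0]], [1.0, 1.0, 2.0])
```

In this toy call, the last row is (1·0 + 1·2 + 2·4) / 4 = 2.5: iterates produced with larger steps count more in the average.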
Theorem 2.1
Consider Problem (1), and let the following assumptions hold.
1. The step size sequence satisfies , and as .
2. The function satisfies for each .
3. There exists an integer that satisfies the following conditions:
- The function is in .
- There exists a point , and three functions , , and such that
[TABLE]
Moreover, for every point , there exist three functions , , and such that (5) holds.
4. For any compact set , there exist and such that
[TABLE]
5. There exists a measurable function such that is -integrable, where is the integer provided by Assumption 3, and such that for all ,
[TABLE]
Moreover, there exists a constant such that .
6. Writing , there exists such that for all ,
[TABLE]
7. There exists such that for any and any ,
[TABLE]
where is the projection operator onto , and where is the integer provided by Assumption 3.
8. Assumptions 2, 4, 6 and 7 hold true when the function is replaced with and the space is replaced with .
Then, the sequence is bounded in and the sequence converges almost surely (a.s.) to a random variable supported by .
Let us now discuss our assumptions. Assumption 1 is standard in the decreasing step case. Assumption 2 requires that the interchange of the expectation and the subdifferentiation be possible. Let us provide some sufficient conditions for this to be true. By [18], this will be the case if the following conditions hold: i) the set-valued mapping is constant -a.e., where is the domain of , ii) whenever -a.e., iii) there exists at which is finite and continuous. Another case of practical importance where this interchange is permitted is the following. Let be a positive integer, and let be a collection of closed and convex subsets of . Let be non empty, and assume that the normal cone of at satisfies the identity for each , where the summation is the usual set summation. As is well known, this identity holds true under a qualification condition of the type (see also [3] for other conditions). Now, assume that and that is an arbitrary probability measure putting a positive weight on each . Let be the indicator function
[TABLE]
Then it is obvious that is a convex normal integrand, , and . We can also combine these two types of conditions: let be a probability space, where is -complete, and let be a convex normal integrand satisfying the conditions i)–iii) above. Consider the closed and convex sets introduced above, and let be a probability measure on the set such that for each . Now, set , , and define as
[TABLE]
where . Then it is clear that
[TABLE]
and
[TABLE]
Assumption 3 is a moment assumption that is generally easy to check. Note that this assumption requires the set of saddle points to be non empty. Notice the relation between Equations (5) and the two inclusions in (3). Focusing on the first inclusion and using Assumption 2, there exist and such that . Then, Assumption 3 states that and can be taken in such a way that there are two measurable selections and of and respectively which are both in and which satisfy and . A sufficient condition for the existence of the selections satisfying Assumption 3 is the following [8]: there exists an open neighborhood of and an open neighborhood of such that , and , and , . Note also that the larger is , the weaker Assumption 7 is.
Assumption 4 is relatively weak and easy to check. It is interesting to compare it with Assumption 5. It is indeed much weaker than the latter, which assumes that the growth of is not faster than linear. This is due to the fact that and enter the algorithm (4) through the proximity operator while the function is used explicitly in this algorithm (through its (sub)gradient). This use of the functions is reminiscent of the well-known Robbins-Monro algorithm, where a linear growth is needed to ensure the algorithm stability. Note that Assumption 5 is satisfied under the more restrictive assumption that is -Lipschitz continuous without any bounded gradient assumption.
Assumption 6 is quite weak, and is studied e.g., in [15]. This assumption is easy to illustrate in the case where as in (6). Following [3], we say that the subsets are linearly regular if there exists such that for every ,
[TABLE]
Sufficient conditions for a collection of sets to satisfy the above condition can be found in [3] and the references therein. Note that this condition implies that . Let us finally discuss Assumption 7. As , it is known that converges to for every . Assumption 7 provides a control on the convergence rate. This assumption holds under the sufficient condition that for -almost every and for every ,
[TABLE]
where is a positive random variable with a finite fourth moment [5].
We now consider an application example of Theorem 2.1.
Example 1
Let . Setting , where is the indicator function of the set , Problem (1) boils down to the linearly constrained problem
[TABLE]
If we assume that where is a random vector, then our problem amounts to randomizing the constraints and to handling these stochastic constraints online. Such a context is encountered in various fields of machine learning, such as Neyman-Pearson classification, or in online Markowitz portfolio optimization.
Since , we simply need to put , and Algorithm (4) becomes:
[TABLE]
To go further, let us particularize Problem (7) to the case of the Markowitz portfolio optimization, and check the assumptions of Theorem 2.1 to complete the picture. In this case, is a –valued random variable with a second moment, , where is the probability simplex, , and is some real positive number. Note that it is usually assumed that is fully known or estimated, which we do not do here. We of course assume that the qualification condition holds true.
Assumptions 2 and 4 of the statement of Theorem 2.1 are immediate for both and . One can check that Assumption 3 is satisfied for if we assume that , which also ensures the truth of Assumption 5. Assumptions 6 and 7 are trivially satisfied for and , since , and since has a full domain.
3 Proof of Theorem 2.1
The proof of Theorem 2.1 makes use of the monotone operator theory. We begin by recalling some basic facts on monotone operators. All the results below can be found in [9, 4] without further mention.
A set-valued mapping on the Euclidean space will be called herein an operator. An operator with singleton values is identified with a function. As above, the domain of is . The graph of is . The operator is said to be monotone if , . A monotone operator with non empty domain is said to be maximal if is a maximal element for the inclusion ordering in the family of the monotone operator graphs. Let be the identity operator, and let be the inverse of , which is defined by the fact that . An operator belongs to the set of the maximal monotone operators on if and only if for each , the so-called resolvent is a contraction (in the sense of a 1-Lipschitz mapping) defined on the whole space . In particular, it is single-valued. A typical element of is the subdifferential of a function . In this case, the resolvent for coincides with the proximity operator . A skew-symmetric element of can also be checked to be an element of .
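As a concrete illustration of the identity between the resolvent of a subdifferential and the proximity operator, the sketch below takes the absolute value function, whose proximity operator is the well-known soft-thresholding map. The choice of this function and the names `soft_threshold`, `x`, `gamma` are illustrative assumptions and do not appear in the paper.

```python
import numpy as np

def soft_threshold(x, gamma):
    """Proximity operator of gamma * |.|, applied componentwise."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

x = np.array([-3.0, -0.5, 0.2, 2.0])
gamma = 1.0
y = soft_threshold(x, gamma)

# Resolvent characterization: y = (I + gamma * A)^{-1}(x) with A the
# subdifferential of |.| means that (x - y) / gamma belongs to A(y),
# i.e. equals sign(y) where y != 0 and lies in [-1, 1] where y == 0.
u = (x - y) / gamma
in_subdifferential = np.where(y != 0, np.isclose(u, np.sign(y)), np.abs(u) <= 1.0)
```

Every entry of `in_subdifferential` is true here, which is the inclusion defining the resolvent. The map is also 1-Lipschitz, in line with the characterization of maximal monotone operators recalled above.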
The set of zeros of an operator on is the set . The sum of two operators and is the operator whose image at is the set sum of and . Given two operators , where is single-valued with domain , the FB algorithm is an iterative algorithm for finding a point in . It reads
[TABLE]
where is a positive step.
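A minimal deterministic sketch of the FB iteration may help fix ideas. It assumes the toy problem of minimizing 0.5 (x - 3)^2 + |x| over the real line, so that the single-valued operator is x ↦ x - 3 and the set-valued one is the subdifferential of |.|; this problem, the constant step, and the function names are illustrative assumptions, not the setting of the paper.

```python
def soft_threshold(v, gamma):
    """Resolvent of gamma times the subdifferential of |.| (scalar case)."""
    if v > gamma:
        return v - gamma
    if v < -gamma:
        return v + gamma
    return 0.0

def forward_backward(x0, gamma, n_iter):
    """Run n_iter forward-backward steps on 0.5 * (x - 3)**2 + |x|."""
    x = x0
    for _ in range(n_iter):
        forward = x - gamma * (x - 3.0)      # forward (explicit) step on B(x) = x - 3
        x = soft_threshold(forward, gamma)   # backward (implicit) step: resolvent of gamma * A
    return x

x_final = forward_backward(0.0, 0.5, 60)
```

With step 0.5, the iteration on the positive half-line reduces to x ↦ 0.5 x + 1, whose fixed point 2 is the unique zero of the sum of the two operators, that is, the minimizer of the toy objective.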
In the sequel, we shall be interested in random elements of as used in [5, 6, 8]. A random element of is a measurable function in the sense of [2], where is the probability space introduced at the beginning of Section 2. In particular, when is a convex normal integrand such that is proper -a.e., is a random element of . Moreover, when is a skew-symmetric element of which is measurable in the usual sense (as a function), then it is also a random element of . If we fix and we denote as its image by , then the set-valued function is measurable, and its (set-valued) expectation is defined similarly to Equation (2) [2, 5, 6]. Note that is monotone but not necessarily maximal.
We now enter the proof of Theorem 2.1. Let us set , and endow this Euclidean space with the standard scalar product. By writing , it will be understood that and . For each , define the set-valued operator on as the operator that takes to
[TABLE]
Fixing , the operator coincides with the subdifferential of the convex normal integrand with respect to . Thus, is a random element of . Let us also define the operator as
[TABLE]
We can write , where
[TABLE]
( is a linear skew-symmetric operator written in a matrix form in ). For each , both these operators belong to , and . Thus, by [4, Cor. 24.4]. Moreover, since both and are measurable, is a random element of .
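The monotonicity of a linear skew-symmetric operator can be checked numerically. The sketch below assumes the usual primal-dual block form, mapping a pair (x, λ) to (Lᵀλ, -Lx) for a matrix L; the dimensions, the random matrix, and the sign convention are illustrative assumptions, since the exact matrix is not reproduced in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear operator from R^3 (primal space) to R^2 (dual space).
L = rng.standard_normal((2, 3))

# Assumed block form of the skew-symmetric part: (x, lam) -> (L^T lam, -L x).
S = np.block([
    [np.zeros((3, 3)), L.T],
    [-L, np.zeros((2, 2))],
])

def monotonicity_gap(z1, z2):
    """Return <S z1 - S z2, z1 - z2>, which vanishes for a skew-symmetric S."""
    d = z1 - z2
    return float((S @ d) @ d)

z1 = rng.standard_normal(5)
z2 = rng.standard_normal(5)
gap = monotonicity_gap(z1, z2)
```

Since the transpose of S equals -S, the inner product is zero up to rounding, so the monotonicity inequality holds with equality. The skew-symmetric part therefore contributes no strong monotonicity of its own.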
Since is Lebesgue-integrable for all by construction, it is known that [18]. Moreover, and by Assumptions 2 and 8. Thus, the operators and can be written as
[TABLE]
thus, these monotone operators are both maximal. By [4, Cor. 24.4], we also get that belongs to . Moreover, recalling the system of inclusions (3), we also obtain that .
Defining the function
[TABLE]
(obviously, -a.e.), let us consider the following version of the FB algorithm
[TABLE]
On the one hand, one can easily check that this is exactly Algorithm (4). On the other hand, this algorithm is an instance of the random FB algorithm studied in [6]. By checking the assumptions of Theorem 2.1 one by one, one sees that the assumptions of [6, Th. 3.1 and Cor. 3.1] are verified. Theorem 2.1 follows.
Remark 1
The convergence stated by Theorem 2.1 concerns the averaged sequence . One can ask whether the sequence itself converges to . This would happen if the operator were so-called demipositive [6]. This happens when, e.g., is strongly convex and is smooth (proof omitted). Unfortunately, demipositivity of is not always guaranteed.
[1] Y. F. Atchadé, G. Fort, and E. Moulines. On perturbed proximal gradient algorithms. Journal of Machine Learning Research, 18(1):310–342, 2017.
[2] H. Attouch. Familles d’opérateurs maximaux monotones et mesurabilité. Annali di Matematica Pura ed Applicata, 120(1):35–111, 1979.
[3] H. H. Bauschke, J. M. Borwein, and W. Li. Strong conical hull intersection property, bounded linear regularity, Jameson’s property (G), and error bounds in convex optimization. Mathematical Programming, 86(1):135–160, 1999.
[4] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York, 2011.
[5] P. Bianchi. Ergodic convergence of a stochastic proximal point algorithm. SIAM Journal on Optimization, 26(4):2235–2260, 2016.
[6] P. Bianchi and W. Hachem. Dynamical behavior of a stochastic forward-backward algorithm using random monotone operators. Journal of Optimization Theory and Applications, 171(1):90–120, 2016.
[7] P. Bianchi, W. Hachem, and F. Iutzeler. A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization. IEEE Transactions on Automatic Control, 61(10):2947–2957, Oct 2016.
[8] P. Bianchi, W. Hachem, and A. Salim. A constant step Forward-Backward algorithm involving random maximal monotone operators. Journal of Convex Analysis, 26(2):397–436, 2019.
