Corporative Stochastic Approximation with Random Constraint Sampling for Semi-Infinite Programming
Bo Wei, William B. Haskell, Sixiang Zhao

TL;DR
This paper introduces a new stochastic approximation algorithm for semi-infinite programming that handles inexact constraint solving and achieves optimal convergence rates under convexity assumptions.
Contribution
It proposes a novel CSA algorithm with random constraint sampling schemes and provides convergence guarantees for convex and strongly convex cases.
Findings
Achieves an $\mathcal{O}(1/\sqrt{N})$ convergence rate for convex functions.
Improves to an $\mathcal{O}(1/N)$ rate for strongly convex functions.
Provides error bounds for inexact CSA in semi-infinite programming.
Abstract
We develop a corporative stochastic approximation (CSA) type algorithm for semi-infinite programming (SIP), where the cut generation problem is solved inexactly. First, we provide general error bounds for inexact CSA. Then, we propose two specific random constraint sampling schemes to approximately solve the cut generation problem. When the objective and constraint functions are generally convex, we show that our randomized CSA algorithms achieve an $\mathcal{O}(1/\sqrt{N})$ rate of convergence in expectation (in terms of optimality gap as well as SIP constraint violation). When the objective and constraint functions are all strongly convex, this rate can be improved to $\mathcal{O}(1/N)$.
1 Introduction
In this paper, we combine the corporative stochastic approximation (CSA) method developed in [29] with inexact cut generation for semi-infinite programming (SIP). In particular, we focus on random sampling methods to approximately solve the SIP cut generation problem. The SIP cut generation problem is usually non-linear and non-convex, so it is difficult to solve to global optimality deterministically. We propose two specific random constraint sampling schemes to overcome this difficulty, and the resulting randomized CSA algorithms perform well on SIP problems and come with theoretically guaranteed convergence rates.
1.1 Previous work
We refer the reader to [4, 17, 22, 35, 47] for recent detailed overviews of SIP. The main computational difficulty in SIP comes from the infinitely many constraints, and several practical schemes have been proposed to remedy this difficulty [15, 16, 35, 45]. We offer the following very rough classification of SIP methods based on [22, 35, 45].
Exchange methods: In exchange methods, in each iteration a set of new constraints is exchanged for the previous set (there are many ways to do this). Cutting plane methods are a special case where constraints are never dropped. The algorithm in [19] is the prototype for several SIP cutting plane schemes, and it has been improved in various ways [2, 27, 37]. In particular, a new exchange method is proposed in [49] that only keeps those active constraints with positive Lagrange multipliers. New constraints are selected using a certain computationally-cheap criterion. In [37], the earlier central cutting plane algorithm from [27] is extended to allow for nonlinear convex cuts.
Randomized cutting plane algorithms have recently been developed for SIP in [5, 6, 12]. The idea is to input a probability distribution over the constraints, randomly sample a modest number of constraints, and then solve the resulting relaxed problem. Intuitively, as long as a sufficient number of samples of the constraints is drawn, the resulting randomized solution should violate only a small portion of the constraints and achieve near optimality.
Discretization methods: In the discretization approach, a sequence of relaxed problems with a finite number of constraints is solved according to a predefined or adaptively controlled grid generation scheme [44, 48]. Discretization methods are generally computationally expensive. The convergence rate of the error between the solution of the SIP problem and the solution of the discretized program is investigated in [48].
Local reduction methods: In the local reduction approach, an SIP problem is reduced to a problem with a finite number of constraints [18]. The reduced problem involves constraints which are defined only implicitly, and the resulting problem is solved via the Newton method which has good local convergence properties. However, local reduction methods require strong assumptions and are often conceptual.
Dual methods: A wide class of SIP algorithms is based on directly solving the KKT conditions. In [25, 33, 34], the authors derive Wolfe’s dual for an SIP and discuss numerical schemes for this problem. The KKT conditions often have some degree of smoothness, and so various Newton-type methods can be applied [30, 39, 42, 43]. However, feasibility is not guaranteed under these Newton-type methods. A new smoothing Newton-type method is proposed to overcome this drawback in [32].
Applications: SIP is the basis of the approximate linear programming (ALP) approach for dynamic programming. Randomly sampling state-action pairs is shown to give a tractable relaxed linear programming problem, as explored in [3, 11, 13]. In [3, 13], the sampling distribution is assumed to be the occupation measure corresponding to the optimal policy. In [31], an adaptive constraint sampling approach called ’ALP-Secant’ is developed which is based on solving a sequence of saddle-point problems. It is shown that ALP-Secant returns a near optimal ALP solution and a lower bound on the optimal cost with high probability in a finite number of iterations.
Many risk-aware optimization models also depend on SIP (e.g. [40, 41]), in particular, risk-constrained optimization (e.g. [7, 8, 9, 10, 21, 23, 24]). In [7, 8, 9, 20], a duality theory for stochastic dominance constrained optimization is developed which shows the special role of utility functions as Lagrange multipliers. Relaxations of multivariate stochastic dominance have been proposed based on various parametrized families of utility functions, see [9, 20, 23, 24]. Computational aspects of the increasing concave stochastic dominance constrained optimization are discussed in [21, 23, 24].
1.2 Contributions
We summarize our main contributions in this work as follows:
1. We give error bounds for inexact CSA (where the cut generation problem is solved inexactly). These error bounds are general, and may form the basis for the convergence analysis of many CSA-type algorithms.
2. We develop two specialized CSA algorithms where random sampling is used to approximately solve the cut generation problem. The first algorithm is based on using a fixed sampling distribution, in line with [5, 6, 12]. Intuitively, as long as a sufficiently large number of samples is drawn, the resulting randomized solution should violate only a "small portion" of the constraints. The second algorithm is based on adaptively sampling the constraints based on information from the current iterate. In particular, we compute the analytical solution of a regularized cut generation problem for the current iterate, and then use this distribution to do adaptive sampling.
3. We provide a stochastic convergence analysis for both our specialized CSA algorithms based on our general error bounds. We show that as the errors in cut generation decrease at appropriate rates, our specialized CSA algorithms achieve the same convergence rate as in the error-free case. When the objective and constraint functions are convex, both algorithms achieve an $\mathcal{O}(1/\sqrt{N})$ rate of convergence in expectation, in terms of optimality gap and constraint violation. If the objective and constraint functions are strongly convex, this rate can be improved to $\mathcal{O}(1/N)$.
This paper is organized as follows. We first provide preliminary material in Section 2. The following Section 3 describes a general inexact CSA algorithm, and then provides error bounds (in terms of the error in solving each cut generation problem). Next, in Section 4, we give the formal details for our two specialized CSA algorithms and report their convergence rates. For clearer organization, the detailed proofs of all our results are gathered together in Section 5. We then present some numerical experiments for CSA with random sampling in Section 6. Finally, we conclude the paper in Section 7 with a discussion of further issues and future research.
Notation
We make use of the following basic notation throughout the paper. For $a \in \mathbb{R}$, the ceiling function $\lceil a \rceil$ returns the smallest integer greater than or equal to $a$. The Euclidean norm and inner product on $\mathbb{R}^n$ are $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$, respectively. The Euclidean ball with radius $r$ centered at $x$ is $B_r(x)$. For a function $f$, we denote its subdifferential by $\partial f$ and a subgradient of $f$ at $x$ by $f'(x)$, respectively.
We also make use of the following further notation. For any set $S$, $\mathcal{P}(S)$ is the space of probability distributions on $S$. The Kullback-Leibler divergence is
$$D(\phi, \psi) := \int \phi(\delta) \log \frac{\phi(\delta)}{\psi(\delta)} \, d\delta$$
for probability densities $\phi$ and $\psi$. For any integer $M \geq 1$, we denote the $M$-fold Cartesian product of $S$ by $S^M$. Finally, for any probability distribution $\mathbb{P}$ over the set $S$, the product measure and the associated expectation on $S^M$ are denoted by $\mathbb{P}^M$ and $\mathbb{E}_{\mathbb{P}^M}$, respectively.
2 Preliminaries
We begin our discussion of SIP with the following problem ingredients:
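A1
Convex, compact decision set $X \subset \mathbb{R}^n$;
A2
Convex objective function $f : X \to \mathbb{R}$, which is Lipschitz continuous with constant $L_f$;
A3
Compact constraint index set $\Delta \subset \mathbb{R}^d$;
A4
Constraint function $g : X \times \Delta \to \mathbb{R}$, such that for each $\delta \in \Delta$, $x \mapsto g(x, \delta)$ is convex and Lipschitz continuous with constant $L_{g,X}$;
A5
For all $x \in X$, $\delta \mapsto g(x, \delta)$ is Lipschitz continuous with constant $L_{g,\Delta}$.
We write the constraints as a single function $G(x) := \max_{\delta \in \Delta} g(x, \delta)$. The resulting semi-infinite programming problem is:
$$\min_{x \in X} \left\{ f(x) \, : \, G(x) \leq 0 \right\}. \tag{1}$$
Problem (1) is a convex optimization problem under Assumptions A1, A2, and A4. Formally, we also assume that Problem (1) is solvable.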
Assumption 2.1.
An optimal solution $x^{\ast}$ of Problem (1) exists.
To continue, we recall some fundamental concepts of convex analysis.
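Definition 2.2.
A function $h : X \to \mathbb{R}$ is strongly convex with parameter $\mu > 0$, if for any $x, y \in X$ we have
$$h(y) \geq h(x) + \langle h'(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2.$$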
The distance generating function and its associated prox-function are defined as follows.
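Definition 2.3.
(i) A function $\omega : X \to \mathbb{R}$ is a distance generating function with parameter $\alpha > 0$, if $\omega$ is continuously differentiable and strongly convex with parameter $\alpha$.
(ii) (Bregman’s distance) The prox-function associated with $\omega$ is $V(x, z) := \omega(z) - \omega(x) - \langle \nabla \omega(x), z - x \rangle$.
(iii) The prox-mapping is $P_{x,X}(y) := \arg\min_{z \in X} \left\{ \langle y, z \rangle + V(x, z) \right\}$.
Without loss of generality, we may assume that $\alpha = 1$ in part (i) of the preceding definition since we can always re-scale $\omega(x)$ to become $\omega(x)/\alpha$. The distance generating function gives a measure of the diameter of $X$, i.e. $D_X := \sqrt{\max_{x, z \in X} V(x, z)}$. Clearly, the diameter satisfies $D_X < \infty$ as long as $X$ is bounded.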
We assume that the prox-function is chosen such that the prox-mapping can be easily computed. The next result follows from the definition of the prox-function.
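Lemma 2.4.
[38, Lemma 2.1] For every $u, x \in X$ and $y \in \mathbb{R}^n$, we have
$$V(P_{x,X}(y), u) \leq V(x, u) + \langle y, u - x \rangle + \frac{1}{2} \|y\|^2.$$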
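As a concrete instance of Definition 2.3, the sketch below takes the Euclidean distance generating function $\omega(x) = \frac{1}{2}\|x\|^2$. In that case the prox-function is half the squared Euclidean distance, and the prox-mapping reduces to projecting $x - y$ onto the decision set. The box-shaped decision set, the test vectors, and the function names are assumptions chosen purely for illustration; they do not come from the paper.

```python
import numpy as np


def prox_function(x, z):
    """Bregman distance V(x, z) for omega(x) = 0.5 * ||x||^2."""
    diff = z - x
    return 0.5 * float(diff @ diff)


def prox_mapping_box(x, y, lower, upper):
    """Minimizer over the box [lower, upper]^n of <y, z> + V(x, z).

    For the Euclidean prox-function this is the projection of x - y onto the box.
    """
    return np.clip(x - y, lower, upper)


# Illustrative data (assumed, not from the paper).
x = np.array([0.5, -0.2])
y = np.array([1.0, -3.0])
lower, upper = -1.0, 1.0
z_star = prox_mapping_box(x, y, lower, upper)

# Brute-force check of the closed form on a fine grid over the box.
grid = np.linspace(lower, upper, 201)
Z1, Z2 = np.meshgrid(grid, grid, indexing="ij")
objective = y[0] * Z1 + y[1] * Z2 + 0.5 * ((Z1 - x[0]) ** 2 + (Z2 - x[1]) ** 2)
i, j = np.unravel_index(np.argmin(objective), objective.shape)
z_grid = np.array([grid[i], grid[j]])

print("closed form:", z_star, "grid search:", z_grid)
```

Other distance generating functions (for example, entropy on a simplex) lead to different closed forms; the Euclidean choice is used here only because it keeps the computation transparent.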
3 General Error Bounds for Inexact CSA
In this section, we derive general error bounds for inexact CSA applied to Problem (1). These error bounds form the basis of our convergence analysis for the two specialized CSA algorithms that we consider in the next section.
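The (general) CSA algorithm works as follows. We let $\{x_k\}_{k \geq 1}$ denote the sequence of iterates of the algorithm, $\{\gamma_k\}_{k \geq 1}$ a sequence of step-sizes with all $\gamma_k > 0$, and $\{\eta_k\}_{k \geq 1}$ a sequence of error tolerances for constraint violation with all $\eta_k > 0$. At each iteration $k$, we need to solve the cut generation problem
$$\max_{\delta \in \Delta} g(x_k, \delta) \tag{2}$$
to determine if $x_k$ is feasible or to identify any violated constraints. After we obtain a (possibly inexact) solution $\delta_k \in \Delta$ of Problem (2), CSA performs a projected subgradient step with step-size $\gamma_k$ along either $f'(x_k)$ or $g'(x_k, \delta_k)$ (a subgradient of $g(\cdot, \delta_k)$ at $x_k$), depending on whether the condition $g(x_k, \delta_k) \leq \eta_k$ is satisfied (i.e. depending on whether the constraint violation is below our error tolerance or not).
Let $N$ denote the total number of iterations of the algorithm. For some $1 \leq s \leq N$, we may partition the indices
$$I := \{s, \ldots, N\}$$
into two subsets:
$$\mathcal{B} := \{ k \in I \, : \, g(x_k, \delta_k) \leq \eta_k \} \quad \text{and} \quad \mathcal{N} := I \setminus \mathcal{B}.$$
The set $\mathcal{B}$ counts those iterations within $I$ for which the constraint violation of $x_k$ corresponding to $\delta_k$ is less than our tolerance $\eta_k$. When the algorithm terminates, it returns the weighted average
$$\bar{x}_{N,s} := \Big( \sum_{k \in \mathcal{B}} \gamma_k \Big)^{-1} \sum_{k \in \mathcal{B}} \gamma_k x_k$$
of iterates over $\mathcal{B}$ (which only indexes those iterates where we believe the constraint violation is small). The general inexact CSA algorithm is summarized in Algorithm 1.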
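The loop described above can be sketched on a toy problem. Everything specific in this sketch is an assumption made for illustration and is not taken from the paper: the toy problem (minimize $-x_1 - x_2$ over the box $[-2, 2]^2$ subject to $\cos(\delta) x_1 + \sin(\delta) x_2 - 1 \leq 0$ for all $\delta \in [0, 2\pi]$, so the feasible set is the unit disk), the Euclidean prox-mapping, the choice $s = N/2$, the uniform sampling used for inexact cut generation, and the $1/\sqrt{k}$ step-size and tolerance schedules, which merely stand in for the paper's parameter policy.

```python
import numpy as np


def g(x, delta):
    """Toy constraint g(x, delta) = cos(delta) * x1 + sin(delta) * x2 - 1."""
    return np.cos(delta) * x[0] + np.sin(delta) * x[1] - 1.0


def inexact_csa(num_iters, num_samples, rng, step_scale=1.0, tol_scale=1.0):
    """Sketch of an inexact CSA loop with Euclidean projection onto [-2, 2]^2."""
    x = np.zeros(2)
    s = num_iters // 2                      # assumed choice of the start index s
    weighted_sum = np.zeros(2)
    weight_total = 0.0
    for k in range(1, num_iters + 1):
        gamma = step_scale / np.sqrt(k)     # assumed step-size schedule
        eta = tol_scale / np.sqrt(k)        # assumed tolerance schedule
        # Inexact cut generation: best of a few uniformly sampled constraints.
        candidates = rng.uniform(0.0, 2.0 * np.pi, size=num_samples)
        values = g(x, candidates)
        best = int(np.argmax(values))
        if values[best] <= eta:
            # Estimated violation is within tolerance: objective step.
            if k >= s:
                weighted_sum += gamma * x
                weight_total += gamma
            direction = np.array([-1.0, -1.0])  # gradient of f(x) = -x1 - x2
        else:
            # Estimated violation exceeds tolerance: step on the sampled constraint.
            direction = np.array([np.cos(candidates[best]), np.sin(candidates[best])])
        x = np.clip(x - gamma * direction, -2.0, 2.0)
    if weight_total == 0.0:
        raise RuntimeError("no iterate passed the tolerance test; output undefined")
    return weighted_sum / weight_total


rng = np.random.default_rng(0)
x_bar = inexact_csa(num_iters=20000, num_samples=50, rng=rng)
objective_value = -x_bar[0] - x_bar[1]
violation = np.linalg.norm(x_bar) - 1.0     # exact G(x_bar) for this toy problem
print(x_bar, objective_value, violation)
```

For this toy problem the optimal value is $-\sqrt{2}$, attained at $(1/\sqrt{2}, 1/\sqrt{2})$, so the printed objective value and violation give a rough sense of how close the averaged iterate is.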
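The cut generation problem is typically a non-convex optimization problem. Generally speaking, there is no fast algorithm that can solve this problem deterministically. In our case, the error in each iteration comes from inexact solution of $\max_{\delta \in \Delta} g(x_k, \delta)$. We denote the error in cut generation as
$$\epsilon_k := \max_{\delta \in \Delta} g(x_k, \delta) - g(x_k, \delta_k).$$
Note that the errors $\epsilon_k$ are always nonnegative since $g(x_k, \delta_k) \leq \max_{\delta \in \Delta} g(x_k, \delta)$ for all $k$ by definition.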
Below we give a specific selection of the parameters , , and to be used in Algorithm 1:
[TABLE]
for all . The following result shows that is well-defined under this policy.
Lemma 3.1.
Suppose is generated by Algorithm 1 with policy (4), then the set , i.e., is well-defined.
Now we will bound the optimality gap and constraint violation of in terms of the errors from inexact cut generation. The result of Theorem 3.2 is online since policy (4) does not depend on knowing in advance, and thus we may stop or continue the algorithm anytime. In particular, the weighted average from Theorem 3.2 gives decreasing weight to older iterates .
Theorem 3.2.
Suppose is generated by Algorithm 1 with policy (4), then for any we have
[TABLE]
and
[TABLE]
Remark 3.3.
The bound on the optimality gap does not depend on the errors in cut generation, since objective function evaluations are error free (in contrast to inexact evaluation of the constraint function ).
We can improve the convergence rate when the objective function and the constraint functions are all strongly convex. To proceed, we introduce a new assumption on the quadratic growth of the prox-function .
Assumption 3.4.
(i) The objective function is strongly convex with parameter , and the constraint functions are all strongly convex with parameter (uniformly in all ).
(ii) There exists , such that .
The constants in Assumption 3.4 appear in our parameter selection policy for the strongly convex case. For all , let be the step-sizes used in our algorithms, and denote
[TABLE]
For the strongly convex case, the output of Algorithm 1 is modified to
[TABLE]
Our new policy is given as follows: for
[TABLE]
The following result shows that is well-defined for this policy as well.
Lemma 3.5.
Suppose Assumption 3.4 holds. Suppose is generated by Algorithm 1 with policy (5), then the set , i.e., is well-defined.
Now we give an improved error bound for inexact CSA under policy (5) for the strongly convex case.
Theorem 3.6.
Suppose Assumption 3.4 holds. Let be generated by Algorithm 1 with policy (5), then for any we have
[TABLE]
and
[TABLE]
Remark 3.7.
In the strongly convex case, the convergence rate may be improved to $\mathcal{O}(1/N)$ if the errors in cut generation decrease at an appropriate rate.
4 Random Constraint Sampling
As we have already pointed out, the cut generation Problem (2) is a general nonlinear non-convex optimization problem, and there is no fast algorithm that can solve such a problem deterministically. In this section, we describe two random constraint sampling schemes that can approximately solve the cut generation problem. The first scheme is based on sampling from a fixed probability distribution (Subsection 4.1), while the second scheme is based on sampling adaptively from a probability distribution that is updated in each iteration based on the current iterate (Subsection 4.2).
4.1 Fixed Constraint Sampling
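In this subsection, we approximately solve the cut generation Problem (2) by sampling from a fixed distribution on $\Delta$. To begin, we take a probability distribution $\mathbb{P} \in \mathcal{P}(\Delta)$ on $\Delta$ as user input. To solve Problem (2) at iteration $k$, we let $\{\delta_k^{(i)}\}_{i=1}^{M_k}$ (where $M_k \geq 1$ is the sample size at iteration $k$) be independent identically distributed (i.i.d.) samples from $\Delta$ generated according to $\mathbb{P}$. Then, we define
$$\delta_k \in \arg\max_{i = 1, \ldots, M_k} g(x_k, \delta_k^{(i)})$$
to be the element among $\{\delta_k^{(i)}\}_{i=1}^{M_k}$ which maximizes $g(x_k, \cdot)$.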
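A minimal sketch of this fixed sampling step is given below, assuming a toy constraint $g(x, \delta) = \cos(\delta) x_1 + \sin(\delta) x_2 - 1$ on the index set $[0, 2\pi]$ and a uniform sampling distribution. The constraint, the distribution, the sample sizes, and the function names are hypothetical and only illustrate how the best sampled constraint approaches the exact maximum (here $\|x\| - 1$) as the sample size grows.

```python
import numpy as np


def g(x, delta):
    """Toy constraint function; its exact maximum over delta is ||x|| - 1."""
    return np.cos(delta) * x[0] + np.sin(delta) * x[1] - 1.0


def sampled_cut(x, num_samples, rng):
    """Best of num_samples i.i.d. uniform draws from [0, 2*pi].

    Returns the selected index delta and the corresponding constraint value.
    """
    candidates = rng.uniform(0.0, 2.0 * np.pi, size=num_samples)
    values = g(x, candidates)
    best = int(np.argmax(values))
    return candidates[best], values[best]


rng = np.random.default_rng(1)
x = np.array([1.0, 1.0])
exact_value = np.linalg.norm(x) - 1.0

# Cut generation error for increasing sample sizes.
gaps = {m: exact_value - sampled_cut(x, m, rng)[1] for m in (10, 100, 2000)}
for m, gap in gaps.items():
    print(f"M = {m:5d}: cut generation error = {gap:.6f}")
```

The error is never negative, because a sampled constraint cannot exceed the exact maximum, and it can only be driven down by drawing more samples since the distribution itself never adapts to the iterate.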
We need the following assumption on the sampling distribution .
Assumption 4.1.
There exists a strictly increasing function such that , for all and all open balls .
The above assumption means that the sampling distribution has support on all of the constraint index set $\Delta$; it also appears in Proposition 3.8 of [12]. For more discussion, the reader is referred to Assumption 3.1 of [26].
Intuitively, as long as the number of samples is large enough, we expect will be close to with high probability with respect to . We have a result in expectation for the approximation quality. For , in , we define
[TABLE]
which will appear in the next result to denote the threshold of sample size. Denote the lower bound and upper bound of over as and (due to the continuity of , and the compactness of and ), respectively, i.e.,
[TABLE]
Proposition 4.2.
Suppose Assumption 4.1 holds. Given , for i.i.d. samples generated from , we have .
We now investigate the convergence of inexact CSA based on this fixed sampling scheme. We define
[TABLE]
to be the probability distribution of the samples $\big\{ \{\delta_k^{(i)}\}_{i=1}^{M_k} \big\}_{k \in \mathcal{B}}$ on the space $\prod_{k \in \mathcal{B}} \Delta^{M_k}$.
Theorem 4.3.
Suppose Assumption 4.1 holds. Suppose is generated by Algorithm 1 under policy (4). Take , and for all . Then, for any , we have
[TABLE]
and
[TABLE]
In view of Theorem 4.3, we see that inexact CSA with fixed random constraint sampling achieves an $\mathcal{O}(1/\sqrt{N})$ rate of convergence in expectation (with respect to the distribution of the samples) for solving Problem (1) in the general convex case. Next we consider an improved convergence rate for the strongly convex case.
Theorem 4.4.
Suppose Assumptions 3.4 and 4.1 hold. Suppose is generated by Algorithm 1 under policy (5). Take , and for all . Then, for any , we have
[TABLE]
and
[TABLE]
4.2 Adaptive Constraint Sampling
In this subsection, we consider an alternative adaptive constraint sampling scheme for the cut generation Problem (2). In particular, in iteration we will construct a constraint sampling distribution that is tailored to the current iterate . More specifically, for any and , we want to find a probability distribution (which depends on and ) on , such that
[TABLE]
which guarantees that the samples generated from this distribution are very likely to solve our cut generation Problem (2). Then, in each iteration we will construct such a distribution from , and use it to guide our next round of random constraint sampling.
To continue, we introduce a new assumption on the set .
Assumption 4.5.
The set is full dimensional and convex.
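The following preliminary lemma is key for our adaptive sampling scheme. It establishes an equivalence between the general nonlinear finite-dimensional optimization problem $\max_{\delta \in \Delta} g(x, \delta)$ and the infinite-dimensional linear optimization problem
$$\max_{\phi \in \mathcal{P}(\Delta)} \mathbb{E}_{\phi}[g(x, \delta)]$$
in probability distributions.
Lemma 4.6.
For all $x \in X$, $\max_{\delta \in \Delta} g(x, \delta) = \max_{\phi \in \mathcal{P}(\Delta)} \mathbb{E}_{\phi}[g(x, \delta)]$.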
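Let $\phi_u$ denote the uniform probability distribution on $\Delta$, that is,
$$\phi_u(\delta) := \frac{1}{\int_{\Delta} d\delta'} \quad \text{for all } \delta \in \Delta.$$
We define a regularized cut generation problem as follows,
$$\max_{\phi \in \mathcal{P}(\Delta)} \left\{ \mathbb{E}_{\phi}[g(x, \delta)] - \kappa \, D(\phi, \phi_u) \right\}, \tag{6}$$
where $D(\phi, \phi_u)$ is the Kullback-Leibler divergence between $\phi$ and $\phi_u$, and $\kappa > 0$ is the regularization parameter. The mapping $\phi \mapsto D(\phi, \phi_u)$ is convex, thus the regularized cut generation Problem (6) is an infinite-dimensional convex optimization problem. We can expect that if the regularization parameter $\kappa$ is small enough, the solution of the regularized Problem (6) provides useful information to solve our cut generation Problem (2).
We will show that the regularized cut generation Problem (6) is well defined. In particular, we show that the maximizer (which depends on $x$ and $\kappa$)
$$\phi_{\kappa}(\cdot \, ; x) \in \arg\max_{\phi \in \mathcal{P}(\Delta)} \left\{ \mathbb{E}_{\phi}[g(x, \delta)] - \kappa \, D(\phi, \phi_u) \right\}$$
is attained and is given in closed form. The next lemma is based on calculus of variations.
Lemma 4.7.
For any $x \in X$ and $\kappa > 0$, the maximizer of Problem (6) is attained, and it is
$$\phi_{\kappa}(\delta; x) = \frac{\exp\left( g(x, \delta)/\kappa \right)}{\int_{\Delta} \exp\left( g(x, \delta')/\kappa \right) d\delta'}, \quad \delta \in \Delta.$$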
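The closed form can be sanity-checked numerically on a discretized index set. The sketch below assumes that the regularized problem maximizes the expected constraint value minus a regularization parameter times the Kullback-Leibler divergence from the uniform distribution, in which case the maximizer is the Gibbs distribution proportional to $\exp(g/\kappa)$. The grid, the toy constraint function, the value $\kappa = 0.05$, and all function names are assumptions made for illustration only.

```python
import numpy as np


def regularized_objective(phi, g_values, kappa):
    """E_phi[g] - kappa * KL(phi || uniform) for a probability vector phi on a grid."""
    n = phi.size
    with np.errstate(divide="ignore", invalid="ignore"):
        kl_terms = np.where(phi > 0.0, phi * np.log(phi * n), 0.0)
    return float(phi @ g_values - kappa * kl_terms.sum())


def gibbs_distribution(g_values, kappa):
    """Probability vector proportional to exp(g / kappa), computed stably."""
    weights = np.exp((g_values - g_values.max()) / kappa)
    return weights / weights.sum()


rng = np.random.default_rng(2)
grid = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)   # discretized index set
x = np.array([1.0, 1.0])
g_values = np.cos(grid) * x[0] + np.sin(grid) * x[1] - 1.0  # toy constraint values
kappa = 0.05

phi_star = gibbs_distribution(g_values, kappa)
best_value = regularized_objective(phi_star, g_values, kappa)

# Optimal value in closed form: kappa * log of the uniform average of exp(g / kappa).
shifted = (g_values - g_values.max()) / kappa
closed_form = g_values.max() + kappa * np.log(np.mean(np.exp(shifted)))

# No randomly drawn distribution should beat the Gibbs distribution.
random_values = [
    regularized_objective(rng.dirichlet(np.ones(grid.size)), g_values, kappa)
    for _ in range(200)
]
print(best_value, closed_form, max(random_values))
```

On a grid with $n$ points, the regularized optimal value lies between $\max g - \kappa \log n$ and $\max g$, which illustrates why a small regularization parameter keeps the regularized problem close to the original cut generation problem.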
Since is full dimensional, we may let be the radius of the largest ball which can be included in . Specifically, there exists such that the Euclidean ball . Define
[TABLE]
to be the ratio between the volume of the largest such ball and the volume of (necessarily ). The following result demonstrates that the gap between the cut generation Problem (2) and its regularization (6) can be made arbitrarily small through our control of . Let denote the Euclidean diameter of and define . We also define
[TABLE]
Proposition 4.8.
Suppose Assumption 4.5 holds and choose . For any ,
[TABLE]
and
[TABLE]
From (7), we see that the solution of the regularized cut generation Problem (6) provides a solution of the inequality .
The adaptive constraint sampling scheme works as follows. Suppose we are given tolerances with for all . At iteration , we sample from the probability density . Let (with ) be i.i.d. samples from generated according to , and again define to be a maximizer of .
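The adaptive sampling step can be illustrated with a small sketch. It assumes a one-dimensional index set $[0, 2\pi)$ that is discretized on a grid, a toy constraint function, and a sampling density proportional to $\exp(g(x, \delta)/\kappa)$ with an arbitrarily chosen $\kappa = 0.05$; none of these specifics come from the paper, and the grid-based sampler is only one simple way to draw from such a density.

```python
import numpy as np


def g(x, delta):
    """Toy constraint function; its exact maximum over delta is ||x|| - 1."""
    return np.cos(delta) * x[0] + np.sin(delta) * x[1] - 1.0


def adaptive_samples(x, kappa, num_samples, rng, grid_size=2000):
    """Draw samples from a grid approximation of the density ~ exp(g(x, .) / kappa)."""
    grid = np.linspace(0.0, 2.0 * np.pi, grid_size, endpoint=False)
    values = g(x, grid)
    weights = np.exp((values - values.max()) / kappa)   # shifted for stability
    probabilities = weights / weights.sum()
    return rng.choice(grid, size=num_samples, p=probabilities)


rng = np.random.default_rng(3)
x = np.array([1.0, 1.0])
exact_value = np.linalg.norm(x) - 1.0

# Average cut generation error of a single sample under each scheme.
adaptive_draws = adaptive_samples(x, kappa=0.05, num_samples=5000, rng=rng)
uniform_draws = rng.uniform(0.0, 2.0 * np.pi, size=5000)
adaptive_gap = exact_value - g(x, adaptive_draws).mean()
uniform_gap = exact_value - g(x, uniform_draws).mean()
print(f"adaptive: {adaptive_gap:.4f}, uniform: {uniform_gap:.4f}")
```

In this toy setting a single adaptive sample is already close to a maximizer on average, whereas a single uniform sample is not, which is the qualitative behavior the adaptive scheme is designed to achieve.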
Proposition 4.9.
Suppose Assumption 4.5 holds. Given and , for any , let be i.i.d. samples from , then .
Now we consider the convergence rate of the adaptive sampling scheme. We define the distribution
[TABLE]
of the samples $\big\{ \{\delta_k^{(i)}\}_{i=1}^{M_k} \big\}_{k \in \mathcal{B}}$ on the space $\prod_{k \in \mathcal{B}} \Delta^{M_k}$.
Theorem 4.10.
Suppose Assumption 4.5 holds. Suppose is generated according to Algorithm 1 with policy (4). For each and , we generate i.i.d. samples according to . Then, for any ,
[TABLE]
and
[TABLE]
As for the fixed sampling scheme, we find an improved convergence rate for the strongly convex case under the adaptive sampling scheme as well.
Theorem 4.11.
Suppose Assumptions 3.4 and 4.5 hold. Suppose is generated according to Algorithm 1 with policy (5). For each and , we generate i.i.d. samples according to . Then, for any ,
[TABLE]
and
[TABLE]
In view of Theorem 4.11, inexact CSA with adaptive sampling achieves an $\mathcal{O}(1/N)$ rate of convergence in expectation, in terms of the optimality gap and constraint violation, in the strongly convex case.
Remark 4.12.
Through Proposition 4.2 and Proposition 4.9, we see two major differences between the fixed sampling and adaptive sampling schemes. First, the fixed sampling scheme requires batch samples, while only one sample per iteration is needed to make the adaptive sampling scheme work due to the inequality . Of course, we get better performance if we use batch sampling under the adaptive sampling since we always have for each . Second, depends on the error tolerances and the current iterates under the adaptive sampling, while does not under the fixed sampling scheme. There is a trade-off between the two sampling schemes. Under the fixed sampling scheme, we do not need to change the sampling distribution iteration by iteration, but it requires batch samples to achieve a desired cut generation tolerance. Under the adaptive sampling scheme, we need to generate different sampling distributions at different iterations, but the required number of samples is much smaller.
5 Proofs of Main Results
In this section, we provide the proofs for our main results. In Subsection 5.1, we establish the general error bounds for inexact CSA (Theorem 3.2 for the generally convex case and Theorem 3.6 for the strongly convex case). The details of the fixed sampling cut generation scheme (Proposition 4.2) and the corresponding CSA convergence results (Theorems 4.3 and 4.4) are in Subsection 5.2. All material for the adaptive sampling cut generation scheme (Proposition 4.9) and the corresponding CSA convergence results (Theorems 4.10 and 4.11) is in Subsection 5.3.
5.1 General Error Bounds Analysis for Inexact CSA
5.1.1 General Convex Case
The following preliminary result establishes an important recursion for CSA.
Proposition 5.1.
For stepsizes , tolerances , and in Algorithm 1, we have
[TABLE]
for all .
Proof.
For any , using Lemma 2.4, we have
[TABLE]
Observe that if , then and . Moreover, if , then and
[TABLE]
Summing up the inequalities in (9) from to and using the previous two observations, we obtain
[TABLE]
∎
We next present a sufficient condition for the output to be well-defined.
Lemma 5.2.
Suppose
[TABLE]
holds. Then , i.e., is well-defined. Furthermore, either (i) or (ii) .
Proof.
Fixing in (8) gives
[TABLE]
If , then since we have
[TABLE]
Suppose that , i.e., . Then,
[TABLE]
which contradicts (11). Thus, condition (i) holds. Alternatively, if , then condition (ii) holds. ∎
We can now prove Lemma 3.1 based on the above result.
Proof of Lemma 3.1.
For , , and chosen as in (4), for we have
[TABLE]
and for we have
[TABLE]
By Lemma 5.2, and is well-defined. ∎
The main convergence properties of Algorithm 1 are established next in Proposition 5.3.
Proposition 5.3.
Suppose that and are chosen such that (10) holds, and let be produced by Algorithm 1. Then, for any we have
[TABLE]
and
[TABLE]
Proof.
First, we show that (13) holds. By Lemma 5.2, if , then by the convexity of and the definition of , we have
[TABLE]
If , we have . By fixing in (8), it follows from the definition of and the convexity of that
[TABLE]
Noticing , it follows that
[TABLE]
We then have (13) by the two inequalities (15) and (16). Next we prove (14). For any , we have and so . From the definition of and the convexity of , we then have
[TABLE]
∎
Now we may prove the error bounds for inexact CSA for the generally convex case.
Proof of Theorem 3.2.
The bound on the optimality gap comes from (13). Recall (12), i.e. we have . From , , and , we obtain . It then follows from (13) that
[TABLE]
The bound on constraint violation is by (14). For any , we have and . It then follows from (14) that
[TABLE]
∎
5.1.2 Strongly Convex Case
Now we consider the general error bounds for the strongly convex case (Theorem 3.6). The following lemma will be used in subsequent results; its proof is straightforward, so we omit the details. We remind the reader that is defined in Section 3.
Lemma 5.4.
For all , let and . If sequences and satisfy for all , then for any we have
[TABLE]
We remind the reader that in the next result is originally defined in Section 3.
Proposition 5.5.
Suppose Assumption 3.4 holds. Choose stepsizes , tolerances , and . Let be produced according to Algorithm 1, then
[TABLE]
for all .
Proof.
Consider an iteration . If , then by Lemma 2.4 and Assumption 3.4, we have
[TABLE]
Similarly, for , by Lemma 2.4 and Assumption 3.4, we have
[TABLE]
Invoking Lemma 5.4, we then obtain
[TABLE]
Rearranging the terms in the above inequality and recalling the definition of , we arrive at (17). ∎
The following result provides a sufficient condition for to be well-defined.
Lemma 5.6.
Suppose Assumption 3.4 and the condition
[TABLE]
hold. Then, , i.e., is well-defined. Furthermore, either (i) or (ii) holds.
Proof.
By fixing in (17), we obtain
[TABLE]
If , noticing , we have
[TABLE]
Suppose that , i.e., . Then, by assumption we have
[TABLE]
which contradicts (19). Thus, condition (i) holds. Alternatively, if then condition (ii) holds. ∎
Based on the above lemma, we may now prove Lemma 3.5.
Proof of Lemma 3.5.
From the selections of , , and in (5), we have for , , , and
[TABLE]
Specifically, , and
[TABLE]
which implies that
[TABLE]
Also, we have
[TABLE]
By Lemma 5.6, we have , i.e., is well-defined. ∎
Before we prove Theorem 3.6, we establish the main convergence properties of Algorithm 1 in the following proposition.
Proposition 5.7.
Suppose Assumption 3.4 holds, and suppose that and are chosen such that (18) holds. Let be generated according to Algorithm 1. Then, for any we have
[TABLE]
and
[TABLE]
Proof.
We first show that (21) holds. By Lemma 5.6, we have two cases. If holds, using the convexity of and the definition of , we obtain which implies (21). If , then we have . Take in (17), from Assumptions A2 and A4, the definition of , and the fact that , we have
[TABLE]
Noticing , it follows that (21) holds.
Next we prove (22). For any , we have by definition. Then, for any we must have . From the definition of , and the convexity of , we obtain
[TABLE]
∎
We now have the machinery in place to prove the error bound for inexact CSA in the strongly convex case.
Proof of Theorem 3.6.
We bound the optimality gap by (21) as follows. Recall (20), we have
[TABLE]
Further, we have . It then follows from (21) that
[TABLE]
Next, we bound the constraint violation by (22). Noticing that is a constant, it immediately follows from (22) that
[TABLE]
∎
5.2 CSA with Fixed Sampling
In this subsection we develop the proofs for our fixed sampling scheme. At each iteration , is fixed, and we face the cut generation Problem (2) which can be written in epigraph form (where the index is omitted):
[TABLE]
We repeat the definition of uniform level-set bound (ULB) from [12] as follows.
Definition 5.8.
[12, Definition 3.1] For fixed , the tail probability of the worst-case violation is the function defined by . We call a uniform level-set bound (ULB) of if for all , .
Let be i.i.d. samples generated according to a probability distribution . The sampled problem derived from Problem (23) is
[TABLE]
which is equivalent to .
Let be the unique solution of Problem (24). This optimal solution is a random variable that depends on the samples . As a direct application of Theorem 3.6 in [12], we have the following key result.
Proposition 5.9.
Consider the Problems (23) and (24) for fixed with the associated optimal values and , respectively. Given a ULB and , in , for all , we have .
From Proposition 5.9, we see that for fixed the gap between and is effectively quantified by a ULB . To control the behavior of as , we require more structure on the probability distribution on , which is imposed in Assumption 4.1. The next result is based on Assumption 4.1.
Proposition 5.10.
[12, Proposition 3.8] Under Assumption 4.1, the function is a ULB, where is the inverse of .
From Propositions 5.9 and 5.10, we obtain the following bound in probability.
Proposition 5.11.
Suppose Assumption 4.1 holds. Given and , for i.i.d. samples from , we have .
Now we can estimate the empirical constraint violation for the fixed sampling scheme.
Proof of Proposition 4.2.
From Proposition 5.11, we have
[TABLE]
Therefore, we have
[TABLE]
∎
Next, we give the proof for Theorem 4.3 (for the generally convex case under the fixed sampling scheme). The proof uses Proposition 4.2 to control the error terms in our general inexact CSA analysis.
Proof of Theorem 4.3.
From Proposition 4.2, we have for all . For with , we have . Moreover, . Thus, from independence of samples, we have
[TABLE]
Subsequently, Theorem 3.2 gives . ∎
The proof of Theorem 4.4 (for the strongly convex case) is as follows.
Proof of Theorem 4.4.
From Proposition 4.2, we have . It follows that
[TABLE]
Therefore, from Theorem 3.6, we arrive at the inequality . ∎
5.3 CSA with Adaptive Sampling
This subsection considers the adaptive sampling scheme. First, we need to prove two prerequisite Lemmas 4.6 and 4.7. Lemma 4.6 establishes an equivalence between the nonlinear finite-dimensional optimization problem and an infinite-dimensional linear optimization problem .
Proof of Lemma 4.6.
The existence of a maximizer can be guaranteed by Assumptions A3 and A5. On one hand, for any ,
[TABLE]
Since is arbitrary, we have . On the other hand, we can put all mass of on , i.e., the Dirac measure , thus , which implies . ∎
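The two directions of this argument can be checked numerically on a discretized index set. The following is a minimal sketch under stated assumptions: the constraint function `g`, the grid standing in for the index set, and the iterate `x` are hypothetical choices made only for illustration and do not come from the paper.

```python
import random

def g(x, delta):
    # Hypothetical constraint function (assumption, not the paper's g).
    return x * delta - delta ** 2

def expectation(x, grid, weights):
    # Integral of g(x, .) against a discrete probability measure on the grid.
    return sum(w * g(x, d) for w, d in zip(weights, grid))

def random_simplex_point(n, rng):
    # Draw an arbitrary probability vector of length n.
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

rng = random.Random(0)
grid = [i / 50 for i in range(51)]  # discretization of [0, 1]
x = 0.8

# Nonlinear finite-dimensional problem: maximize g(x, delta) over the grid.
pointwise_max = max(g(x, d) for d in grid)

# Direction 1: no probability measure yields an expectation above the maximum.
best_random = max(
    expectation(x, grid, random_simplex_point(len(grid), rng))
    for _ in range(1000)
)

# Direction 2: a Dirac measure at a maximizer attains the maximum.
star = max(range(len(grid)), key=lambda i: g(x, grid[i]))
dirac = [1.0 if i == star else 0.0 for i in range(len(grid))]
dirac_value = expectation(x, grid, dirac)
```

Here `best_random` never exceeds `pointwise_max`, while `dirac_value` matches it, mirroring the two inequalities in the proof.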
Lemma 4.7 justifies the existence of a solution of the regularized cut generation Problem (6), and provides a closed form expression.
Proof of Lemma 4.7.
By Theorem 15.11 in [1], is compact in the weak-star topology since is compact. Further, by Assumption A5, the mapping is continuous with respect to the weak-star topology in . By Theorem 5.27 in [14], the mapping is lower semi-continuous with respect to the weak-star topology in , and so is upper semi-continuous in with respect to the weak-star topology. Therefore, the maximizer of is attained in .
Let denote the space of non-negative measures on . We note that the regularized cut generation Problem (6) is a constrained calculus of variations problem:
[TABLE]
By using Euler’s equation in the calculus of variations (see Section 7.5 in [36]), we obtain after simplification,
[TABLE]
where and is the Lagrange multiplier of the constraint . From (25) and the constraint , we obtain the expression
[TABLE]
∎
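A discrete analogue of this closed form can be checked numerically. The sketch below assumes a finite grid in place of the index set, a uniform reference distribution in the Kullback-Leibler term, and a hypothetical constraint function `g`; these are illustrative assumptions, not the paper's exact setting. Under these assumptions the maximizer of the regularized objective is the Gibbs (softmax) distribution with weights proportional to $\exp(g/\kappa)$.

```python
import math
import random

def g(x, delta):
    # Hypothetical constraint function (assumption, not the paper's g).
    return x * delta - delta ** 2

def gibbs_weights(values, kappa):
    # Weights proportional to exp(value / kappa); subtract the max for stability.
    shift = max(values)
    unnorm = [math.exp((v - shift) / kappa) for v in values]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def regularized_objective(weights, values, kappa):
    # Linear term minus kappa times KL(weights || uniform) on the grid.
    n = len(values)
    linear = sum(w * v for w, v in zip(weights, values))
    kl = sum(w * math.log(w * n) for w in weights if w > 0.0)
    return linear - kappa * kl

rng = random.Random(0)
grid = [i / 50 for i in range(51)]
x, kappa = 0.8, 0.05
values = [g(x, d) for d in grid]

p_star = gibbs_weights(values, kappa)
best_value = regularized_objective(p_star, values, kappa)

# Optimal value in closed form: kappa * log of the uniform average of exp(g / kappa).
shift = max(values)
closed_form = shift + kappa * math.log(
    sum(math.exp((v - shift) / kappa) for v in values) / len(values)
)

# Compare against arbitrary competing distributions on the same grid.
competitors = []
for _ in range(200):
    raw = [rng.random() for _ in grid]
    total = sum(raw)
    competitors.append(regularized_objective([r / total for r in raw], values, kappa))
```

In this discrete setting, every entry of `competitors` is bounded above by `best_value`, which agrees with `closed_form` up to rounding.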
The following lemma is an intermediate result that relies on Assumption 4.5, which requires to be full dimensional and convex. It is used in the proof of Proposition 4.8, which paves the way for the cut generation result for the adaptive sampling scheme. Recall that is the gamma function.
Lemma 5.12.
Suppose Assumption 4.5 holds. For any and we have
[TABLE]
Proof.
First, we have
[TABLE]
where , and the last inequality follows since $\max_{\delta\in\Delta}g(x,\delta)-g(x,\delta)=g(x,\delta^{\ast}(x))-g(x,\delta)\leq L_{g,\Delta}\left\|\delta^{\ast}(x)-\delta\right\|$ due to Assumption A5. It is then sufficient to show
[TABLE]
Let . Since is convex by Assumption 4.5, we deduce
[TABLE]
which implies that, for any there exists such that . Then, for any we have
[TABLE]
Therefore,
[TABLE]
where the first inequality is by (28) and the second is by (29), and the equality follows since is the volume of the Euclidean ball with radius in . ∎
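For reference, the volume of a Euclidean ball of radius $r$ in $\mathbb{R}^d$ is $\pi^{d/2} r^d / \Gamma(d/2+1)$, which is where the gamma function enters. A minimal sketch of this standard formula follows; the function name `ball_volume` is a hypothetical helper introduced only for illustration.

```python
import math

def ball_volume(dim, radius):
    # Volume of the Euclidean ball of the given radius in R^dim:
    # pi^(dim/2) * radius^dim / Gamma(dim/2 + 1).
    return math.pi ** (dim / 2) * radius ** dim / math.gamma(dim / 2 + 1)

# Sanity checks against familiar low-dimensional cases.
interval_length = ball_volume(1, 0.5)   # interval [-0.5, 0.5]
disc_area = ball_volume(2, 1.0)         # unit disc
sphere_volume = ball_volume(3, 2.0)     # ball of radius 2 in R^3
```

The three values correspond to the length of an interval, the area of a disc, and the volume of a three-dimensional ball, respectively.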
Now we are in a position to establish Proposition 4.8.
Proof of Proposition 4.8.
Substituting (26) into , we obtain after simplification,
[TABLE]
Applying (27) to bound the term on the right-hand side of (30), we obtain
[TABLE]
Since , we have
[TABLE]
where the first inequality holds since , the second holds because , and the last one follows from . Therefore, we have . Moreover, since solves the regularized cut generation Problem (6), and since the regularization parameter and the Kullback-Leibler divergence are non-negative, we arrive at the conclusion. ∎
The bound for cut generation under the adaptive sampling scheme (Proposition 4.9) is an immediate consequence of Proposition 4.8.
Proof of Proposition 4.9.
Since are i.i.d. samples from probability density , we have from Proposition 4.8, , for . Therefore, as long as , we have . ∎
We now prove our main result Theorem 4.10 (for the generally convex case) under the adaptive sampling scheme. We need to use Proposition 4.9 to control the error terms.
Proof of Theorem 4.10.
From Proposition 4.9, we have . Furthermore, for , and by independence of samples we have
[TABLE]
Subsequently, Theorem 3.2 gives . ∎
The proof of Theorem 4.11 (for the strongly convex case) under the adaptive sampling scheme is as follows.
Proof of Theorem 4.11.
From Proposition 4.9, we have . It follows that
[TABLE]
Therefore, applying Theorem 3.6 gives . ∎
6 Numerical Experiments
This section applies our methods to a simple test problem adapted from [5] to illustrate the theory developed in this paper. Let for all denote uncertain parameters such that . We want to solve the following optimization problem:
[TABLE]
where
[TABLE]
We compare Algorithm 1 with the fixed constraint sampling and the adaptive constraint sampling schemes, respectively. The parameters in policy (4) are inherently conservative. In this experiment, we adjust the parameters and by multiplying them with scaling parameters and , respectively. These scaling parameters are chosen by doing pilot runs (see [28]). Under fixed constraint sampling, we set to be constant in all iterations, and we consider . Under adaptive constraint sampling, we generate the probability distribution by the Metropolis-Hastings (MH) algorithm (see e.g. [46]), where we run the MH algorithm for 200 iterations and then take one sample to solve the cut generation problem.
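The adaptive sampling step described above (run MH for 200 iterations, then keep one sample) can be sketched as follows. This is a minimal illustration under stated assumptions: the target density is taken proportional to $\exp(g(x,\delta)/\kappa)$ on a box, the proposal is a Gaussian random walk, and the constraint function `g`, the step size, the box bounds, and the value of `kappa` are all hypothetical choices rather than the settings used in the experiment.

```python
import math
import random

def g(x, delta):
    # Hypothetical constraint function (assumption, not the test problem's g).
    return sum(xi * di for xi, di in zip(x, delta)) - sum(di * di for di in delta)

def mh_sample(x, kappa, lower, upper, n_iters=200, step=0.2, rng=None):
    # Random-walk Metropolis-Hastings targeting a density proportional to
    # exp(g(x, delta) / kappa) on the box [lower, upper]; returns the final state.
    rng = rng or random.Random()
    current = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
    current_log = g(x, current) / kappa
    for _ in range(n_iters):
        proposal = [c + rng.gauss(0.0, step) for c in current]
        inside = all(lo <= p <= hi for p, lo, hi in zip(proposal, lower, upper))
        if not inside:
            continue  # target density is zero outside the box, so reject
        proposal_log = g(x, proposal) / kappa
        # Symmetric proposal: accept with probability min(1, target ratio).
        if rng.random() < math.exp(min(0.0, proposal_log - current_log)):
            current, current_log = proposal, proposal_log
    return current

# One sample per outer iteration, as in the adaptive scheme.
x = (0.8, 0.4)
delta_sample = mh_sample(x, kappa=0.05, lower=(0.0, 0.0), upper=(1.0, 1.0),
                         rng=random.Random(0))
cut_value = g(x, delta_sample)
```

A smaller `kappa` concentrates the target around the maximizers of `g(x, .)`, at the cost of slower mixing of the chain, which is one reason the number of MH iterations matters in practice.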
Table 1 reports the results. Even though we generate only one sample in each iteration under the adaptive sampling scheme, the objective value achieved is , which is close to the true optimal value . Figure 1(a) illustrates the convergence of the algorithms and Figure 1(b) shows the constraint violation under different sampling schemes. In particular, under the fixed sampling scheme, the constraint violation decreases as the sample size increases. Note that we scale the parameters and in policy (4) in the experiment, which may cause Lemma 3.1 to fail. We see from Figure 2, with the parameter adjustment, that and is at least linearly increasing in , so that our theoretical analysis is still valid in this case (which depends on this property of ).
We generate the probability distribution by the MH algorithm, and perform sensitivity analysis on the number of iterations of MH. We provide the associated objective values by fixing . We can see from Figure 3 that the adaptive sampling scheme achieves a high-performance solution (with relative gap smaller than ) when the MH algorithm runs for 200 iterations.
From these experiments, we observe the inherent trade-off between the two sampling schemes. Under fixed sampling, a single sampling distribution is used across all iterations, but we need to generate a batch of samples at each iteration to achieve good performance. In contrast, under adaptive sampling, generating each sample takes extra effort, but only one sample is needed at each iteration.
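The batch-size side of this trade-off can be illustrated with a small sketch of fixed constraint sampling, in which the cut generation problem is solved approximately by taking the best of a batch of samples. The constraint function `g`, the uniform sampling distribution on $[0,1]$, and the batch sizes below are hypothetical choices for illustration only. The batches are nested prefixes of one pool, so the cut generation gap is non-increasing in the batch size by construction.

```python
import random

def g(x, delta):
    # Hypothetical constraint function (assumption, not the test problem's g).
    return x * delta - delta ** 2

def sampled_cut(x, samples):
    # Approximate cut generation: best sampled index and its constraint value.
    best = max(samples, key=lambda d: g(x, d))
    return best, g(x, best)

rng = random.Random(1)
x = 0.8
true_max = g(x, x / 2)  # exact maximizer of this particular g over [0, 1]

# Fixed sampling distribution: uniform on [0, 1], drawn once and reused as prefixes.
pool = [rng.random() for _ in range(1000)]
batch_sizes = (1, 10, 100, 1000)
gaps = {m: true_max - sampled_cut(x, pool[:m])[1] for m in batch_sizes}
```

The dictionary `gaps` records how far the sampled cut falls short of the exact maximum for each batch size.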
7 Conclusion
In this work, we combine CSA (as originally developed in [29]) with inexact cut generation to solve SIPs. Since the cut generation problem is typically intractable, we emphasize random constraint sampling to approximately solve this problem. In our first approach, we rely on a fixed constraint sampling distribution. Our second approach adaptively updates the constraint sampling distribution, based on the current iterate. The major advantage of adaptive over fixed sampling is that, theoretically, it only requires one sample at each iteration.
As our main contribution, we provide general error bounds (in terms of the error in solving each cut generation problem) for inexact CSA. We show that both our sampling schemes achieve an $\mathcal{O}(1/\sqrt{N})$ rate of convergence in expectation, in terms of both optimality gap and constraint violation, when the objective and constraint functions are generally convex. We also improve this rate to $\mathcal{O}(1/N)$ in the strongly convex case.
- [1] CD Aliprantis and KC Border. Infinite Dimensional Analysis: A Hitchhiker’s Guide. 2006.
- [2] Bruno Betrò. An accelerated central cutting plane algorithm for linear semi-infinite programming. Mathematical Programming, 101(3):479–495, 2004.
- [3] Nikhil Bhat, Vivek Farias, and Ciamac C Moallemi. Non-parametric approximate dynamic programming via the kernel method. In Advances in Neural Information Processing Systems, pages 386–394, 2012.
- [4] J Frédéric Bonnans and Alexander Shapiro. Perturbation Analysis of Optimization Problems. Springer Science & Business Media, 2013.
- [5] Giuseppe Calafiore and M.C. Campi. Uncertain convex programs: randomized solutions and confidence levels. Mathematical Programming Series A, 102:25–46, 2005.
- [6] Marco C Campi and Simone Garatti. The exact feasibility of randomized solutions of uncertain convex programs. SIAM Journal on Optimization, 19(3):1211–1230, 2008.
- [7] Darinka Dentcheva and Andrzej Ruszczyński. Optimization with stochastic dominance constraints. SIAM Journal on Optimization, 14(2):548–566, 2003.
- [8] Darinka Dentcheva and Andrzej Ruszczyński. Optimality and duality theory for stochastic optimization problems with nonlinear dominance constraints. Mathematical Programming, 99(2):329–350, 2004.
