Corporative Stochastic Approximation with Random Constraint Sampling for Semi-Infinite Programming

Bo Wei; William B. Haskell; Sixiang Zhao

arXiv:1812.09017·math.OC·December 24, 2018

TL;DR

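This paper introduces a CSA-type stochastic approximation algorithm for semi-infinite programming in which the cut generation problem is solved inexactly by random constraint sampling, and shows that it attains $\mathcal{O}(1/\sqrt{N})$ and $\mathcal{O}(1/N)$ convergence rates in the convex and strongly convex cases, respectively.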

Contribution

It proposes a novel CSA algorithm with random constraint sampling schemes and provides convergence guarantees for convex and strongly convex cases.

Findings

01

Achieves an $\mathcal{O}(1/\sqrt{N})$ convergence rate for convex functions.

02

Improves to an $\mathcal{O}(1/N)$ rate for strongly convex functions.

03

Provides error bounds for inexact CSA in semi-infinite programming.

Abstract

We developed a corporative stochastic approximation (CSA) type algorithm for semi-infinite programming (SIP), where the cut generation problem is solved inexactly. First, we provide general error bounds for inexact CSA. Then, we propose two specific random constraint sampling schemes to approximately solve the cut generation problem. When the objective and constraint functions are generally convex, we show that our randomized CSA algorithms achieve an $\mathcal{O}(1/\sqrt{N})$ rate of convergence in expectation (in terms of optimality gap as well as SIP constraint violation). When the objective and constraint functions are all strongly convex, this rate can be improved to $\mathcal{O}(1/N)$.

Tables

Table 1: Simulation results with $10^{3}$ iterations, $c_{g}=0.35$, and $c_{e}=0.001$.

| | Adaptive sampling | Fixed sampling, $M_{k}=10$ | Fixed sampling, $M_{k}=20$ | Fixed sampling, $M_{k}=50$ | Fixed sampling, $M_{k}=100$ | Optimal value |
|---|---|---|---|---|---|---|
| Objective values | $-1.560$ | $-1.621$ | $-1.595$ | $-1.575$ | $-1.566$ | $-1.559$ |
| Relative gaps | $-0.1\%$ | $-4.0\%$ | $-2.3\%$ | $-1.0\%$ | $-0.5\%$ | - |

Equations

$$D(\phi,\varphi):=\mathbb{E}_{\delta\sim\phi}\left[\log\left(\frac{\phi(\delta)}{\varphi(\delta)}\right)\right]=\int_{\Delta}\log\left(\frac{\phi(\delta)}{\varphi(\delta)}\right)\phi(d\delta)$$

$$\min_{x\in\mathcal{X}}\Big\{f(x):G(x):=\max_{\delta\in\Delta}g(x,\delta)\leq 0\Big\}.$$

$$f(x)\geq f(z)+\langle f^{\prime}(z),x-z\rangle+\frac{\alpha}{2}\|x-z\|^{2},\quad\forall x,z\in\mathcal{X}.$$

$$V(P_{x,\mathcal{X}}(y),u)\leq V(x,u)+\langle y,u-x\rangle+\frac{1}{2}\|y\|^{2}.$$

$$\max_{\delta\in\Delta}g(x_{k},\delta)$$

$$\delta_{k}\approx\arg\max_{\delta\in\Delta}g(x_{k},\delta),$$

$$I:=\{s,\ldots,N\}$$

$$\mathcal{B}:=\{s\leq k\leq N\mid g(x_{k},\delta_{k})\leq\eta_{k}\}\quad\text{and}\quad\mathcal{N}:=I\setminus\mathcal{B}.$$

$$\overline{x}_{N,s}:=\frac{\sum_{k\in\mathcal{B}}\gamma_{k}x_{k}}{\sum_{k\in\mathcal{B}}\gamma_{k}}$$

$$h_{k}=\begin{cases}f^{\prime}(x_{k}),&\text{if }g(x_{k},\delta_{k})\leq\eta_{k},\\ g^{\prime}(x_{k},\delta_{k}),&\text{otherwise}.\end{cases}$$

$$x_{k+1}=P_{x_{k},\mathcal{X}}(\gamma_{k}h_{k}).$$

$$\varepsilon_{k}:=G(x_{k})-g(x_{k},\delta_{k}),\quad\forall k\geq 1.$$

$$\eta_{k}=\frac{6(L_{f}+L_{g,\mathcal{X}})D_{\mathcal{X}}}{\sqrt{k}},\quad\gamma_{k}=\frac{D_{\mathcal{X}}}{\sqrt{k}(L_{f}+L_{g,\mathcal{X}})},\quad k=1,2,\ldots,N,\quad s=\left\lceil\frac{N}{2}\right\rceil,$$

$$f(\overline{x}_{N,s})-f(x^{*})\leq\frac{6D_{\mathcal{X}}(L_{f}+L_{g,\mathcal{X}})}{\sqrt{N}},$$

$$G(\overline{x}_{N,s})\leq\frac{12D_{\mathcal{X}}(L_{f}+L_{g,\mathcal{X}})}{\sqrt{N}}+\frac{\sum_{k\in\mathcal{B}}\varepsilon_{k}/\sqrt{k}}{\sum_{k\in\mathcal{B}}1/\sqrt{k}}.$$

$$a_{k}=\begin{cases}\frac{\mu_{f}\gamma_{k}}{L},&\text{if }g(x_{k},\delta_{k})\leq\eta_{k},\\ \frac{\mu_{g}\gamma_{k}}{L},&\text{otherwise},\end{cases}\quad A_{k}=\begin{cases}1,&k=1,\\ (1-a_{k})A_{k-1},&2\leq k\leq N,\end{cases}\quad\text{and}\quad\rho_{k}=\frac{\gamma_{k}}{A_{k}}.$$

$$\overline{x}_{N,s}=\frac{\sum_{k\in\mathcal{B}}\rho_{k}x_{k}}{\sum_{k\in\mathcal{B}}\rho_{k}}.$$

$$\eta_{k}=\frac{8L}{N}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\},\quad\gamma_{k}=\begin{cases}\frac{2L}{\mu_{f}(k+1)},&\text{if }g(x_{k},\delta_{k})\leq\eta_{k},\\ \frac{2L}{\mu_{g}(k+1)},&\text{otherwise},\end{cases}\quad s=1.$$

$$f(\overline{x}_{N,s})-f(x^{*})\leq\frac{8L}{N+1}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\},$$

$$G(\overline{x}_{N,s})\leq\frac{8L}{N}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\}+\frac{\sum_{k\in\mathcal{B}}k\varepsilon_{k}}{\sum_{k\in\mathcal{B}}k}.$$

$$\delta_{k}\in\arg\max_{i=1,\ldots,M_{k}}g(x_{k},\delta_{k}^{(i)})$$

$$M(\varepsilon,\beta):=\left\lceil\frac{\ln\beta}{\ln(1-\varepsilon)}\right\rceil,$$

$$\underline{M}\leq\min_{x\in\mathcal{X},\delta\in\Delta}g(x,\delta)\leq\max_{x\in\mathcal{X},\delta\in\Delta}g(x,\delta)\leq\overline{M}.$$

$$\mathcal{Q}=\mathcal{Q}(\{M_{k}\}_{k\in\mathcal{B}}):=\times_{k\in\mathcal{B}}Q^{M_{k}}$$

$$f(\overline{x}_{N,s})-f(x^{*})\leq\frac{6D_{\mathcal{X}}(L_{f}+L_{g,\mathcal{X}})}{\sqrt{N}},$$

$$\mathbb{E}_{\mathcal{Q}}[G(\overline{x}_{N,s})]\leq\frac{14D_{\mathcal{X}}(L_{f}+L_{g,\mathcal{X}})}{\sqrt{N}}.$$

$$f(\overline{x}_{N,s})-f(x^{*})\leq\frac{8L}{N+1}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\},$$

$$\mathbb{E}_{\mathcal{Q}}[G(\overline{x}_{N,s})]\leq\frac{9L}{N}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\}.$$

$$\mathbb{E}_{\delta\sim\phi}[g(x,\delta)]\geq G(x)-\epsilon,$$

$$\max_{\phi\in\mathcal{P}(\Delta)}\mathbb{E}_{\delta\sim\phi}[g(x,\delta)]$$


Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Complexity and Algorithms in Graphs · Risk and Portfolio Optimization

Full text

Corporative Stochastic Approximation with Random Constraint Sampling

for Semi-Infinite Programming

Bo Wei, William B. Haskell, and Sixiang Zhao

Abstract

We developed a corporative stochastic approximation (CSA) type algorithm for semi-infinite programming (SIP), where the cut generation problem is solved inexactly. First, we provide general error bounds for inexact CSA. Then, we propose two specific random constraint sampling schemes to approximately solve the cut generation problem. When the objective and constraint functions are generally convex, we show that our randomized CSA algorithms achieve an $\mathcal{O}(1/\sqrt{N})$ rate of convergence in expectation (in terms of optimality gap as well as SIP constraint violation). When the objective and constraint functions are all strongly convex, this rate can be improved to $\mathcal{O}(1/N)$ .

1 Introduction

In this paper, we combine the corporative stochastic approximation (CSA) method developed in [29] with inexact cut generation for semi-infinite programming (SIP). In particular, we focus on random sampling methods to approximately solve the SIP cut generation problem. The SIP cut generation problem is usually non-linear and non-convex, so it is difficult to solve it to global optimality deterministically. We propose two specific random constraint sampling schemes to overcome this difficulty, and the resulting randomized CSA algorithms perform well on SIP while retaining theoretically guaranteed convergence rates.

1.1 Previous work

We refer the reader to [4, 17, 22, 35, 47] for recent detailed overviews of SIP. The main computational difficulty in SIP comes from the infinitely many constraints, and several practical schemes have been proposed to remedy this difficulty [15, 16, 35, 45]. We offer the following very rough classification of SIP methods based on [22, 35, 45].

Exchange methods: In exchange methods, in each iteration a set of new constraints is exchanged for the previous set (there are many ways to do this). Cutting plane methods are a special case where constraints are never dropped. The algorithm in [19] is the prototype for several SIP cutting plane schemes, and it has been improved in various ways [2, 27, 37]. In particular, a new exchange method is proposed in [49] that only keeps those active constraints with positive Lagrange multipliers. New constraints are selected using a certain computationally-cheap criterion. In [37], the earlier central cutting plane algorithm from [27] is extended to allow for nonlinear convex cuts.

Randomized cutting plane algorithms have recently been developed for SIP in [5, 6, 12]. The idea is to input a probability distribution over the constraints, randomly sample a modest number of constraints, and then solve the resulting relaxed problem. Intuitively, as long as a sufficient number of samples of the constraints is drawn, the resulting randomized solution should violate only a small portion of the constraints and achieve near optimality.

Discretization methods: In the discretization approach, a sequence of relaxed problems with a finite number of constraints is solved according to a predefined or adaptively controlled grid generation scheme [44, 48]. Discretization methods are generally computationally expensive. The convergence rate of the error between the solution of the SIP problem and the solution of the discretized program is investigated in [48].

Local reduction methods: In the local reduction approach, an SIP problem is reduced to a problem with a finite number of constraints [18]. The reduced problem involves constraints which are defined only implicitly, and the resulting problem is solved via the Newton method which has good local convergence properties. However, local reduction methods require strong assumptions and are often conceptual.

Dual methods: A wide class of SIP algorithms is based on directly solving the KKT conditions. In [25, 33, 34], the authors derive Wolfe’s dual for an SIP and discuss numerical schemes for this problem. The KKT conditions often have some degree of smoothness, and so various Newton-type methods can be applied [30, 39, 42, 43]. However, feasibility is not guaranteed under any of these Newton-type methods. A new smoothing Newton-type method is proposed to overcome this drawback in [32].

Applications: SIP is the basis of the approximate linear programming (ALP) approach for dynamic programming. Randomly sampling state-action pairs is shown to give a tractable relaxed linear programming problem, as explored in [3, 11, 13]. In [3, 13], the sampling distribution is assumed to be the occupation measure corresponding to the optimal policy. In [31], an adaptive constraint sampling approach called ’ALP-Secant’ is developed which is based on solving a sequence of saddle-point problems. It is shown that ALP-Secant returns a near optimal ALP solution and a lower bound on the optimal cost with high probability in a finite number of iterations.

Many risk-aware optimization models also depend on SIP (e.g. [40, 41]), in particular, risk-constrained optimization (e.g. [7, 8, 9, 10, 21, 23, 24]). In [7, 8, 9, 20], a duality theory for stochastic dominance constrained optimization is developed which shows the special role of utility functions as Lagrange multipliers. Relaxations of multivariate stochastic dominance have been proposed based on various parametrized families of utility functions, see [9, 20, 23, 24]. Computational aspects of the increasing concave stochastic dominance constrained optimization are discussed in [21, 23, 24].

1.2 Contributions

We summarize our main contributions in this work as follows:

1. We give error bounds for inexact CSA (where the cut generation problem is solved inexactly). These error bounds are general, and may form the basis for the convergence analysis of many CSA-type algorithms.

2. We develop two specialized CSA algorithms where random sampling is used to approximately solve the cut generation problem. The first algorithm is based on using a fixed sampling distribution, in line with [5, 6, 12]. Intuitively, as long as a sufficiently large number of samples is drawn, the resulting randomized solution should violate only a "small portion" of the constraints. The second algorithm is based on adaptively sampling the constraints based on information from the current iterate. In particular, we compute the analytical solution of a regularized cut generation problem for the current iterate, and then use this distribution to do adaptive sampling.

3. We provide a stochastic convergence analysis for both our specialized CSA algorithms based on our general error bounds. We show that as the errors in cut generation decrease at appropriate rates, our specialized CSA algorithms achieve the same convergence rate as in the error-free case. When the objective and constraint functions are convex, both algorithms achieve an $\mathcal{O}(1/\sqrt{N})$ rate of convergence in expectation, in terms of optimality gap and constraint violation. If the objective and constraint functions are strongly convex, this rate can be improved to $\mathcal{O}(1/N)$.

This paper is organized as follows. We first provide preliminary material in Section 2. The following Section 3 describes a general inexact CSA algorithm, and then provides error bounds (in terms of the error in solving each cut generation problem). Next, in Section 4, we give the formal details for our two specialized CSA algorithms and report their convergence rates. For clearer organization, the detailed proofs of all our results are gathered together in Section 5. We then present some numerical experiments for CSA with random sampling in Section 6. Finally, we conclude the paper in Section 7 with a discussion of further issues and future research.

Notation

We make use of the following basic notation throughout the paper. For $x\in\mathbb{R}$ , the ceiling function $\lceil x\rceil$ returns the smallest integer greater than or equal to $x\in\mathbb{R}$ . The Euclidean norm and inner product on $\mathbb{R}^{n}$ are $\|x\|:=(\sum_{i=1}^{n}x_{i}^{2})^{\frac{1}{2}}$ and $\langle x,y\rangle=\sum_{i=1}^{n}x_{i}y_{i}$ , respectively. The Euclidean ball with radius $r$ centered at $x_{c}$ is $B_{r}(x_{c}):=\left\{x:\,\left\|x-x_{c}\right\|\leq r\right\}$ . For a function $f\text{ : }\mathbb{R}^{n}\rightarrow\mathbb{R}$ , we denote its subdifferential by $\partial f(x)$ and a subgradient of $f$ at $x$ by $f^{\prime}(x)\in\partial f(x)$ , respectively.

We also make use of the following further notation. For any set $\Delta\subset\mathbb{R}^{d}$, $\mathcal{P}(\Delta)$ is the space of probability distributions on $\Delta$. The Kullback-Leibler divergence is

$$D(\phi,\varphi):=\mathbb{E}_{\delta\sim\phi}\left[\log\left(\frac{\phi(\delta)}{\varphi(\delta)}\right)\right]=\int_{\Delta}\log\left(\frac{\phi(\delta)}{\varphi(\delta)}\right)\phi(d\delta)$$

for probability densities $\phi,\varphi\in\mathcal{P}(\Delta)$. For any integer $M\geq 1$, we denote the $M$-fold Cartesian product of $\Delta$ by $\Delta^{M}:=\times_{i=1}^{M}\Delta$. Finally, for any probability distribution $Q$ over the set $\Delta$, the product measure and the associated expectation on $\Delta^{M}$ are denoted by $Q^{M}$ and $\mathbb{E}_{Q^{M}}$, respectively.

2 Preliminaries

We begin our discussion of SIP with the following problem ingredients:

A1

Convex, compact decision set $\mathcal{X}\subset\mathbb{R}^{n}$ ;

A2

Convex objective function $f\text{ : }\mathcal{X}\rightarrow\mathbb{R}$ , which is Lipschitz continuous with constant $L_{f}$ ;

A3

Compact constraint index set $\Delta\subset\mathbb{R}^{d}$ ;

A4

Constraint function $g\text{ : }\mathcal{X}\times\Delta\rightarrow\mathbb{R}$ , such that for each $\delta\in\Delta$ , $x\rightarrow g(x,\delta)$ is convex and Lipschitz continuous with constant $L_{g,\mathcal{X}}$ ;

A5

For all $x\in\mathcal{X}$ , $\delta\rightarrow g(x,\delta)$ is Lipschitz continuous with constant $L_{g,\Delta}$ .

We write the constraints as a single function $G(x):=\max_{\delta\in\Delta}g(x,\delta)$ . The resulting semi-infinite programming problem is:

$$\min_{x\in\mathcal{X}}\Big\{f(x):G(x):=\max_{\delta\in\Delta}g(x,\delta)\leq 0\Big\}.\tag{1}$$

Problem (1) is a convex optimization problem under Assumptions A1, A2, and A4. Formally, we also assume that Problem (1) is solvable.

Assumption 2.1.

An optimal solution $x^{*}$ of Problem (1) exists.

To continue, we recall some fundamental concepts of convex analysis.

Definition 2.2.

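A function $f:\,\mathcal{X}\rightarrow\mathbb{R}$ is strongly convex with parameter $\alpha>0$ if, for any $z\in\mathcal{X}$ and any subgradient $f^{\prime}(z)\in\partial f(z)$, we have

$$f(x)\geq f(z)+\langle f^{\prime}(z),x-z\rangle+\frac{\alpha}{2}\|x-z\|^{2},\quad\forall x,z\in\mathcal{X}.$$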

The distance generating function and its associated prox-function are defined as follows.

Definition 2.3.

(i) A function $\omega_{\mathcal{X}}:\mathcal{X}\rightarrow\mathbb{R}$ is a distance generating function with parameter $\alpha>0$ if $\omega_{\mathcal{X}}$ is continuously differentiable and strongly convex with parameter $\alpha$.

(ii) (Bregman’s distance) The prox-function associated with $\omega_{\mathcal{X}}$ is $V(x,z):=\omega_{\mathcal{X}}(z)-\omega_{\mathcal{X}}(x)-\langle\nabla\omega_{\mathcal{X}}(x),z-x\rangle$ .

(iii) The prox-mapping is $P_{x,\mathcal{X}}(y):=\arg\min_{z\in\mathcal{X}}\{\langle y,z\rangle+V(x,z)\}$ .

Without loss of generality, we may assume that $\alpha=1$ in part (i) of the preceding definition since we can always re-scale $\omega_{\mathcal{X}}(x)$ to become $\overline{\omega}_{\mathcal{X}}(x)=\omega_{\mathcal{X}}(x)/\alpha$ . The distance generating function $\omega_{\mathcal{X}}$ gives a measure of the diameter of $\mathcal{X}$ , i.e. $D_{\mathcal{X}}:=\sqrt{\max_{x,z\in\mathcal{X}}V(x,z)}$ . Clearly, the diameter satisfies $D_{\mathcal{X}}<\infty$ as long as $\mathcal{X}$ is bounded.

We assume that the prox-function $V(x,z)$ is chosen such that the prox-mapping $P_{x,\mathcal{X}}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ can be easily computed. The next result follows from the definition of the prox-function.

Lemma 2.4.

[38, Lemma 2.1] For every $u,x\in\mathcal{X}$ and $y\in\mathbb{R}^{n}$, we have

$$V(P_{x,\mathcal{X}}(y),u)\leq V(x,u)+\langle y,u-x\rangle+\frac{1}{2}\|y\|^{2}.$$
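As a concrete instance, consider the Euclidean setup, which is an assumption made here for illustration and not a choice prescribed above: with $\omega_{\mathcal{X}}(x)=\frac{1}{2}\|x\|^{2}$ the prox-function is $V(x,z)=\frac{1}{2}\|z-x\|^{2}$, and the prox-mapping reduces to the Euclidean projection of $x-y$ onto $\mathcal{X}$. The sketch below takes $\mathcal{X}$ to be a box (a hypothetical choice, so that the projection is a clip) and checks the inequality of Lemma 2.4, $V(P_{x,\mathcal{X}}(y),u)\leq V(x,u)+\langle y,u-x\rangle+\frac{1}{2}\|y\|^{2}$, on random points.

```python
import numpy as np

LOW, HIGH = -2.0, 2.0  # hypothetical box X = [LOW, HIGH]^n


def bregman(x, z):
    """Prox-function V(x, z) for omega(x) = 0.5 * ||x||^2."""
    return 0.5 * float(np.sum((z - x) ** 2))


def prox_mapping(x, y):
    """P_{x,X}(y) = argmin_z <y, z> + V(x, z): the projection of x - y onto the box."""
    return np.clip(x - y, LOW, HIGH)


def lemma_gap(x, y, u):
    """Right-hand side minus left-hand side of Lemma 2.4 (expected to be >= 0)."""
    rhs = bregman(x, u) + float(y @ (u - x)) + 0.5 * float(y @ y)
    return rhs - bregman(prox_mapping(x, y), u)


rng = np.random.default_rng(0)
gaps = [
    lemma_gap(rng.uniform(LOW, HIGH, 3), rng.normal(size=3), rng.uniform(LOW, HIGH, 3))
    for _ in range(1000)
]
min_gap = min(gaps)  # smallest slack observed over the random trials
```

In this Euclidean case the inequality follows from the nonexpansiveness of the projection, so the smallest slack should be nonnegative up to rounding.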

3 General Error Bounds for Inexact CSA

In this section, we derive general error bounds for inexact CSA applied to Problem (1). These error bounds form the basis of our convergence analysis for the two specialized CSA algorithms that we consider in the next section.

The (general) CSA algorithm works as follows. We let $\left\{x_{k}\right\}_{k\geq 1}$ denote the sequence of iterates of the algorithm, $\left\{\gamma_{k}\right\}_{k\geq 1}$ a sequence of step-sizes with all $\gamma_{k}>0$ , and $\left\{\eta_{k}\right\}_{k\geq 1}$ a sequence of error tolerances for constraint violation with all $\eta_{k}>0$ . At each iteration $k\geq 1$ , we need to solve the cut generation problem

$$\max_{\delta\in\Delta}g(x_{k},\delta)\tag{2}$$

to determine if $x_{k}$ is feasible or to identify any violated constraints. After we obtain

$$\delta_{k}\approx\arg\max_{\delta\in\Delta}g(x_{k},\delta),$$

CSA performs a projected subgradient step with step-size $\gamma_{k}$ along either $f^{\prime}(x_{k})$ or $g^{\prime}(x_{k},\delta_{k})$ , depending on whether the condition $g(x_{k},\delta_{k})\leq\eta_{k}$ is satisfied (i.e. depending on whether the constraint violation is below our error tolerance or not).

Let $N$ denote the total number of iterations of the algorithm. For some $1\leq s\leq N$ , we may partition the indices

$$I:=\{s,\ldots,N\}$$

into two subsets:

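$$\mathcal{B}:=\{s\leq k\leq N\mid g(x_{k},\delta_{k})\leq\eta_{k}\}\quad\text{and}\quad\mathcal{N}:=I\setminus\mathcal{B}.$$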

The set $\mathcal{B}$ counts those iterations within $I$ for which the constraint violation of $x_{k}$ corresponding to $\delta_{k}\approx\arg\max_{\delta\in\Delta}g(x_{k},\delta)$ is less than our tolerance $\eta_{k}$ . When the algorithm terminates, it returns the weighted average

$$\overline{x}_{N,s}:=\frac{\sum_{k\in\mathcal{B}}\gamma_{k}x_{k}}{\sum_{k\in\mathcal{B}}\gamma_{k}}$$

of iterates over $\mathcal{B}$ (which only indexes those iterates where we believe the constraint violation is small). The general inexact CSA algorithm is summarized in Algorithm 1.
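Algorithm 1 itself is not reproduced in this text, so the following is a minimal sketch reconstructed from the description above, not the authors' listing. It assumes the Euclidean prox-mapping (projection of $x_{k}-\gamma_{k}h_{k}$ onto $\mathcal{X}$), takes the step-sizes, tolerances, and cut oracle as caller-supplied callables, and runs on a hypothetical toy instance: minimize $x_{1}+x_{2}$ over the box $[-2,2]^{2}$ subject to $\cos(\delta)x_{1}+\sin(\delta)x_{2}-1\leq 0$ for all $\delta\in[0,2\pi]$, whose feasible set is the unit disc and whose optimal value is $-\sqrt{2}$. The cut oracle is a deterministic grid search, used only as a stand-in for an inexact solver, and the constants in `gamma` and `eta` are illustrative choices.

```python
import numpy as np


def inexact_csa(x1, f_grad, g_val, g_grad, cut_oracle, project, gamma, eta, N, s):
    """Inexact CSA sketch: returns the gamma-weighted average of the iterates in B."""
    x = np.asarray(x1, dtype=float)
    weighted_sum, weight_total = np.zeros_like(x), 0.0
    for k in range(1, N + 1):
        delta = cut_oracle(x)              # approximate argmax of g(x_k, .)
        ok = g_val(x, delta) <= eta(k)     # is the violation within tolerance eta_k?
        if ok and k >= s:                  # then k belongs to the index set B
            weighted_sum += gamma(k) * x
            weight_total += gamma(k)
        h = f_grad(x) if ok else g_grad(x, delta)
        x = project(x - gamma(k) * h)      # Euclidean prox step
    return weighted_sum / weight_total if weight_total > 0 else None


# Hypothetical toy instance (not from the paper).
GRID = np.linspace(0.0, 2.0 * np.pi, 720, endpoint=False)


def g_val(x, d):
    return np.cos(d) * x[0] + np.sin(d) * x[1] - 1.0


def g_grad(x, d):
    return np.array([np.cos(d), np.sin(d)])


def grid_oracle(x):
    return GRID[np.argmax(g_val(x, GRID))]


N = 4000
x_bar = inexact_csa(
    x1=np.zeros(2),
    f_grad=lambda x: np.ones(2),
    g_val=g_val,
    g_grad=g_grad,
    cut_oracle=grid_oracle,
    project=lambda z: np.clip(z, -2.0, 2.0),
    gamma=lambda k: 0.5 / np.sqrt(k),
    eta=lambda k: 1.0 / np.sqrt(k),
    N=N,
    s=N // 2,
)
objective = float(x_bar.sum())                  # f(x_bar)
violation = float(np.linalg.norm(x_bar) - 1.0)  # G(x_bar) for this instance
```

Only iterates that pass the tolerance test enter the average, which is why the returned point inherits a constraint violation of at most the largest tolerance used over $\mathcal{B}$, plus the oracle's error.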

The cut generation problem $\max_{\delta\in\Delta}g(x,\delta)$ is typically a non-convex optimization problem. Generally speaking, there is no fast algorithm that can solve this problem deterministically. In our case, the error in each iteration comes from inexact solution of $\max_{\delta\in\Delta}g(x_{k},\delta)$ . We denote the error in cut generation as

$$\varepsilon_{k}:=G(x_{k})-g(x_{k},\delta_{k}),\quad\forall k\geq 1.$$

Note that the errors $\left\{\varepsilon_{k}\right\}_{k\geq 1}$ are always nonnegative since $G\left(x\right)\geq g\left(x,\,\delta\right)$ for all $\delta\in\Delta$ by definition.

Below we give a specific selection of the parameters $\{\eta_{k}\}_{k\geq 1}$ , $\{\gamma_{k}\}_{k\geq 1}$ , and $s$ to be used in Algorithm 1:

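$$\eta_{k}=\frac{6(L_{f}+L_{g,\mathcal{X}})D_{\mathcal{X}}}{\sqrt{k}},\quad\gamma_{k}=\frac{D_{\mathcal{X}}}{\sqrt{k}(L_{f}+L_{g,\mathcal{X}})},\quad k=1,2,\ldots,N,\quad s=\left\lceil\frac{N}{2}\right\rceil,\tag{4}$$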

for all $N\geq 1$ . The following result shows that $\overline{x}_{N,s}$ is well-defined under this policy.

Lemma 3.1.

Suppose $\left\{x_{k}\right\}_{k\geq 1}$ is generated by Algorithm 1 with policy (4), then the set $\mathcal{B}\neq\emptyset$ , i.e., $\overline{x}_{N,s}$ is well-defined.

Now we will bound the optimality gap and constraint violation of $\bar{x}_{N,\,s}$ in terms of the errors $\left\{\varepsilon_{k}\right\}_{k\geq 1}$ from inexact cut generation. The result of Theorem 3.2 is online since policy (4) does not depend on knowing $N$ in advance, and thus we may stop or continue the algorithm anytime. In particular, the weighted average $\overline{x}_{N,\,s}$ from Theorem 3.2 gives decreasing weight to older iterates $\{x_{k}\}_{k\geq 1}$ .

Theorem 3.2.

Suppose $\left\{x_{k}\right\}_{k\geq 1}$ is generated by Algorithm 1 with policy (4), then for any $N\geq 1$ we have

$$f(\overline{x}_{N,s})-f(x^{*})\leq\frac{6D_{\mathcal{X}}(L_{f}+L_{g,\mathcal{X}})}{\sqrt{N}},$$

and

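$$G(\overline{x}_{N,s})\leq\frac{12D_{\mathcal{X}}(L_{f}+L_{g,\mathcal{X}})}{\sqrt{N}}+\frac{\sum_{k\in\mathcal{B}}\varepsilon_{k}/\sqrt{k}}{\sum_{k\in\mathcal{B}}1/\sqrt{k}}.$$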

Remark 3.3.

The bound on the optimality gap does not depend on the errors $\{\varepsilon_{k}\}_{k\geq 1}$ in cut generation, since objective function evaluations are error free (in contrast to inexact evaluation of the constraint function $G\left(x\right)$ ).

We can improve the $O\left(1/\sqrt{N}\right)$ convergence rate when the objective function $f(\cdot)$ and the constraint functions $\left\{g(\cdot,\delta)\right\}_{\delta\in\Delta}$ are all strongly convex. To proceed, we introduce a new assumption on the quadratic growth of the prox-function $V(\cdot,\cdot)$ .

Assumption 3.4.

(i) The objective function $f$ is strongly convex with parameter $\mu_{f}>0$ , and the constraint functions $g(\cdot,\delta)$ are all strongly convex with parameter $\mu_{g}>0$ (uniformly in all $\delta\in\Delta$ ).

(ii) There exists $L>0$ , such that $V(x,z)\leq\frac{L}{2}\|x-z\|^{2},\forall x,z\in\mathcal{X}$ .

The constants in Assumption 3.4 appear in our parameter selection policy for the strongly convex case. For all $k=1,2,\ldots,N$ , let $\gamma_{k}$ be the step-sizes used in our algorithms, and denote

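$$a_{k}=\begin{cases}\frac{\mu_{f}\gamma_{k}}{L},&\text{if }g(x_{k},\delta_{k})\leq\eta_{k},\\ \frac{\mu_{g}\gamma_{k}}{L},&\text{otherwise},\end{cases}\quad A_{k}=\begin{cases}1,&k=1,\\ (1-a_{k})A_{k-1},&2\leq k\leq N,\end{cases}\quad\text{and}\quad\rho_{k}=\frac{\gamma_{k}}{A_{k}}.$$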

For the strongly convex case, the output of Algorithm 1 is modified to

$$\overline{x}_{N,s}=\frac{\sum_{k\in\mathcal{B}}\rho_{k}x_{k}}{\sum_{k\in\mathcal{B}}\rho_{k}}.$$

Our new policy is given as follows: for $k=1,2,\ldots,N,$

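$$\eta_{k}=\frac{8L}{N}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\},\quad\gamma_{k}=\begin{cases}\frac{2L}{\mu_{f}(k+1)},&\text{if }g(x_{k},\delta_{k})\leq\eta_{k},\\ \frac{2L}{\mu_{g}(k+1)},&\text{otherwise},\end{cases}\quad s=1.\tag{5}$$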

The following result shows that $\overline{x}_{N,s}$ is well-defined for this policy as well.

Lemma 3.5.

Suppose Assumption 3.4 holds. Suppose $\left\{x_{k}\right\}_{k\geq 1}$ is generated by Algorithm 1 with policy (5), then the set $\mathcal{B}\neq\emptyset$ , i.e., $\overline{x}_{N,s}$ is well-defined.

Now we give an improved error bound for inexact CSA under policy (5) for the strongly convex case.

Theorem 3.6.

Suppose Assumption 3.4 holds. Let $\left\{x_{k}\right\}_{k\geq 1}$ be generated by Algorithm 1 with policy (5), then for any $N\geq 1$ we have

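$$f(\overline{x}_{N,s})-f(x^{*})\leq\frac{8L}{N+1}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\},$$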

and

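$$G(\overline{x}_{N,s})\leq\frac{8L}{N}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\}+\frac{\sum_{k\in\mathcal{B}}k\varepsilon_{k}}{\sum_{k\in\mathcal{B}}k}.$$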

Remark 3.7.

In the strongly convex case, the convergence rate may be improved to $O\left(1/N\right)$ if the errors in cut generation decrease at an appropriate rate.
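The error term $\sum_{k\in\mathcal{B}}k\varepsilon_{k}/\sum_{k\in\mathcal{B}}k$ in the constraint bound of Theorem 3.6 weights each error by $k$. One way to see where this comes from is to unroll the recursion for the strongly convex case: with $\gamma_{k}=2L/(\mu(k+1))$, where $\mu$ is $\mu_{f}$ or $\mu_{g}$ depending on the branch, one gets $a_{k}=\mu\gamma_{k}/L=2/(k+1)$ in both branches, hence $A_{k}=2/(k(k+1))$ and $\rho_{k}=\gamma_{k}/A_{k}=Lk/\mu$. The sketch below checks this with exact rational arithmetic; the function name and the constants are hypothetical.

```python
from fractions import Fraction


def averaging_weights(mus, L):
    """Return [(A_k, rho_k)] for k = 1..len(mus) under the strongly convex policy.

    mus[k-1] is the modulus used at iteration k: mu_f on an objective step,
    mu_g on a constraint step.
    """
    A = Fraction(1)                               # A_1 = 1
    out = []
    for k, mu in enumerate(mus, start=1):
        gamma = Fraction(2) * L / (mu * (k + 1))  # step-size gamma_k
        a = mu * gamma / L                        # equals 2 / (k + 1) in both branches
        if k >= 2:
            A = (1 - a) * A                       # A_k = (1 - a_k) * A_{k-1}
        out.append((A, gamma / A))                # rho_k = gamma_k / A_k
    return out


# Hypothetical constants, chosen only to exercise both branches.
L, mu_f, mu_g = Fraction(3), Fraction(1, 2), Fraction(2)
mus = [mu_f if k % 3 else mu_g for k in range(1, 13)]
pairs = averaging_weights(mus, L)
```

Since every $k\in\mathcal{B}$ is an objective step, all averaged iterates share the factor $L/\mu_{f}$, which cancels in the ratio and leaves weights proportional to $k$, so later iterates count more.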

4 Random Constraint Sampling

As we have already pointed out, the cut generation Problem (2) is a general nonlinear non-convex optimization problem, and there is no fast algorithm that can solve such a problem deterministically. In this section, we describe two random constraint sampling schemes that can approximately solve the cut generation problem. The first scheme is based on sampling from a fixed probability distribution (Subsection 4.1), while the second scheme is based on sampling adaptively from a probability distribution that is updated in each iteration based on the current iterate (Subsection 4.2).

4.1 Fixed Constraint Sampling

In this subsection, we approximately solve the cut generation Problem (2) by sampling from a fixed distribution on $\Delta$ . To begin, we take a probability distribution $Q$ on $\Delta$ as user input. To solve Problem (2) at iteration $k\geq 1$ , we let $\delta_{k}^{(1)},\delta_{k}^{(2)},\ldots,\delta_{k}^{(M_{k})}$ (where $M_{k}\geq 1$ is the sample size for all $k\geq 1$ ) be independent identically distributed (i.i.d.) samples from $\Delta$ generated according to $Q$ . Then, we define

$$\delta_{k}\in\arg\max_{i=1,\ldots,M_{k}}g(x_{k},\delta_{k}^{(i)})$$

to be the element among $\left\{\delta_{k}^{(i)}\right\}_{i=1}^{M_{k}}$ which maximizes $\left\{g\left(x_{k},\,\delta_{k}^{\left(i\right)}\right)\right\}_{i=1}^{M_{k}}$ .
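The sampling step, which picks among $M_{k}$ i.i.d. draws from $Q$ the index with the largest constraint value, can be sketched as follows. The function and argument names are hypothetical, and the toy constraint with $Q$ uniform on $[0,2\pi)$ is an assumption made for illustration only.

```python
import numpy as np


def fixed_sampling_cut(x, g, draw, M, rng):
    """Draw M i.i.d. indices from Q via draw(rng); return the one maximizing g(x, .)."""
    samples = [draw(rng) for _ in range(M)]
    values = [g(x, d) for d in samples]
    best = int(np.argmax(values))
    return samples[best], float(values[best])


def g(x, d):
    # Toy constraint whose exact maximum over delta is G(x) = ||x|| - 1.
    return np.cos(d) * x[0] + np.sin(d) * x[1] - 1.0


def draw(rng):
    # Q: uniform distribution on [0, 2*pi).
    return rng.uniform(0.0, 2.0 * np.pi)


x = np.array([1.5, 0.0])
G_x = float(np.linalg.norm(x) - 1.0)
_, v_small = fixed_sampling_cut(x, g, draw, 10, np.random.default_rng(0))
_, v_large = fixed_sampling_cut(x, g, draw, 200, np.random.default_rng(0))
eps_small, eps_large = G_x - v_small, G_x - v_large  # cut generation errors
```

Both runs use the same seed, so the 10 draws are a prefix of the 200 draws and the larger sample cannot have a larger error; in general the error only shrinks in distribution as $M_{k}$ grows.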

We need the following assumption on the sampling distribution $Q$ .

Assumption 4.1.

There exists a strictly increasing function $\varphi:\mathbb{R}_{+}\rightarrow[0,1]$ such that $Q\{B_{r}(\delta)\}\geq\varphi(r)$ , for all $\delta\in\Delta$ and all open balls $B_{r}(\delta)\subset\Delta$ .

The above assumption means that $Q$ has support on all of $\Delta$; it also appears in Proposition 3.8 of [12]. For more discussion, the reader is referred to Assumption 3.1 of [26].

Intuitively, as long as the number of samples $M$ is large enough, we expect $\max_{1\leq i\leq M}g(x,\delta^{(i)})$ will be close to $G(x)$ with high probability with respect to $Q^{M}$ . We have a result in expectation for the approximation quality. For $\varepsilon$ , $\beta$ in $(0,1)$ , we define

$$M(\varepsilon,\beta):=\left\lceil\frac{\ln\beta}{\ln(1-\varepsilon)}\right\rceil,$$

which serves as the sample-size threshold in the next result. Since $(x,\delta)\rightarrow g(x,\delta)$ is continuous and $\mathcal{X}$ and $\Delta$ are compact, $g$ is bounded on $\mathcal{X}\times\Delta$; we denote lower and upper bounds by $\underline{M}$ and $\overline{M}$, respectively, i.e.,

$$\underline{M}\leq\min_{x\in\mathcal{X},\delta\in\Delta}g(x,\delta)\leq\max_{x\in\mathcal{X},\delta\in\Delta}g(x,\delta)\leq\overline{M}.$$

Proposition 4.2.

Suppose Assumption 4.1 holds. Given $\epsilon>0$ , for $M\geq M(\varphi(\frac{\epsilon}{2L_{g,\Delta}}),\frac{\epsilon}{2(\overline{M}-\underline{M})})$ i.i.d. samples generated from $Q$ , we have $\mathbb{E}_{Q^{M}}\left[\max_{1\leq i\leq M}g(x,\delta^{(i)})\right]\geq G(x)-\epsilon$ .
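The threshold $M(\varepsilon,\beta)=\lceil\ln\beta/\ln(1-\varepsilon)\rceil$ is the smallest integer $M$ with $(1-\varepsilon)^{M}\leq\beta$: if a single draw lands in a target set with probability at least $\varepsilon$, then $M$ independent draws all miss it with probability at most $\beta$. A small sketch of the computation (the function name is hypothetical):

```python
import math


def sample_size(eps, beta):
    """Smallest integer M with (1 - eps)**M <= beta, for eps and beta in (0, 1)."""
    if not (0.0 < eps < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("eps and beta must lie in (0, 1)")
    # log1p(-eps) evaluates ln(1 - eps) accurately when eps is small.
    return math.ceil(math.log(beta) / math.log1p(-eps))


M = sample_size(0.1, 0.05)  # 29 draws make the miss probability at most 5%
```

In Proposition 4.2 both arguments shrink with the target accuracy $\epsilon$, so the required number of samples grows as $\epsilon$ decreases.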

We now investigate the convergence of inexact CSA based on this fixed sampling scheme. We define

$$\mathcal{Q}=\mathcal{Q}(\{M_{k}\}_{k\in\mathcal{B}}):=\times_{k\in\mathcal{B}}Q^{M_{k}}$$

to be the probability distribution of the samples $\Big{\{}\{\delta_{k}^{(i)}\}_{i=1}^{M_{k}}\Big{\}}_{k\in\mathcal{B}}$ on the space $\times_{k\in\mathcal{B}}\Delta^{M_{k}}$ .

Theorem 4.3.

Suppose Assumption 4.1 holds. Suppose $\left\{x_{k}\right\}_{k\geq 1}$ is generated by Algorithm 1 under policy (4). Take $\epsilon_{k}=(L_{f}+L_{g,\mathcal{X}})D_{\mathcal{X}}/\sqrt{k}$ , and $M_{k}\geq M(\varphi(\frac{\epsilon_{k}}{2L_{g,\Delta}}),\frac{\epsilon_{k}}{2(\overline{M}-\underline{M})})$ for all $k\geq 1$ . Then, for any $N\geq 1$ , we have

$$f(\overline{x}_{N,s})-f(x^{*})\leq\frac{6D_{\mathcal{X}}(L_{f}+L_{g,\mathcal{X}})}{\sqrt{N}},$$

and

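$$\mathbb{E}_{\mathcal{Q}}[G(\overline{x}_{N,s})]\leq\frac{14D_{\mathcal{X}}(L_{f}+L_{g,\mathcal{X}})}{\sqrt{N}}.$$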

In view of Theorem 4.3 we see that inexact CSA with fixed random constraint sampling achieves an $\mathcal{O}(1/\sqrt{N})$ rate of convergence in expectation (with respect to $\mathcal{Q}$ ) for solving Problem (1) in the general convex case. Next we consider an improved convergence rate for the strongly convex case.

Theorem 4.4.

Suppose Assumptions 3.4 and 4.1 hold. Suppose $\left\{x_{k}\right\}_{k\geq 1}$ is generated by Algorithm 1 under policy (5). Take $\epsilon_{k}=\frac{L}{N}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\}$ , and $M_{k}\geq M(\varphi(\frac{\epsilon_{k}}{2L_{g,\Delta}}),\frac{\epsilon_{k}}{2(\overline{M}-\underline{M})})$ for all $k\geq 1$ . Then, for any $N\geq 1$ , we have

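$$f(\overline{x}_{N,s})-f(x^{*})\leq\frac{8L}{N+1}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\},$$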

and

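$$\mathbb{E}_{\mathcal{Q}}[G(\overline{x}_{N,s})]\leq\frac{9L}{N}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\}.$$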

4.2 Adaptive Constraint Sampling

In this subsection, we consider an alternative adaptive constraint sampling scheme for the cut generation Problem (2). In particular, in iteration $k$ we will construct a constraint sampling distribution that is tailored to the current iterate $x_{k}$ . More specifically, for any $\epsilon>0$ and $x\in\mathcal{X}$ , we want to find a probability distribution $\phi=\phi\left(x,\,\epsilon\right)\in\mathcal{P}(\Delta)$ (which depends on $x$ and $\epsilon$ ) on $\Delta$ , such that

$$\mathbb{E}_{\delta\sim\phi}[g(x,\delta)]\geq G(x)-\epsilon,$$

which guarantees that the samples generated from this distribution are very likely to solve our cut generation Problem (2). Then, in each iteration we will construct such a distribution from $x_{k}$ , and use it to guide our next round of random constraint sampling.

To continue, we introduce a new assumption on the set $\Delta$ .

Assumption 4.5.

The set $\Delta\subset\mathbb{R}^{d}$ is full dimensional and convex.

The following preliminary lemma is key for our adaptive sampling scheme. It establishes an equivalence between the general nonlinear finite-dimensional optimization problem $\max_{\delta\in\Delta}g(x,\delta)$ and the infinite-dimensional linear optimization problem

$$\max_{\phi\in\mathcal{P}(\Delta)}\mathbb{E}_{\delta\sim\phi}[g(x,\delta)]$$

in probability distributions.

Lemma 4.6.

For all $x\in\mathcal{X}$ , $G(x)=\max_{\delta\in\Delta}g(x,\delta)=\max_{\phi\in\mathcal{P}(\Delta)}\mathbb{E}_{\widetilde{\delta}\sim\phi}\left[g\left(x,\widetilde{\delta}\right)\right]$ .

Let $\phi_{u}$ denote the uniform probability distribution on $\Delta$ , that is,

$$\phi_{u}(\delta):=\frac{1}{\int_{\Delta}d\delta^{\prime}},\quad\forall\,\delta\in\Delta.$$

We define a regularized cut generation problem as follows,

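$$\max_{\phi\in\mathcal{P}(\Delta)}\left\{\mathbb{E}_{\widetilde{\delta}\sim\phi}\left[g\left(x,\widetilde{\delta}\right)\right]-\kappa\,D\left(\phi,\phi_{u}\right)\right\},\qquad(6)$$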

where $\kappa\in(0,1]$ is the regularization parameter. The mapping $\phi\rightarrow D\left(\phi,\phi_{u}\right)$ is convex, thus the regularized cut generation Problem (6) is an infinite-dimensional convex optimization problem. We can expect that if the regularization parameter $\kappa$ is small enough, the solution of the regularized Problem (6) provides useful information to solve our cut generation Problem (2).

We will show that the regularized cut generation Problem (6) is well defined. In particular, we show that the maximizer (which depends on $x$ and $\kappa$ )

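$$\phi_{\kappa,\,x}\in\arg\max_{\phi\in\mathcal{P}(\Delta)}\left\{\mathbb{E}_{\widetilde{\delta}\sim\phi}\left[g\left(x,\widetilde{\delta}\right)\right]-\kappa\,D\left(\phi,\phi_{u}\right)\right\}$$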

is attained and is given in closed form. The next lemma is based on calculus of variations.

Lemma 4.7.

For any $\kappa\in(0,1]$ and $x\in\mathcal{X}$ , the maximizer of Problem (6) is attained, and it is

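$$\phi_{\kappa,\,x}(\delta)=\frac{\exp\left(g(x,\delta)/\kappa\right)}{\int_{\Delta}\exp\left(g(x,\delta^{\prime})/\kappa\right)d\delta^{\prime}},\quad\forall\,\delta\in\Delta.$$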

Since $\Delta\subset\mathbb{R}^{d}$ is full dimensional, we may let $R_{\Delta}$ be the radius of the largest ball which can be included in $\Delta$ . Specifically, there exists $\delta_{0}\in\Delta$ such that the Euclidean ball $B_{R_{\Delta}}(\delta_{0})\subseteq\Delta$ . Define

$$r:=\frac{\int_{B_{R_{\Delta}}(\delta_{0})}d\delta}{\int_{\Delta}d\delta}$$

to be the ratio between the volume of the largest such ball $B_{R_{\Delta}}(\delta_{0})$ and the volume of $\Delta$ (necessarily $r\leq 1$ ). The following result demonstrates that the gap between the cut generation Problem (2) and its regularization (6) can be made arbitrarily small through our control of $\epsilon$ . Let $D_{\Delta}:=\max_{\delta,\delta^{\prime}\in\Delta}\left\|\delta-\delta^{\prime}\right\|$ denote the Euclidean diameter of $\Delta$ and define $C:=L_{g,\Delta}(R_{\Delta}+D_{\Delta})-\log(r)$ . We also define

$$\kappa(\epsilon):=\min\left\{\frac{\epsilon}{2C},\left(\frac{\epsilon}{2d}\right)^{2},1\right\}.$$
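As a quick numerical illustration of these definitions, the sketch below computes $C$ and the regularization parameter, using the form $\kappa(\epsilon)=\min\{\epsilon/(2C),(\epsilon/(2d))^{2},1\}$ that is used in the proof of Proposition 4.8. The function name `kappa_of_eps`, the example set $\Delta=[0,1]$ (so $d=1$, $R_{\Delta}=1/2$, $D_{\Delta}=1$, $r=1$), and the value $L_{g,\Delta}=1.4$ are illustrative assumptions and are not taken from the paper.

```python
import math

def kappa_of_eps(eps, d, L_g_delta, R_delta, D_delta, r):
    """Regularization parameter kappa(eps) = min{eps/(2C), (eps/(2d))^2, 1}."""
    # C := L_{g,Delta} (R_Delta + D_Delta) - log(r), with volume ratio r in (0, 1]
    C = L_g_delta * (R_delta + D_delta) - math.log(r)
    return min(eps / (2.0 * C), (eps / (2.0 * d)) ** 2, 1.0)

# Hypothetical example: Delta = [0, 1], so d = 1, R = 0.5, D = 1, r = 1.
kappa = kappa_of_eps(eps=0.1, d=1, L_g_delta=1.4, R_delta=0.5, D_delta=1.0, r=1.0)
print(kappa)  # the quadratic term (eps / (2d))^2 is the binding one here
```

For small tolerances the quadratic term $(\epsilon/(2d))^{2}$ typically determines $\kappa(\epsilon)$, so the sampling density concentrates quickly as $\epsilon$ decreases.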

Proposition 4.8.

Suppose Assumption 4.5 holds and choose $\epsilon>0$ . For any $x\in\mathcal{X}$ ,

[TABLE]

and

[TABLE]

From (7), we see that the solution of the regularized cut generation Problem (6) provides a solution of the inequality $\mathbb{E}_{\widetilde{\delta}\sim\phi_{\mathfrak{\kappa(\epsilon)},\,x}}\left[g\left(x,\widetilde{\delta}\right)\right]\geq G(x)-\epsilon$ .
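This inequality can be checked numerically on a one-dimensional toy instance. The sketch below assumes that the maximizer of Problem (6) is the Gibbs density proportional to $\exp(g(x,\delta)/\kappa)$, which is consistent with the log-integral term appearing in the proof of Proposition 4.8. The test function $g(\delta)=-(\delta-0.3)^{2}$ on $\Delta=[0,1]$, the values $\kappa=0.0025$ and $\epsilon=0.1$, and the grid quadrature are illustrative assumptions.

```python
import math

def gibbs_expectation(g, kappa, lo=0.0, hi=1.0, n=20001):
    """Approximate E[g] under the density proportional to exp(g / kappa) on [lo, hi]."""
    points = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    values = [g(p) for p in points]
    g_max = max(values)
    # Subtract g_max before exponentiating for numerical stability.
    weights = [math.exp((v - g_max) / kappa) for v in values]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    return mean, g_max

eps = 0.1
mean_g, g_max = gibbs_expectation(lambda t: -(t - 0.3) ** 2, kappa=0.0025)
print(mean_g, g_max - eps)  # the first value should not fall below the second
```

In this instance the density is essentially Gaussian around the maximizer, so the expected value of $g$ sits far closer to $G(x)$ than the tolerance $\epsilon$ requires, which reflects how conservative the choice of $\kappa(\epsilon)$ can be.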

The adaptive constraint sampling scheme works as follows. Suppose we are given tolerances $\left\{\epsilon_{k}\right\}_{k\geq 1}$ with $\epsilon_{k}>0$ for all $k\geq 1$ . At iteration $k\geq 1$ , we sample from the probability density $\phi_{\mathfrak{\kappa}(\epsilon_{k}),\,x_{k}}$ . Let $\delta_{k}^{(1)},\delta_{k}^{(2)},\ldots,\delta_{k}^{(M_{k})}$ (with $M_{k}\geq 1$ ) be i.i.d. samples from $\Delta$ generated according to $\phi_{\mathfrak{\kappa}(\epsilon_{k}),\,x_{k}}$ , and again define $\delta_{k}\in\arg\max_{i=1,\ldots,\,M_{k}}g\left(x_{k},\,\delta_{k}^{\left(i\right)}\right)$ to be a maximizer of $\{g(x_{k},\delta_{k}^{(i)})\}_{i=1}^{M_{k}}$ .

Proposition 4.9.

Suppose Assumption 4.5 holds. Given $\epsilon>0$ and $x\in\mathcal{X}$ , for any $M\geq 1$ , let $\delta^{\left(1\right)},\ldots,\,\delta^{\left(M\right)}$ be i.i.d. samples from $\phi_{\mathfrak{\kappa}(\epsilon),\,x}$ , then $\mathbb{E}_{\phi_{\kappa(\epsilon),\,x}^{M}}\left[\max_{1\leq i\leq M}g(x,\delta^{(i)})\right]\geq G(x)-\epsilon$ .

Now we consider the convergence rate of the adaptive sampling scheme. We define the distribution

$$\mathcal{P}:=\times_{k\in\mathcal{B}}\phi_{\kappa(\epsilon_{k}),\,x_{k}}^{M_{k}}$$

of the samples $\Big{\{}\{\delta_{k}^{(i)}\}_{i=1}^{M_{k}}\Big{\}}_{k\in\mathcal{B}}$ on the space $\times_{k\in\mathcal{B}}\Delta^{M_{k}}$ .

Theorem 4.10.

Suppose Assumption 4.5 holds. Suppose $\left\{x_{k}\right\}_{k\geq 1}$ is generated according to Algorithm 1 with policy (4). For each $k\geq 1$ and $\epsilon_{k}=(L_{f}+L_{g,\mathcal{X}})D_{\mathcal{X}}/\sqrt{k}$ , we generate $M_{k}\geq 1$ i.i.d. samples according to $\phi_{\mathfrak{\kappa}(\epsilon_{k}),\,x_{k}}$ . Then, for any $N\geq 1$ ,

[TABLE]

and

[TABLE]

As for the fixed sampling scheme, we find an improved convergence rate for the strongly convex case under the adaptive sampling scheme as well.

Theorem 4.11.

Suppose Assumptions 3.4 and 4.5 hold. Suppose $\left\{x_{k}\right\}_{k\geq 1}$ is generated according to Algorithm 1 with policy (5). For each $k\geq 1$ and $\epsilon_{k}=\frac{L}{N}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\}$ , we generate $M_{k}\geq 1$ i.i.d. samples according to $\phi_{\mathfrak{\kappa}(\epsilon_{k}),\,x_{k}}$ . Then, for any $N\geq 1$ ,

[TABLE]

and

[TABLE]

In view of Theorem 4.11, inexact CSA with adaptive sampling achieves an $\mathcal{O}(1/N)$ rate of convergence in expectation, in terms of the optimality gap and constraint violation, in the strongly convex case.

Remark 4.12.

Through Proposition 4.2 and Proposition 4.9, we see two major differences between the fixed sampling and adaptive sampling schemes. First, the fixed sampling scheme requires batch samples, while only one sample per iteration is needed to make the adaptive sampling scheme work due to the inequality $\mathbb{E}_{\widetilde{\delta}\sim\phi_{\mathfrak{\kappa(\epsilon)},\,x}}\left[g\left(x,\widetilde{\delta}\right)\right]\geq G(x)-\epsilon$ . Of course, we get better performance if we use batch sampling under the adaptive sampling since we always have $\max_{1\leq i\leq M}g(x,\delta^{(i)})\geq g(x,\delta^{(i)})$ for each $1\leq i\leq M$ . Second, $\mathcal{P}=\times_{k\in\mathcal{B}}\phi_{\mathfrak{\kappa}(\epsilon_{k}),\,x_{k}}^{M_{k}}$ depends on the error tolerances and the current iterates under the adaptive sampling, while $\mathcal{Q}=\times_{k\in\mathcal{B}}Q^{M_{k}}$ does not under the fixed sampling scheme. There is a trade-off between the two sampling schemes. Under the fixed sampling scheme, we do not need to change the sampling distribution iteration by iteration, but it requires batch samples to achieve a desired cut generation tolerance. Under the adaptive sampling scheme, we need to generate different sampling distributions at different iterations, but the required number of samples is much smaller.

5 Proofs of Main Results

In this section, we provide the proofs for our main results. In Subsection 5.1, we establish the general error bounds for inexact CSA (Theorem 3.2 for the generally convex case and Theorem 3.6 for the strongly convex case). The details of the fixed sampling cut generation scheme (Proposition 4.2) and the corresponding CSA convergence results (Theorems 4.3 and 4.4) are in Subsection 5.2. All material for the adaptive sampling cut generation scheme (Proposition 4.9) and the corresponding CSA convergence results (Theorems 4.10 and 4.11) is in Subsection 5.3.

5.1 General Error Bounds Analysis for Inexact CSA

5.1.1 General Convex Case

The following preliminary result establishes an important recursion for CSA.

Proposition 5.1.

For stepsizes $\{\gamma_{k}\}_{k\geq 1}$ , tolerances $\{\eta_{k}\}_{k\geq 1}$ , and $1\leq s\leq N$ in Algorithm 1, we have

[TABLE]

for all $x\in\mathcal{X}$ .

Proof.

For any $s\leq k\leq N$ , using Lemma 2.4, we have

[TABLE]

Observe that if $k\in\mathcal{B}$ , then $h_{k}=f^{\prime}(x_{k})$ and $\langle h_{k},x_{k}-x\rangle=\langle f^{\prime}(x_{k}),x_{k}-x\rangle$ . Moreover, if $k\in\mathcal{N}$ , then $h_{k}=g^{\prime}(x_{k},\delta_{k})$ and

[TABLE]

Summing up the inequalities in (9) from $k=s$ to $N$ and using the previous two observations, we obtain

[TABLE]

∎

We next present a sufficient condition for the output $\overline{x}_{N,s}$ to be well-defined.

Lemma 5.2.

Suppose

[TABLE]

holds. Then $\mathcal{B}\neq\emptyset$ , i.e., $\overline{x}_{N,s}$ is well-defined. Furthermore, either (i) $|\mathcal{B}|\geq(N-s+1)/2$ or (ii) $\sum_{k\in\mathcal{B}}\gamma_{k}\langle f^{\prime}(x_{k}),x_{k}-x^{*}\rangle<0$ .

Proof.

Fixing $x=x^{*}$ in (8) gives

[TABLE]

If $\sum_{k\in\mathcal{B}}\gamma_{k}\langle f^{\prime}(x_{k}),x_{k}-x^{*}\rangle\geq 0$ , then since $g(x^{*},\delta_{k})\leq G(x^{*})\leq 0$ we have

[TABLE]

Suppose that $|\mathcal{B}|<(N-s+1)/2$ , i.e., $|\mathcal{N}|\geq(N-s+1)/2$ . Then,

[TABLE]

which contradicts (11). Thus, condition (i) holds. Alternatively, if $\sum_{k\in\mathcal{B}}\gamma_{k}\langle f^{\prime}(x_{k}),x_{k}-x^{*}\rangle<0$ , then condition (ii) holds. ∎

We can now prove Lemma 3.1 based on the above result.

Proof of Lemma 3.1.

For $\{\gamma_{k}\}_{k\geq 1}$ , $\{\eta_{k}\}_{k\geq 1}$ , and $s$ chosen as in (4), for $\mathcal{N}\subset\{\lceil N/2\rceil,\lceil N/2\rceil+1,\ldots,N\}$ we have

[TABLE]

and for $\mathcal{B}\cup\mathcal{N}=\{\lceil N/2\rceil,\lceil N/2\rceil+1,\ldots,N\}$ we have

[TABLE]

By Lemma 5.2, $\mathcal{B}\neq\emptyset$ and $\overline{x}_{N,s}$ is well-defined. ∎

The main convergence properties of Algorithm 1 are established next in Proposition 5.3.

Proposition 5.3.

Suppose that $\{\gamma_{k}\}_{k\geq 1}$ and $\{\eta_{k}\}_{k\geq 1}$ are chosen such that (10) holds, and let $\left\{x_{k}\right\}_{k\geq 1}$ be produced by Algorithm 1. Then, for any $1\leq s\leq N$ we have

[TABLE]

and

[TABLE]

Proof.

First, we show that (13) holds. By Lemma 5.2, if $\sum_{k\in\mathcal{B}}\gamma_{k}\langle f^{\prime}(x_{k}),x_{k}-x^{*}\rangle<0$ , then by the convexity of $f$ and the definition of $\overline{x}_{N,s}$ , we have

[TABLE]

If $|\mathcal{B}|\geq(N-s+1)/2$ , we have $\sum_{k\in\mathcal{B}}\gamma_{k}\geq|\mathcal{B}|\min_{k\in\mathcal{B}}\gamma_{k}\geq\frac{N-s+1}{2}\min_{k\in\mathcal{B}}\gamma_{k}$ . By fixing $x=x^{*}$ in (8), it follows from the definition of $\overline{x}_{N,s}$ and the convexity of $f$ that

[TABLE]

Noticing $\sum_{k\in\mathcal{N}}\gamma_{k}\eta_{k}\geq 0$ , it follows that

[TABLE]

We then have (13) by the two inequalities (15) and (16). Next we prove (14). For any $k\in\mathcal{B}$ , we have $g(x_{k},\delta_{k})\leq\eta_{k}$ and so $G(x_{k})\leq\eta_{k}+\varepsilon_{k}$ . From the definition of $\overline{x}_{N,s}$ and the convexity of $G\left(\cdot\right)$ , we then have

[TABLE]

∎

Now we may prove the error bounds for inexact CSA for the generally convex case.

Proof of Theorem 3.2.

The bound on the optimality gap comes from (13). Recall (12), i.e. we have $2D_{\mathcal{X}}^{2}+\sum_{k\in\mathcal{B}}\gamma_{k}^{2}L_{f}^{2}+\sum_{k\in\mathcal{N}}\gamma_{k}^{2}L_{g,\mathcal{X}}^{2}\leq 3D_{\mathcal{X}}^{2}$ . From $s=\lceil\frac{N}{2}\rceil\leq\frac{N}{2}+1$ , $\gamma_{k}=\frac{D_{\mathcal{X}}}{\sqrt{k}(L_{f}+L_{g,\mathcal{X}})}$ , and $\mathcal{B}\subset\{s,\ldots,N\}$ , we obtain $(N-s+1)\min_{k\in\mathcal{B}}\gamma_{k}\geq\frac{N}{2}\frac{D_{\mathcal{X}}}{\sqrt{N}(L_{f}+L_{g,\mathcal{X}})}=\frac{\sqrt{N}D_{\mathcal{X}}}{2(L_{f}+L_{g,\mathcal{X}})}$ . It then follows from (13) that

[TABLE]

The bound on constraint violation is by (14). For any $k\in\mathcal{B}\subset\{\lceil\frac{N}{2}\rceil,\lceil\frac{N}{2}\rceil+1,\ldots,N\}$ , we have $\gamma_{k}\eta_{k}=\frac{6D_{\mathcal{X}}^{2}}{k}\leq\frac{12D_{\mathcal{X}}^{2}}{N}$ and $\gamma_{k}\geq\frac{D_{\mathcal{X}}}{\sqrt{N}(L_{f}+L_{g,\mathcal{X}})}$ . It then follows from (14) that

[TABLE]

∎
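To make the parameter choices in this proof concrete, the following sketch runs an inexact CSA loop on a toy problem. It uses the stepsize $\gamma_{k}=D_{\mathcal{X}}/(\sqrt{k}(L_{f}+L_{g,\mathcal{X}}))$, the tolerance $\eta_{k}=6D_{\mathcal{X}}(L_{f}+L_{g,\mathcal{X}})/\sqrt{k}$ implied by $\gamma_{k}\eta_{k}=6D_{\mathcal{X}}^{2}/k$, and $s=\lceil N/2\rceil$. Everything else is an assumption made for illustration: the toy problem ($\min x_{1}+x_{2}$ subject to $\langle\delta,x\rangle-1\leq 0$ for all $\|\delta\|\leq 1$ on the box $[-2,2]^{2}$), the Euclidean projection step, the value $D_{\mathcal{X}}=4$, the uniform sampling distribution on the unit disc, and the scaling factors `c_g` and `c_e` (in the spirit of the scaling used in Section 6). The update rule is reconstructed from the proofs in this section and may differ in detail from Algorithm 1.

```python
import math
import random

def inexact_csa(N, M, c_g=1.0, c_e=1.0, seed=0):
    """Inexact CSA sketch: min x1 + x2  s.t.  <delta, x> - 1 <= 0 for all ||delta|| <= 1.

    Returns the gamma-weighted average of the iterates in B (None if B is empty)
    together with |B|.
    """
    rng = random.Random(seed)
    L_f, L_g = math.sqrt(2.0), 1.0   # Lipschitz constants of f and g(., delta) in x
    D_X = 4.0                        # assumed size parameter of X = [-2, 2]^2
    s = math.ceil(N / 2)             # averaging starts at s = ceil(N/2)
    x = [0.0, 0.0]
    weighted_sum, weight_total, size_B = [0.0, 0.0], 0.0, 0

    def sample_delta():
        # Uniform draw from the unit disc (the assumed fixed sampling distribution).
        radius, angle = math.sqrt(rng.random()), 2.0 * math.pi * rng.random()
        return (radius * math.cos(angle), radius * math.sin(angle))

    for k in range(1, N + 1):
        gamma = c_g * D_X / (math.sqrt(k) * (L_f + L_g))     # stepsize
        eta = c_e * 6.0 * D_X * (L_f + L_g) / math.sqrt(k)   # tolerance
        # Inexact cut generation: keep the best of M sampled constraints.
        delta = max((sample_delta() for _ in range(M)),
                    key=lambda d: d[0] * x[0] + d[1] * x[1])
        g_val = delta[0] * x[0] + delta[1] * x[1] - 1.0
        if g_val <= eta:
            h = (1.0, 1.0)           # objective step: subgradient of f
            if k >= s:               # such indices form the set B
                weighted_sum[0] += gamma * x[0]
                weighted_sum[1] += gamma * x[1]
                weight_total += gamma
                size_B += 1
        else:
            h = delta                # constraint step: subgradient of g(., delta_k)
        # Projected subgradient step onto the box X.
        x = [min(2.0, max(-2.0, x[i] - gamma * h[i])) for i in range(2)]

    if size_B == 0:
        return None, 0
    return [v / weight_total for v in weighted_sum], size_B

x_bar, size_B = inexact_csa(N=4000, M=100, c_e=0.01, seed=1)
print(size_B, x_bar, x_bar[0] + x_bar[1])  # the toy problem has optimal value -sqrt(2)
```

With the unscaled tolerance (`c_e=1.0`) the threshold $\eta_{k}$ is large for moderate $N$, so most iterations are objective steps and the output can violate the constraint noticeably. The bounds in Theorem 3.2 allow for this, and it is the reason the tolerance is scaled down in the call above.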

5.1.2 Strongly Convex Case

Now we consider the general error bounds for the strongly convex case (Theorem 3.6). The following lemma will be used in subsequent results; its proof is straightforward, so we omit the details. We remind the reader that $\{A_{k}\}_{k\geq 1}$ is defined in Section 3.

Lemma 5.4.

For all $k\geq 1$ , let $a_{k}\in(0,1]$ and $A_{k}>0$ . If sequences $\{\triangle_{k}\}_{k\geq 1}$ and $\{B_{k}\}_{k\geq 1}$ satisfy $\triangle_{k+1}\leq(1-a_{k})\triangle_{k}+B_{k}$ for all $k\geq 1$ , then for any $1\leq s\leq k$ we have

[TABLE]

We remind the reader that $\{\rho_{k}\}_{k\geq 1}$ in the next result is originally defined in Section 3.

Proposition 5.5.

Suppose Assumption 3.4 holds. Choose stepsizes $\{\gamma_{k}\}_{k\geq 1}$ , tolerances $\{\eta_{k}\}_{k\geq 1}$ , and $1\leq s\leq N$ . Let $\left\{x_{k}\right\}_{k\geq 1}$ be produced according to Algorithm 1, then

[TABLE]

for all $x\in\mathcal{X}$ .

Proof.

Consider an iteration $s\leq k\leq N$ . If $k\in\mathcal{B}$ , then by Lemma 2.4 and Assumption 3.4, we have

[TABLE]

Similarly, for $k\in\mathcal{N}$ , by Lemma 2.4 and Assumption 3.4, we have

[TABLE]

Invoking Lemma 5.4, we then obtain

[TABLE]

Rearranging the terms in the above inequality and recalling the definition of $\rho_{k}$ , we arrive at (17). ∎

The following result provides a sufficient condition for $\overline{x}_{N,s}$ to be well-defined.

Lemma 5.6.

Suppose Assumption 3.4 and the condition

[TABLE]

hold. Then, $\mathcal{B}\neq\emptyset$ , i.e., $\overline{x}_{N,s}$ is well-defined. Furthermore, either (i) $|\mathcal{B}|\geq(N-s+1)/2$ or (ii) $\sum_{k\in\mathcal{B}}\rho_{k}(f(x_{k})-f(x^{*}))<0$ holds.

Proof.

By fixing $x=x^{*}$ in (17), we obtain

[TABLE]

If $\sum_{k\in\mathcal{B}}\rho_{k}(f(x_{k})-f(x^{*}))\geq 0$ , noticing $g(x^{*},\delta_{k})\leq G(x^{*})\leq 0$ , we have

[TABLE]

Suppose that $|\mathcal{B}|<(N-s+1)/2$ , i.e., $|\mathcal{N}|\geq(N-s+1)/2$ . Then, by assumption we have

[TABLE]

which contradicts (19). Thus, condition (i) holds. Alternatively, if $\sum_{k\in\mathcal{B}}\rho_{k}(f(x_{k})-f(x^{*}))<0$ then condition (ii) holds. ∎

Based on the above lemma, we may now prove Lemma 3.5.

Proof of Lemma 3.5.

From the selections of $\{\eta_{k}\}_{k\geq 1}$ , $\{\gamma_{k}\}_{k\geq 1}$ , and $s$ in (5), we have for $k\in\{1,2,\ldots,N\}$ , $a_{k}=\frac{2}{k+1}$ , $A_{k}=\frac{2}{(k+1)k}$ , and

[TABLE]

Specifically, $a_{s}=a_{1}=1$ , and

[TABLE]

which implies that

[TABLE]

Also, we have

[TABLE]

By Lemma 5.6, we have $\mathcal{B}\neq\emptyset$ , i.e., $\overline{x}_{N,s}$ is well-defined. ∎
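The closed form for $A_{k}$ used in this proof can be verified with exact arithmetic. The sketch below assumes the recursion $A_{1}=1$ and $A_{k}=(1-a_{k})A_{k-1}$ for $k\geq 2$, which is a common convention in this line of analysis; the definition given in Section 3 takes precedence if it differs. The function name `A_sequence` is hypothetical.

```python
from fractions import Fraction

def A_sequence(n):
    """A_1 = 1 and A_k = (1 - a_k) A_{k-1} with a_k = 2/(k+1) (assumed recursion)."""
    A = [Fraction(1)]
    for k in range(2, n + 1):
        a_k = Fraction(2, k + 1)
        A.append((1 - a_k) * A[-1])
    return A  # A[k-1] holds A_k

A = A_sequence(50)
# Compare against the closed form A_k = 2 / ((k+1) k) stated in the proof.
matches = all(A[k - 1] == Fraction(2, k * (k + 1)) for k in range(1, 51))
print(matches)
```

The agreement follows from the telescoping product $\prod_{i=2}^{k}\frac{i-1}{i+1}=\frac{2}{k(k+1)}$.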

Before we prove Theorem 3.6, we establish the main convergence properties of Algorithm 1 in the following proposition.

Proposition 5.7.

Suppose Assumption 3.4 holds, and suppose that $\{\gamma_{k}\}_{k\geq 1}$ and $\{\eta_{k}\}_{k\geq 1}$ are chosen such that (18) holds. Let $\left\{x_{k}\right\}_{k\geq 1}$ be generated according to Algorithm 1. Then, for any $1\leq s\leq N$ we have

[TABLE]

and

[TABLE]

Proof.

We first show that (21) holds. By Lemma 5.6, we have two cases. If $\sum_{k\in\mathcal{B}}\rho_{k}(f(x_{k})-f(x^{*}))<0$ holds, using the convexity of $f$ and the definition of $\overline{x}_{N,s}$ , we obtain $f(\overline{x}_{N,s})-f(x^{*})<0$ which implies (21). If $|\mathcal{B}|\geq(N-s+1)/2$ , then we have $\sum_{k\in\mathcal{B}}\rho_{k}\geq\min_{\left\{\mathcal{A}\subset I:\left|\mathcal{A}\right|=\left\lceil(N-s+1)/2\right\rceil\right\}}\sum_{k\in\mathcal{A}}\rho_{k}$ . Taking $x=x^{*}$ in (17) and using Assumptions A2 and A4, the definition of $\overline{x}_{N,s}$ , and the fact that $g(x^{*},\delta_{k})\leq G(x^{*})\leq 0$ , we have

[TABLE]

Noticing $\sum_{k\in\mathcal{N}}\rho_{k}\eta_{k}\geq 0$ , it follows that (21) holds.

Next we prove (22). For any $k\in\mathcal{B}$ , we have $g(x_{k},\delta_{k})\leq\eta_{k}$ by definition. Then, for any $k\in\mathcal{B}$ we must have $G(x_{k})\leq\eta_{k}+\varepsilon_{k}$ . From the definition of $\overline{x}_{N,s}$ , and the convexity of $G$ , we obtain

[TABLE]

∎

We now have the machinery in place to prove the error bound for inexact CSA in the strongly convex case.

Proof of Theorem 3.6.

We bound the optimality gap by (21) as follows. Recalling (20), we have

[TABLE]

Further, we have $\min_{\left\{\mathcal{A}\subset I:\left|\mathcal{A}\right|=\left\lceil(N-s+1)/2\right\rceil\right\}}\sum_{k\in\mathcal{A}}\rho_{k}\geq\sum_{k=1}^{\left\lceil N/2\right\rceil}\frac{Lk}{\max\left\{\mu_{f},\mu_{g}\right\}}\geq\frac{LN(N+1)}{8\max\left\{\mu_{f},\mu_{g}\right\}}$ . It then follows from (21) that

[TABLE]

Next, we bound the constraint violation by (22). Noticing that $\eta_{k}=\frac{8L}{N}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\}$ is a constant, it immediately follows from (22) that

[TABLE]

∎

5.2 CSA with Fixed Sampling

In this subsection we develop the proofs for our fixed sampling scheme. At each iteration $k$ , $x_{k}$ is fixed, and we face the cut generation Problem (2) which can be written in epigraph form (where the index $k$ is omitted):

[TABLE]

We repeat the definition of uniform level-set bound (ULB) from [12] as follows.

Definition 5.8.

[12, Definition 3.1] For fixed $x\in\mathcal{X}$ , the tail probability of the worst-case violation is the function $p:\mathbb{R}_{+}\rightarrow[0,1]$ defined by $p(\alpha):=Q\{\delta\in\Delta:g(x,\delta)>G(x)-\alpha\}$ . We call $h:[0,1]\rightarrow\mathbb{R}_{+}$ a uniform level-set bound (ULB) of $p$ if for all $\varepsilon\in[0,1]$ , $h(\varepsilon)\geq\sup\{\kappa\in\mathbb{R}_{+}:p(\kappa)\leq\varepsilon\}$ .

Let $\delta^{(1)},\delta^{(2)},\ldots,\delta^{(M)}$ be i.i.d. samples generated according to a probability distribution $Q$ . The sampled problem derived from Problem (23) is

[TABLE]

which is equivalent to $\max_{i=1,2,\ldots,M}g(x,\delta^{(i)})$ .

Let $\widehat{g}_{M}(x)$ be the unique solution of Problem (24). This optimal solution $\widehat{g}_{M}(x)$ is a random variable that depends on the samples $\delta^{(1)},\delta^{(2)},\ldots,\delta^{(M)}$ . As a direct application of Theorem 3.6 in [12], we have the following key result.

Proposition 5.9.

Consider the Problems (23) and (24) for fixed $x\in\mathcal{X}$ with the associated optimal values $G(x)$ and $\widehat{g}_{M}(x)$ , respectively. Given a ULB $h$ and $\varepsilon$ , $\beta$ in $[0,1]$ , for all $M\geq M(\varepsilon,\beta)$ , we have $Q^{M}\{G(x)-\widehat{g}_{M}(x)\in[0,h(\varepsilon)]\}\geq 1-\beta$ .

From Proposition 5.9, we see that for fixed $x\in\mathcal{X}$ the gap between $\widehat{g}_{M}(x)$ and $G(x)$ is effectively quantified by a ULB $h(\varepsilon)$ . To control the behavior of $h(\varepsilon)$ as $\varepsilon\rightarrow 0$ , we require more structure on the probability distribution $Q$ on $\Delta$ , which is imposed in Assumption 4.1. The next result is based on Assumption 4.1.

Proposition 5.10.

[12, Proposition 3.8] Under Assumption 4.1, the function $h(\varepsilon):=L_{g,\Delta}\varphi^{-1}(\varepsilon)$ is a ULB, where $\varphi^{-1}$ is the inverse of $\varphi$ .

From Propositions 5.9 and 5.10, we obtain the following bound in probability.

Proposition 5.11.

Suppose Assumption 4.1 holds. Given $\epsilon>0$ and $\beta\in(0,1)$ , for $M\geq M(\varphi(\frac{\epsilon}{L_{g,\Delta}}),\beta)$ i.i.d. samples from $Q$ , we have $Q^{M}\{G(x)-\max_{1\leq i\leq M}g(x,\delta^{(i)})\leq\epsilon\}\geq 1-\beta$ .
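The quantity controlled by this proposition, the gap $G(x)-\max_{1\leq i\leq M}g(x,\delta^{(i)})$, is easy to simulate. The sketch below is an illustration under assumed choices that are not taken from the paper: $g(x,\delta)=\langle\delta,x\rangle$ with $\Delta$ the unit disc (so $G(x)=\|x\|$), $Q$ uniform on the disc, and batch sizes that mirror those in Table 1. It reuses one sample stream, so the reported gaps are nested and can only shrink as $M$ grows.

```python
import math
import random

def sampled_gaps(x, batch_sizes, seed=0):
    """Gap G(x) - max_i g(x, delta_i) for g(x, delta) = <delta, x> on the unit disc."""
    rng = random.Random(seed)
    G = math.hypot(x[0], x[1])   # exact maximum of <delta, x> over the unit disc
    best, drawn, gaps = -math.inf, 0, []
    for M in sorted(batch_sizes):
        while drawn < M:         # extend the same sample stream up to M draws
            radius, angle = math.sqrt(rng.random()), 2.0 * math.pi * rng.random()
            delta = (radius * math.cos(angle), radius * math.sin(angle))
            best = max(best, delta[0] * x[0] + delta[1] * x[1])
            drawn += 1
        gaps.append(G - best)
    return gaps

gaps = sampled_gaps(x=(0.6, -0.8), batch_sizes=[10, 20, 50, 100])
print(gaps)
```

A single run only shows one realization; the proposition is a statement about the probability, over the $M$ samples, that this gap is at most $\epsilon$.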

Now we can estimate the empirical constraint violation for the fixed sampling scheme.

Proof of Proposition 4.2.

From Proposition 5.11, we have

[TABLE]

Therefore, we have

[TABLE]

∎

Next, we give the proof for Theorem 4.3 (for the generally convex case under the fixed sampling scheme). The proof uses Proposition 4.2 to control the error terms in our general inexact CSA analysis.

Proof of Theorem 4.3.

From Proposition 4.2, we have $\mathbb{E}_{Q^{M_{k}}}\left[\varepsilon_{k}\right]\leq\frac{(L_{f}+L_{g,\mathcal{X}})D_{\mathcal{X}}}{\sqrt{k}}$ for all $k\geq 1$ . For $k\in\mathcal{B}$ with $k>\frac{N}{2}$ , we have $\mathbb{E}_{Q^{M_{k}}}\left[\frac{\varepsilon_{k}}{\sqrt{k}}\right]\leq\frac{2D_{\mathcal{X}}(L_{f}+L_{g,\mathcal{X}})}{N}$ . Moreover, $\frac{1}{\sqrt{k}}\geq\frac{1}{\sqrt{N}}$ . Thus, from independence of samples, we have

[TABLE]

Subsequently, Theorem 3.2 gives $\mathbb{E}_{\mathcal{Q}}\left[G(\overline{x}_{N,s})\right]\leq 14D_{\mathcal{X}}(L_{f}+L_{g,\mathcal{X}})/\sqrt{N}$ . ∎

The proof of Theorem 4.4 (for the strongly convex case) is as follows.

Proof of Theorem 4.4.

From Proposition 4.2, we have $\mathbb{E}_{Q^{M_{k}}}\left[\varepsilon_{k}\right]\leq\frac{L}{N}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\}$ . It follows that

[TABLE]

Therefore, from Theorem 3.6, we arrive at the inequality $\mathbb{E}_{\mathcal{Q}}\left[G(\overline{x}_{N,s})\right]\leq 9L\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\}/N$ . ∎

5.3 CSA with Adaptive Sampling

This subsection considers the adaptive sampling scheme. First, we need to prove two prerequisite Lemmas 4.6 and 4.7. Lemma 4.6 establishes an equivalence between the nonlinear finite-dimensional optimization problem $\max_{\delta\in\Delta}g(x,\delta)$ and an infinite-dimensional linear optimization problem $\max_{\phi\in\mathcal{P}(\Delta)}\mathbb{E}_{\widetilde{\delta}\sim\phi}\left[g\left(x,\widetilde{\delta}\right)\right]$ .

Proof of Lemma 4.6.

The existence of a maximizer $\delta^{\ast}(x)\in\arg\max_{\delta\in\Delta}g(x,\,\delta)$ is guaranteed by Assumptions A3 and A5. On the one hand, for any $\phi\in\mathcal{P}(\Delta)$ ,

[TABLE]

Since $\phi$ is arbitrary, we have $\max_{\delta\in\Delta}g(x,\delta)\geq\max_{\phi\in\mathcal{P}(\Delta)}\mathbb{E}_{\widetilde{\delta}\sim\phi}\left[g\left(x,\widetilde{\delta}\right)\right]$ . On the other hand, we can put all mass of $\phi$ on $\delta^{\ast}(x)$ , i.e., the Dirac measure $\phi=\delta_{\delta^{\ast}(x)}$ , thus $\mathbb{E}_{\widetilde{\delta}\sim\delta_{\delta^{\ast}(x)}}\left[g\left(x,\,\widetilde{\delta}\right)\right]=\max_{\delta\in\Delta}g(x,\delta)$ , which implies $\max_{\delta\in\Delta}g(x,\delta)\leq\max_{\phi\in\mathcal{P}(\Delta)}\mathbb{E}_{\widetilde{\delta}\sim\phi}\left[g\left(x,\widetilde{\delta}\right)\right]$ . ∎

Lemma 4.7 justifies the existence of a solution of the regularized cut generation Problem (6), and provides a closed form expression.

Proof of Lemma 4.7.

By Theorem 15.11 in [1], $\mathcal{P}(\Delta)$ is compact in the weak-star topology since $\Delta$ is compact. Further, the mapping $\phi\rightarrow\mathbb{E}_{\widetilde{\delta}\sim\phi}\left[g\left(x,\,\widetilde{\delta}\right)\right]$ is continuous with respect to the weak-star topology in $\mathcal{P}(\Delta)$ from Assumption A5, the mapping $\phi\rightarrow D\left(\phi,\,\phi_{u}\right)$ is lower semi-continuous with respect to the weak-star topology in $\mathcal{P}(\Delta)$ by invoking Theorem 5.27 in [14], and so $\phi\rightarrow\mathbb{E}_{\widetilde{\delta}\sim\phi}\left[g\left(x,\,\widetilde{\delta}\right)\right]-\kappa\,D\left(\phi,\,\phi_{u}\right)$ is upper semi-continuous in $\phi\in\mathcal{P}(\Delta)$ with respect to the weak-star topology. Therefore, the maximizer of $\mathbb{E}_{\widetilde{\delta}\sim\phi}\left[g\left(x,\,\widetilde{\delta}\right)\right]-\kappa\,D\left(\phi,\,\phi_{u}\right)$ is attained in $\phi\in\mathcal{P}(\Delta)$ .

Let $\mathcal{M}_{+}\left(\Delta\right)$ denote the space of non-negative measures on $\Delta$ . We note that the regularized cut generation Problem (6) is a constrained calculus of variations problem:

[TABLE]

By using Euler’s equation in the calculus of variations (see Section 7.5 in [36]), we obtain after simplification,

[TABLE]

where $C=\upsilon-\kappa\log(p_{u})-\kappa$ and $\upsilon$ is the Lagrange multiplier of the constraint $\int_{\Delta}\phi(\delta)d\delta=1$ . From (25) and the constraint $\int_{\Delta}\phi(\delta)d\delta=1$ , we obtain the expression

[TABLE]

∎

The following lemma is an intermediate result, where we use the Assumption 4.5 that $\Delta$ is full dimensional and convex. It is used in the proof of Proposition 4.8, which paves the way for the cut generation result for the adaptive sampling scheme. Recall that $\Gamma(\cdot)$ is the gamma function.

Lemma 5.12.

Suppose Assumption 4.5 holds. For any $\kappa\in(0,1]$ and $x\in\mathcal{X},$ we have

[TABLE]

Proof.

First, we have

[TABLE]

where $\delta^{\ast}(x)\in\arg\max_{\delta\in\Delta}g(x,\,\delta)$ , and the last inequality follows since $\max_{\delta\in\Delta}g(x,\,\delta)-g(x,\delta)=g(x,\delta^{\ast}(x))-g(x,\delta)\leq L_{g,\Delta}\left\|\delta^{\ast}(x)-\delta\right\|$ due to Assumption A5. It is then sufficient to show

[TABLE]

Let $\delta_{\kappa}:=\kappa\delta_{0}+(1-\kappa)\delta^{\ast}(x)$ . Since $\Delta$ is convex by Assumption 4.5, we deduce

[TABLE]

which implies that, for any $\delta\in B_{\kappa R_{\Delta}}(\delta_{\kappa}),$ there exists $\delta^{\prime}\in B_{R_{\Delta}}(\delta_{0})$ such that $\delta=\kappa\delta^{\prime}+(1-\kappa)\delta^{\ast}(x)$ . Then, for any $\delta\in B_{\kappa R_{\Delta}}(\delta_{\kappa}),$ we have

[TABLE]

Therefore,

[TABLE]

where the first inequality is by (28) and the second is by (29), and the equality follows since $\int_{B_{\kappa R_{\Delta}}(\delta_{\kappa})}1d\delta=\frac{\pi^{d/2}}{\Gamma(d/2+1)}(\kappa R_{\Delta})^{d}$ is the volume of the Euclidean ball with radius $\kappa R_{\Delta}$ in $\mathbb{R}^{d}$ . ∎

Now we are in a position to establish Proposition 4.8.

Proof of Proposition 4.8.

By replacing (26) in $\mathbb{E}_{\widetilde{\delta}\sim\phi}\left[g\left(x,\,\widetilde{\delta}\right)\right]-\kappa\,D\left(\phi,\,\phi_{u}\right)$ , we obtain after simplification,

[TABLE]

Applying (27) to bound the term $\log\left(\int_{\Delta}\exp\left(g(x,\delta)/\kappa\right)d\delta\right)$ in the right hand side of (30), we obtain

[TABLE]

Since $\kappa=\min\left\{\frac{\epsilon}{2C},\left(\frac{\epsilon}{2d}\right)^{2},1\right\}$ , we have

[TABLE]

where the first inequality holds since $\kappa\leq\frac{\epsilon}{2C}$ , the second holds because $\log(\kappa)\geq-\frac{1}{\sqrt{\kappa}}$ , and the last one follows from $\kappa\leq\left(\frac{\epsilon}{2d}\right)^{2}$ . Therefore, we have $\max_{\phi\in\mathcal{P}(\Delta)}\left\{\mathbb{E}_{\widetilde{\delta}\sim\phi}\left[g\left(x,\,\widetilde{\delta}\right)\right]-\kappa\,D\left(\phi,\,\phi_{u}\right)\right\}\geq G(x)-\epsilon$ . Moreover, since $\phi_{\kappa,\,x}$ solves the regularized cut generation Problem (6), and since the regularization parameter $\kappa$ and the Kullback-Leibler divergence $D\left(\phi,\,\phi_{u}\right)$ are non-negative, we arrive at the conclusion. ∎

The bound for cut generation under the adaptive sampling scheme (Proposition 4.9) is an immediate result from Proposition 4.8.

Proof of Proposition 4.9.

Since $\delta^{(1)},\delta^{(2)},\ldots,\delta^{(M)}$ are i.i.d. samples from the probability density $\phi_{\kappa(\epsilon),\,x}$ , Proposition 4.8 gives $\mathbb{E}_{\delta^{(i)}\sim\phi_{\kappa(\epsilon),\,x}}\left[g\left(x,\delta^{(i)}\right)\right]\geq G(x)-\epsilon$ for $i=1,2,\ldots,M$ . Because $\max_{1\leq i\leq M}g(x,\delta^{(i)})\geq g(x,\delta^{(1)})$ holds for every realization of the samples, taking expectations yields, for any $M\geq 1$ , $\mathbb{E}_{\phi_{\kappa(\epsilon),\,x}^{M}}\left[\max_{1\leq i\leq M}g(x,\delta^{(i)})\right]\geq G(x)-\epsilon$ . ∎

We now prove our main result Theorem 4.10 (for the generally convex case) under the adaptive sampling scheme. We need to use Proposition 4.9 to control the error terms.

Proof of Theorem 4.10.

From Proposition 4.9, we have $\mathbb{E}_{\phi_{\kappa(\epsilon_{k}),\,x_{k}}^{M_{k}}}[\varepsilon_{k}]\leq\frac{(L_{f}+L_{g,\mathcal{X}})D_{\mathcal{X}}}{\sqrt{k}}$ . Furthermore, $\frac{N}{2}\leq k\leq N$ for $k\in\mathcal{B}$ , and by independence of samples we have

[TABLE]

Subsequently, Theorem 3.2 gives $\mathbb{E}_{\mathcal{P}}\left[G(\overline{x}_{N,s})\right]\leq 14D_{\mathcal{X}}(L_{f}+L_{g,\mathcal{X}})/\sqrt{N}$ . ∎

The proof of Theorem 4.11 (for the strongly convex case) under the adaptive sampling scheme is as follows.

Proof of Theorem 4.11.

From Proposition 4.9, we have $\mathbb{E}_{\phi_{\kappa(\epsilon_{k}),\,x_{k}}^{M_{k}}}[\varepsilon_{k}]\leq\frac{L}{N}\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\}$ . It follows that

[TABLE]

Therefore, applying Theorem 3.6 gives $\mathbb{E}_{\mathcal{P}}\left[G(\overline{x}_{N,s})\right]\leq 9L\max\left\{\mu_{f},\mu_{g}\right\}\max\left\{\frac{L_{f}^{2}}{\mu_{f}^{2}},\frac{L_{g,\mathcal{X}}^{2}}{\mu_{g}^{2}}\right\}/N$ . ∎

6 Numerical Experiments

This section applies our methods to a simple test problem adapted from [5] to illustrate the theory developed in this paper. Let $\delta_{i}\in\mathbb{R}^{2}$ for all $i=1,\ldots,4$ denote uncertain parameters such that $\left\|\delta_{i}\right\|\leq 1$ . We want to solve the following optimization problem:

[TABLE]

where

[TABLE]

We compare Algorithm 1 with the fixed constraint sampling and the adaptive constraint sampling schemes, respectively. The parameters in policy (4) are inherently conservative. In this experiment, we adjust the parameters $\gamma_{k}$ and $\eta_{k}$ by multiplying them with scaling parameters $c_{g}$ and $c_{e}$ , respectively. These scaling parameters are chosen by doing pilot runs (see [28]). Under fixed constraint sampling, we set $M_{k}$ to be constant in all iterations, and we consider $M_{k}\in\left\{10,20,50,100\right\}$ . Under adaptive constraint sampling, we generate the probability distribution $\phi_{\kappa,x}$ by the Metropolis-Hastings (MH) algorithm (see e.g. [46]), where we run the MH algorithm for 200 iterations and then take one sample to solve the cut generation problem.
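The sampling step under the adaptive scheme can be sketched as follows. This is a minimal random-walk Metropolis sampler, assuming the target density is proportional to $\exp(g(x,\delta)/\kappa)$ on a unit disc. The Gaussian proposal, its step size `step`, the starting point, and the toy function $g(x,\delta)=\langle\delta,x\rangle$ are illustrative assumptions; the paper does not specify these details of its MH implementation beyond the 200 iterations.

```python
import math
import random

def mh_sample(g, kappa, n_iter=200, step=0.2, seed=0):
    """One draw, after n_iter random-walk Metropolis steps, targeting a density
    proportional to exp(g(delta) / kappa) on the unit disc."""
    rng = random.Random(seed)
    delta = (0.0, 0.0)
    for _ in range(n_iter):
        proposal = (delta[0] + rng.gauss(0.0, step), delta[1] + rng.gauss(0.0, step))
        if math.hypot(*proposal) > 1.0:
            continue  # outside Delta: the target density is zero, so reject
        # Symmetric proposal, so the acceptance ratio is the ratio of target densities.
        log_ratio = (g(proposal) - g(delta)) / kappa
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            delta = proposal
    return delta

x = (1.0, 0.0)
samples = [mh_sample(lambda d: d[0] * x[0] + d[1] * x[1], kappa=0.05, seed=i)
           for i in range(50)]
mean_value = sum(d[0] for d in samples) / len(samples)
print(mean_value)  # compare with G(x) = 1 for this x
```

Each call returns a single sample, matching the one-sample-per-iteration usage described above; the cost of the chain is the extra effort that adaptive sampling trades for the smaller batch size.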

Table 1 reports the results. As we can see, even though we only generate one sample in each iteration under the adaptive sampling scheme, the objective value achieved is $-1.560$ , which is close to the true optimal value $-1.559$ . Figure 1(a) illustrates the convergence of the algorithms and Figure 1(b) shows the constraint violation under different sampling schemes. In particular, we note that under the fixed sampling scheme, the constraint violation decreases as the sample size increases. Note that we scale the parameters $\gamma_{k}$ and $\eta_{k}$ in policy (4) in the experiment, which may result in the failure of Lemma 3.1. We see from Figure 2, with the parameter adjustment, that $\mathcal{B}\neq\emptyset$ and $|\mathcal{B}|$ is at least linearly increasing in $N$ , so that our theoretical analysis is still valid in this case (which depends on this property of $\mathcal{B}$ ).

We generate the probability distribution $\phi_{\kappa,x}$ by the MH algorithm, and perform sensitivity analysis on the number of iterations of MH. We provide the associated objective values $f\left(\overline{x}_{N,s}\right)$ by fixing $N=10^{3}$ . We can see from Figure 3 that the adaptive sampling scheme achieves a high-performance solution (with relative gap smaller than $0.1\%$ ) when the MH algorithm runs for 200 iterations.

From these experiments, we observe the inherent trade-off between the two sampling schemes. Under fixed sampling, a single sampling distribution is used across all iterations, but we need to generate batch samples to achieve good performance. In contrast, under adaptive sampling, we need extra effort to generate samples, but only need one sample at each iteration.

7 Conclusion

In this work, we combine CSA (as originally developed in [29]) with inexact cut generation to solve SIPs. Since the cut generation problem is typically intractable, we emphasize random constraint sampling to approximately solve this problem. In our first approach, we rely on a fixed constraint sampling distribution. Our second approach adaptively updates the constraint sampling distribution, based on the current iterate. The major advantage of adaptive over fixed sampling is that, theoretically, it only requires one sample at each iteration.

As our main contribution, we provide general error bounds (in terms of the error in solving each cut generation problem) for inexact CSA. We show that both our sampling schemes achieve an $\mathcal{O}(1/\sqrt{N})$ rate of convergence in expectation, in terms of both optimality gap and constraint violation, when the objective and constraint functions are generally convex. We also improve this rate to $\mathcal{O}(1/N)$ in the strongly convex case.

Bibliography

[1] C. D. Aliprantis and K. C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. 2006.
[2] Bruno Betrò. An accelerated central cutting plane algorithm for linear semi-infinite programming. Mathematical Programming, 101(3):479–495, 2004.
[3] Nikhil Bhat, Vivek Farias, and Ciamac C. Moallemi. Non-parametric approximate dynamic programming via the kernel method. In Advances in Neural Information Processing Systems, pages 386–394, 2012.
[4] J. Frédéric Bonnans and Alexander Shapiro. Perturbation Analysis of Optimization Problems. Springer Science & Business Media, 2013.
[5] Giuseppe Calafiore and M. C. Campi. Uncertain convex programs: randomized solutions and confidence levels. Mathematical Programming Series A, 102:25–46, 2005.
[6] Marco C. Campi and Simone Garatti. The exact feasibility of randomized solutions of uncertain convex programs. SIAM Journal on Optimization, 19(3):1211–1230, 2008.
[7] Darinka Dentcheva and Andrzej Ruszczyński. Optimization with stochastic dominance constraints. SIAM Journal on Optimization, 14(2):548–566, 2003.
[8] Darinka Dentcheva and Andrzej Ruszczyński. Optimality and duality theory for stochastic optimization problems with nonlinear dominance constraints. Mathematical Programming, 99(2):329–350, 2004.
