A Data Efficient and Feasible Level Set Method for Stochastic Convex Optimization with Expectation Constraints

Qihang Lin; Selvaprabu Nadarajah; Negar Soheili; Tianbao Yang

arXiv:1908.03077·math.OC·January 3, 2020


TL;DR

This paper introduces a stochastic feasible level set method (SFLS) for solving large-scale stochastic convex optimization problems with expectation constraints, emphasizing early feasibility and low data complexity.

Contribution

The paper develops a novel SFLS algorithm that maintains feasibility at each step and improves data efficiency over existing methods for stochastic convex optimization with constraints.

Findings

01 SFLS achieves high-probability feasibility at each iteration.

02 SFLS demonstrates lower data complexity than existing methods.

03 Numerical results show faster feasible solution attainment with small optimality gaps.


Tables

Table 1: Characteristics of multi-class classification datasets from the LIBSVM library

Dataset	Number of classes	Number of instances	Number of features
connect-4	3	67557	126
covtype	7	581012	54
news20	20	15935	62061

Equations

$$f^{*}:=\min_{\mathbf{x}\in\mathcal{X}}\left\{f_{0}(\mathbf{x})=\mathbb{E}[F_{0}(\mathbf{x},\xi_{0})]\right\}\quad\text{s.t.}\quad f_{i}(\mathbf{x}):=\mathbb{E}[F_{i}(\mathbf{x},\xi_{i})]\leq r_{i},\quad i=1,2,\dots,m,$$

$$H(r):=\min_{\mathbf{x}\in\mathcal{X}}\mathcal{P}(r,\mathbf{x})$$

$$\mathcal{P}(r,\mathbf{x}):=\max\{f_{0}(\mathbf{x})-r,\ f_{1}(\mathbf{x})-r_{1},\ \dots,\ f_{m}(\mathbf{x})-r_{m}\}.$$

$$\beta:=-\frac{H(r^{(0)})}{r^{(0)}-f^{*}}\in(0,1].$$

$$\frac{2\theta^{2}}{\beta}\ln\left(\frac{\theta^{2}}{\beta\epsilon}\right)$$

$$H(r)=\min_{\mathbf{x}\in\mathcal{X}}\max_{\mathbf{y}\in\mathcal{Y}}\left\{\sum_{i=0}^{m}y_{i}\left(f_{i}(\mathbf{x})-r_{i}\right)\right\},$$

$$H(r)=\min_{\mathbf{x}\in\mathcal{X}}\max_{\mathbf{y}\in\mathcal{Y}}\phi(\mathbf{x},\mathbf{y}).$$

$$G(\mathbf{x},\mathbf{y},\bm{\xi}):=\left[\begin{array}{c}G_{x}(\mathbf{x},\mathbf{y},\bm{\xi})\\ -G_{y}(\mathbf{x},\mathbf{y},\bm{\xi})\end{array}\right]:=\left[\begin{array}{c}\sum_{i=0}^{m}y_{i}F^{\prime}_{i}(\mathbf{x},\xi_{i})\\ -(F_{0}(\mathbf{x},\xi_{0})-r_{0},F_{1}(\mathbf{x},\xi_{1})-r_{1},\dots,F_{m}(\mathbf{x},\xi_{m})-r_{m})^{\top}\end{array}\right].$$

$$V(\mathbf{z}',\mathbf{z}):=\omega_{z}(\mathbf{z})-\left[\omega_{z}(\mathbf{z}')+\nabla\omega_{z}(\mathbf{z}')^{\top}(\mathbf{z}-\mathbf{z}')\right].$$

$$\bar{\mathbf{z}}^{(t)}:=(\bar{\mathbf{x}}^{(t)},\bar{\mathbf{y}}^{(t)}):=\frac{\sum_{s=0}^{t}\gamma_{s}\mathbf{z}^{(s)}}{\sum_{s=0}^{t}\gamma_{s}},$$

$$\mathbf{z}^{(t+1)}:=(\mathbf{x}^{(t+1)},\mathbf{y}^{(t+1)}):=P_{\mathbf{z}^{(t)}}\left(\gamma_{t}G(\mathbf{x}^{(t)},\mathbf{y}^{(t)},\bm{\xi}^{(t)})\right).$$

$$\omega_{z}(\mathbf{z}):=\frac{\omega_{x}(\mathbf{x})}{2D_{x}^{2}}+\frac{\omega_{y}(\mathbf{y})}{2D_{y}^{2}}.$$

$$g(\mathbf{x},\mathbf{y})\in\left[\begin{array}{c}\partial_{x}\phi(\mathbf{x},\mathbf{y})\\ \partial_{y}[-\phi(\mathbf{x},\mathbf{y})]\end{array}\right],$$

$$\mathbb{E}\left[\exp\left(\|G_{x}(\mathbf{x},\mathbf{y},\bm{\xi})\|_{*,x}^{2}/M_{x}^{2}\right)\right]$$

$$\mathbb{E}\left[\exp\left(\|G_{y}(\mathbf{x},\mathbf{y},\bm{\xi})\|_{*,y}^{2}/M_{y}^{2}\right)\right]$$

$$\mathbb{E}\left[\exp\left(|\Phi(\mathbf{x},\mathbf{y},\bm{\xi})-\phi(\mathbf{x},\mathbf{y})|^{2}/Q^{2}\right)\right]$$

$$M$$

$$\Omega(\delta)$$

$$W(\delta,\epsilon_{\mathcal{A}}):=\max\left\{6,\left(\frac{8(10M\Omega(\delta)+4.5M)}{\epsilon_{\mathcal{A}}}\ln\left(\frac{4(10M\Omega(\delta)+4.5M)}{\epsilon_{\mathcal{A}}}\right)\right)^{2}-2\right\}$$

$$u_{*}^{(t)}:=\max_{\mathbf{y}\in\mathcal{Y}}\left\{\frac{1}{\sum_{s=0}^{t}\gamma_{s}}\sum_{s=0}^{t}\gamma_{s}\left[\phi(\mathbf{x}^{(s)},\mathbf{y}^{(s)})+g_{y}(\mathbf{x}^{(s)},\mathbf{y}^{(s)})^{\top}(\mathbf{y}-\mathbf{y}^{(s)})\right]\right\}.$$

$$\frac{1}{\sum_{s=0}^{t}\gamma_{s}}\sum_{s=0}^{t}\gamma_{s}\left[\phi(\mathbf{x}^{(s)},\mathbf{y}^{(s)})+g_{y}(\mathbf{x}^{(s)},\mathbf{y}^{(s)})^{\top}(\mathbf{y}-\mathbf{y}^{(s)})\right]\geq\frac{\sum_{s=0}^{t}\gamma_{s}\phi(\mathbf{x}^{(s)},\mathbf{y})}{\sum_{s=0}^{t}\gamma_{s}}\geq\phi(\bar{\mathbf{x}}^{(t)},\mathbf{y}),$$

$$u_{*}^{(t)}\geq U(\bar{\mathbf{x}}^{(t)})=\max_{\mathbf{y}\in\mathcal{Y}}\phi(\bar{\mathbf{x}}^{(t)},\mathbf{y})\geq H(r),$$

$$\hat{u}_{*}^{(t)}:=\max_{\mathbf{y}\in\mathcal{Y}}\left\{\frac{1}{\sum_{s=0}^{t}\gamma_{s}}\sum_{s=0}^{t}\gamma_{s}\left[\Phi(\mathbf{x}^{(s)},\mathbf{y}^{(s)},\bm{\xi}^{(s)})+G_{y}(\mathbf{x}^{(s)},\mathbf{y}^{(s)},\bm{\xi}^{(s)})^{\top}(\mathbf{y}-\mathbf{y}^{(s)})\right]\right\}.$$

$$l_{*}^{(t)}:=\min_{\mathbf{x}\in\mathcal{X}}\left\{\frac{1}{\sum_{s=0}^{t}\gamma_{s}}\sum_{s=0}^{t}\gamma_{s}\left[\phi(\mathbf{x}^{(s)},\mathbf{y}^{(s)})+g_{x}(\mathbf{x}^{(s)},\mathbf{y}^{(s)})^{\top}(\mathbf{x}-\mathbf{x}^{(s)})\right]\right\}.$$

$$\hat{l}_{*}^{(t)}:=\min_{\mathbf{x}\in\mathcal{X}}\left\{\frac{1}{\sum_{s=0}^{t}\gamma_{s}}\sum_{s=0}^{t}\gamma_{s}\left[\Phi(\mathbf{x}^{(s)},\mathbf{y}^{(s)},\bm{\xi}^{(s)})+G_{x}(\mathbf{x}^{(s)},\mathbf{y}^{(s)},\bm{\xi}^{(s)})^{\top}(\mathbf{x}-\mathbf{x}^{(s)})\right]\right\}.$$

$$\max\left\{6,\left(\frac{8(10M\Omega(\delta)+4.5M)}{\epsilon_{\mathcal{A}}}\ln\left(\frac{4(10M\Omega(\delta)+4.5M)}{\epsilon_{\mathcal{A}}}\right)\right)^{2}-2\right\}$$

$$\max\left\{6,\left(\frac{8(Q\Omega(\delta)+8M\Omega(\delta)+2.5M)}{\epsilon_{\mathcal{A}}}\ln\left(\frac{4(Q\Omega(\delta)+8M\Omega(\delta)+2.5M)}{\epsilon_{\mathcal{A}}}\right)\right)^{2}-2\right\}$$

$$T(\delta,\epsilon_{\mathcal{A}}):=\max\left\{6,\left(\frac{16(Q\Omega(\delta)+10M\Omega(\delta)+4.5M)}{\epsilon_{\mathcal{A}}}\ln\left(\frac{8(Q\Omega(\delta)+10M\Omega(\delta)+4.5M)}{\epsilon_{\mathcal{A}}}\right)\right)^{2}-2\right\}$$

$$O\left(\frac{\theta^{2}}{\beta\epsilon^{2}}\cdot\ln\left(\frac{\theta^{2}}{\beta\epsilon}\right)\cdot\ln^{2}\left(\frac{1}{\delta}\right)\cdot\ln^{2}\left(\frac{1}{\epsilon}\right)\right)$$

$$O\left(\frac{\theta^{2}}{\beta}\ln\left(\frac{\theta^{2}}{(1-\theta)\beta\epsilon}\right)\right)$$


Full text

Manuscript no. MS-17-01263.R2

Running title: Stochastic Level-Set Method

Qihang Lin, Tippie College of Business, The University of Iowa, 21 East Market Street, Iowa City, IA 52242, USA

Selvaprabu Nadarajah, College of Business Administration, University of Illinois at Chicago, 601 South Morgan Street, Chicago, Illinois, 60607, USA

Negar Soheili, College of Business Administration, University of Illinois at Chicago, 601 South Morgan Street, Chicago, Illinois, 60607, USA

Tianbao Yang, Department of Computer Science, The University of Iowa, 21 East Market Street, Iowa City, IA 52242, USA

Abstract

Stochastic convex optimization problems with expectation constraints (SOECs) are encountered in statistics and machine learning, business, and engineering. In data-rich environments, the SOEC objective and constraints contain expectations defined with respect to large datasets. Therefore, efficient algorithms for solving such SOECs need to limit the fraction of data points that they use, which we refer to as algorithmic data complexity. Recent stochastic first order methods exhibit low data complexity when handling SOECs but guarantee near-feasibility and near-optimality only at convergence. These methods may thus return highly infeasible solutions when heuristically terminated, as is often the case, due to theoretical convergence criteria being highly conservative. This issue limits the use of first order methods in several applications where the SOEC constraints encode implementation requirements. We design a stochastic feasible level set method (SFLS) for SOECs that has low data complexity and emphasizes feasibility before convergence. Specifically, our level-set method solves a root-finding problem by calling a novel first order oracle that computes a stochastic upper bound on the level-set function by extending mirror descent and online validation techniques. We establish that SFLS maintains a high-probability feasible solution at each root-finding iteration and exhibits favorable iteration complexity compared to state-of-the-art deterministic feasible level set and stochastic subgradient methods. Numerical experiments on three diverse applications validate the low data complexity of SFLS relative to the former approach and highlight how SFLS finds feasible solutions with small optimality gaps significantly faster than the latter method.

1 Introduction

Consider the stochastic optimization problem with expectation constraints (SOEC)

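$$f^{*}:=\min_{\mathbf{x}\in\mathcal{X}}\left\{f_{0}(\mathbf{x})=\mathbb{E}[F_{0}(\mathbf{x},\xi_{0})]\right\}\quad\text{s.t.}\quad f_{i}(\mathbf{x}):=\mathbb{E}[F_{i}(\mathbf{x},\xi_{i})]\leq r_{i},\quad i=1,2,\dots,m,\tag{1}$$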

where $\mathcal{X}\subset\mathbb{R}^{d}$ is a nonempty closed convex set, $\xi_{i}$, $i=0,1,\ldots,m$, is a random vector whose probability distribution is supported on the set $\Xi_{i}\subseteq\mathbb{R}^{q_{i}}$, and $F_{i}({\mathbf{x}},\xi_{i}):\mathcal{X}\times\Xi_{i}\rightarrow\mathbb{R}$ is continuous and convex in ${\mathbf{x}}$ for each realization of $\xi_{i}$ for $i=0,1,2,\dots,m$. Given $\epsilon>0$, a solution ${\mathbf{x}}_{\epsilon}\in\mathcal{X}$ is called $\epsilon$-feasible if $\max_{i=1,\dots,m}\{f_{i}({\mathbf{x}}_{\epsilon})-r_{i}\}\leq\epsilon$. A solution ${\mathbf{x}}_{\epsilon}\in\mathcal{X}$ is referred to as $\epsilon$-optimal if $f_{0}({\mathbf{x}}_{\epsilon})-f^{*}\leq\epsilon$. Alternatively, optimality can be measured relative to an initial feasible solution ${\mathbf{x}}^{0}\in\mathcal{X}$. In this case, we say ${\mathbf{x}}_{\epsilon}\in\mathcal{X}$ is relative $\epsilon$-optimal with respect to ${\mathbf{x}}^{0}$ if $(f_{0}({\mathbf{x}}_{\epsilon})-f^{*})/(f_{0}({\mathbf{x}}^{0})-f^{*})\leq\epsilon$.

Problem (1) is pervasive in stochastic optimization and appears as a central challenge in semi-supervised learning (Chapelle et al. 2009), shape-restricted regression (Seijo et al. 2011, Sen and Meyer 2017, Lim 2014, Cotter et al. 2016, Fard et al. 2016), Neyman-Pearson classification (Tong et al. 2016, Rigollet and Tong 2011, Tong 2013, Zhao et al. 2015), approximate linear programming and related relaxations (de Farias and Van Roy 2003, Adelman and Mersereau 2013, Nadarajah et al. 2015), portfolio selection (Markowitz 1952, Abdelaziz et al. 2007), risk management (Rockafellar and Uryasev 2000), supply chain design (Azaron et al. 2008), and multi-objective stochastic programming (Marler and Arora 2004, Abdelaziz 2012, Mahdavi et al. 2013, Barba-Gonzaléz et al. 2017). In this paper, we focus on overcoming the challenges of applying existing methods for solving SOECs in settings that are both data rich and where expectation constraints capture requirements that cannot be violated during real-world implementation.

In data-rich environments, each expectation appearing in (1) is defined by a data set containing a large number of data points (possibly infinite). The number of data points used when solving SOEC is an important computational bottleneck, which we refer to as the data complexity of an algorithm. Traditional approaches for solving SOECs can lead to large data complexity. For instance, consider the popular strategy of replacing each expectation in (1) by a sample average approximation (SAA; Shapiro 2013, Oliveira and Thompson 2017) and solving the resulting model using a deterministic iterative method (see, e.g., Nesterov 2004, Soheili and Pena 2012, and references therein). If the number of samples used to construct SAAs is small, the solution from the deterministic approximation may be highly infeasible to the original SOEC, in addition to being suboptimal (Shapiro 2013, Oliveira and Thompson 2017). Instead, if a large number of samples are used in each SAA, then the data complexity becomes large because the gradient or objective function evaluation at each iteration requires using a significant portion of each of the data sets.

In contrast, stochastic first order methods for tackling stochastic optimization problems have low per-iteration cost and data complexity and thus play a central role in machine learning packages such as TensorFlow and PyTorch (Robbins and Monro 1951, Nemirovski et al. 2009, Lan 2012, Ghadimi and Lan 2012, 2013, Chen et al. 2012, Lan et al. 2012, Schmidt et al. 2013, Shalev-Shwartz et al. 2017, Lan and Zhou 2015, Lin et al. 2014, Duchi and Singer 2009, Xiao and Zhang 2014, Xiao 2010, Hazan and Kale 2011, Bach and Moulines 2013, Allen-Zhu 2017, Goldfarb et al. 2017). These methods update solutions using stochastic gradients that can be computed using a small number of sampled data points. Stochastic first order methods typically ensure feasibility via projections onto a convex set at each iteration, where the convex set is assumed to be simple (e.g., a box or ball) for computational tractability. This assumption limits the applicability of first order methods for solving SOECs with general non-linear constraints. Recently, Lan and Zhou (2016) and Yu et al. (2017) developed stochastic subgradient (SSG) methods devoid of projections for solving (1) with a single constraint ($m=1$) and multiple constraints ($m>1$), respectively. The SSG methods in these papers guarantee an $\epsilon$-optimal and $\epsilon$-feasible solution only at convergence.

In practice, SSG methods are terminated before their conservative theoretical conditions are met. Premature termination may lead to highly infeasible and sub- or super-optimal solutions. While some deviation from optimality is likely acceptable, a highly infeasible solution may not be implementable. Such situations arise in several data science applications in machine learning, as well as across business (e.g., operations and finance) and engineering domains. We elaborate on the practical need for feasibility in a few cases below.

• Fairness constraints: Enforcing fairness criteria when learning classifiers across multiple classes (e.g., male and female) has become important in machine learning (Goh et al. 2016). This learning problem can be cast as an SOEC where fairness is modeled via expectation constraints. Constraint violations lead to classifiers that are biased towards one or more classes.

• Risk constraints: Planning problems in supply chain management and portfolio optimization often include bounds on the Conditional Value at Risk (CVaR), which can be cast as expectation constraints (Fábián 2008, Chen et al. 2010). Such constraints also arise when modeling distributionally robust versions of chance constraints (Wiesemann et al. 2014) and when limiting misclassification risk (i.e., misclassification rates) in multi-class Neyman-Pearson classification (Weston and Watkins 1998, Crammer and Singer 2002). The aforementioned problems can be formulated as SOECs. Solutions violating risk constraints will likely fail stress tests that are performed before implementation.

• Bounding property: Approximate linear programs (ALPs) are well-known models for approximating the value function of high-dimensional Markov decision processes (Schweitzer and Seidmann 1985, de Farias and Van Roy 2003), and in particular, are SOECs. A solution satisfying the ALP constraints provides an optimistic bound on the optimal policy value, which is useful to evaluate the suboptimality of heuristic policies. Infeasibility in an ALP setting thus voids this desirable bounding property.

Motivated by the importance of feasibility and the status quo of stochastic first order methods, we design an approach for solving SOECs that has low data complexity and provides high probability feasible solutions before convergence. As a first step, we cast SOEC as a root-finding problem involving a min-max level set function, which is challenging to solve because it is non-smooth and includes high-dimensional expectations in the SOEC objective and constraints. To solve this reformulation, we develop a stochastic feasible level-set method (SFLS) for root finding that requires evaluating a “good” upper bound (we will make this notion of goodness precise in later sections) on the challenging level set function at each iteration. We show that employing the mirror descent method (Nemirovski et al. 2009) for computing such an upper bound requires approximating expectations in SOEC using SAAs at each iteration, which, as already discussed above, leads to high data complexity. To overcome this issue, we introduce an SSG method to upper bound the level-set function by combining mirror descent and online validation techniques, and in particular, extending the latter technique, originally proposed for minimization problems (Lan et al. 2012), to handle saddle point formulations. This method only requires stochastic values and gradients of the objective and constraint functions, respectively, which can be constructed at low cost using a small number of samples of $\xi_{i}$ in (1); that is, it has low data complexity. Calls to our SSG method return high-probability feasible solutions, which allows it to maintain an implementable solution at each root-finding iteration.

We analyze the iteration complexity of SFLS to find a feasible solution path (i.e., sequence of feasible solutions) that becomes relative $\epsilon$-optimal with high probability. It is encouraging that the dependence of this complexity on $\epsilon$ is $1/\epsilon^{2}$, which is comparable to the method by Yu et al. (2017) (labeled YNW, an abbreviation formed from the first letters of the authors' last names) that also finds an $\epsilon$-optimal solution but only guarantees $\epsilon$-feasibility at convergence. In other words, the intermediate solutions generated by YNW are not necessarily feasible. There is indeed a cost for ensuring feasibility in SFLS, which appears in the form of its iteration complexity depending on a condition measure. Such condition measures do not influence the complexity of YNW.

For deterministic constrained convex optimization problems, the level-set method (DFLS) of Lin et al. (2018b) also guarantees a feasible solution path with its iteration complexity depending on a condition measure. In principle, these DFLS based approaches can be applied to solve SOECs by viewing them as deterministic problems. This perspective is restrictive because it entails computing expectations in $f_{i}$ for $i=0,1,\dots,m$ exactly or replacing them by SAAs. In either case, the data complexity of DFLS will be high for reasons analogous to the ones already discussed above related to the use of SAAs. Therefore, a fully stochastic approach is required to achieve low data complexity when solving SOECs. Lin et al. (2018a) extend DFLS using variance-reduced sampling, which requires the functions to have a finite-sum structure with each summand taking the specific form $\phi({\mathbf{x}}^{\top}\mathbf{\xi})$. Unfortunately, as a result, their method cannot be applied to SOECs with generic expectations, while our method has no such limitation and assumes little structure on the problems. We are not aware of prior efforts to develop fully stochastic versions of level set methods for SOECs; SFLS fills this gap.

To assess the performance of SFLS, we provide implementation guidelines with supporting theory and numerically evaluate SFLS on three applications: (i) approximate linear programming for Markov decision processes, (ii) Neyman-Pearson multi-class classification with risk constraints, and (iii) learning a classifier with fairness constraints. Feasibility plays a key role in each of these applications for reasons mentioned earlier in the introduction. Approximate linear programs in the first application are known special cases of SOECs. For the latter two applications, we propose formulations that are SOECs. As algorithmic benchmarks, we consider YNW and DFLS. We find that SFLS delivers feasible solutions quicker than YNW and in several cases also leads to smaller optimality gaps. Moreover, when YNW computes infeasible solutions it is challenging to interpret its objective value since it can be superoptimal, an issue that does not arise with SFLS. Both SFLS and DFLS maintain feasible solution paths (with outer iterates) but SFLS produces feasible solutions with much smaller optimality gaps due to its lower data complexity. In other words, DFLS requires significantly more data passes to reduce the suboptimality of its solutions and will thus not be practical for solving SOECs based on large data sets. Our findings underscore two important algorithmic insights: (i) feasible SOEC solutions can be computed well before theoretical convergence criteria are satisfied but doing this hinges on methods being able to emphasize feasibility; and (ii) ensuring that these early feasible solutions have small optimality gaps requires approaches with low data complexity. Both these properties are true for SFLS, while only the first and second properties, respectively, hold for DFLS and YNW.

This paper is organized as follows. In §2, we introduce SFLS, analyze its oracle complexity, and present a saddle-point reformulation of an SOEC. In §3, we discuss how the well-known stochastic mirror descent algorithm provides an idealized stochastic oracle for SFLS and highlight issues that complicate its use. In §4, we propose and analyze a new stochastic oracle to overcome these issues. In §5, we analyze SFLS combined with this oracle and provide implementation guidelines. In §6, we perform a computational study to understand the performance of SFLS across three applications relative to two benchmark methods. We conclude in §7.

2 Stochastic Feasible Level-set Method

Level-set methods tackle a constrained convex optimization problem by transforming it into a one-dimensional root-finding problem that is a function of a scalar level parameter $r$ (Lemaréchal et al. 1995, Nesterov 2004). We develop in this section a stochastic and feasible level set method that adds to this framework. We make the following standard assumption throughout the paper, which ensures that a strictly feasible and sub-optimal solution exists.

Assumption (Strict Feasibility)

There exists a strictly feasible solution $\tilde{\mathbf{x}}\in\mathcal{X}$ such that $\max_{i=1,\ldots,m}\{f_{i}(\tilde{\mathbf{x}})-r_{i}\}<0$ and $f_{0}(\tilde{\mathbf{x}})>f^{*}$.

The root-finding reformulation of (1) relies on the level-set function

$$H(r):=\min_{\mathbf{x}\in\mathcal{X}}\mathcal{P}(r,\mathbf{x}),\tag{2}$$

where $r\in\mathbb{R}$ is a level parameter and

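$$\mathcal{P}(r,\mathbf{x}):=\max\{f_{0}(\mathbf{x})-r,\ f_{1}(\mathbf{x})-r_{1},\ \dots,\ f_{m}(\mathbf{x})-r_{m}\}.$$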

Note that the expectation constraints of SOEC are now in the objective function of (2). For a given $(r,{\mathbf{x}})\in\mathbb{R}\times\mathcal{X}$, if $\mathcal{P}(r,{\mathbf{x}})\leq 0$ then ${\mathbf{x}}$ is a feasible solution to (1). Formulations (1) and (2) are further linked by known properties of $H(r)$, which are summarized in the following lemma (based on Lemmas 2.3.4 and 2.3.6 in Nesterov 2004 and Lemma 1 in Lin et al. 2018b).

Lemma 2.1

It holds that

(a) $H(r)$ is non-increasing and convex in $r$;

(b) $H(f^{*})=0$;

(c) $H(r)>0$ if $r<f^{*}$, and $H(r)<0$ if $r>f^{*}$.

Part (a) of Lemma 2.1 highlights that $H(r)$ is non-increasing and convex. Moreover, parts (b) and (c) together imply that $r=f^{*}$ is the unique root of $H(r)=0$. Therefore, one can use a root finding procedure to generate both a sequence of level parameters $r^{(1)},r^{(2)},\dots$ that converges to $f^{*}$ and an associated vector ${\mathbf{x}}^{(k)}:=\operatorname*{arg\,min}_{{\mathbf{x}}\in\mathcal{X}}\mathcal{P}(r^{(k)},{\mathbf{x}})$ at each iteration $k$. Computationally, when a level parameter $r^{(k^{*})}\approx f^{*}$ is found, the solution ${\mathbf{x}}^{(k^{*})}:=\operatorname*{arg\,min}_{{\mathbf{x}}\in\mathcal{X}}\mathcal{P}(r^{(k^{*})},{\mathbf{x}})$ provides an “approximate” solution to (1). From the perspective of feasibility, it matters whether we have $r^{(k^{*})}<f^{*}$ or $r^{(k^{*})}>f^{*}$. To elaborate, if $r^{(k^{*})}<f^{*}$, then $H(r^{(k^{*})})>0$ by Lemma 2.1(c) and the corresponding solution ${\mathbf{x}}^{(k^{*})}$ need not be feasible to (1). On the other hand, if $r^{(k^{*})}>f^{*}$, we have $H(r^{(k^{*})})=\mathcal{P}(r^{(k^{*})},{\mathbf{x}}^{(k^{*})})<0$ from Lemma 2.1(c) and the vector ${\mathbf{x}}^{(k^{*})}$ is indeed a feasible solution. A root finding scheme that ensures $r^{(k)}>f^{*}$ at each iteration $k$ will thus return a sequence of feasible solutions ${\mathbf{x}}^{(1)},{\mathbf{x}}^{(2)},\ldots,{\mathbf{x}}^{(k^{*})}$, that is, a feasible solution path, where $k^{*}$ is such that $f^{*}<r^{(k^{*})}<f^{*}+\epsilon$ for a given $\epsilon>0$ and, in addition, we have $f_{0}({\mathbf{x}}^{(k^{*})})\leq r^{(k^{*})}$ from $\mathcal{P}(r^{(k^{*})},{\mathbf{x}}^{(k^{*})})<0$. These inequalities imply that $f_{0}({\mathbf{x}}^{(k^{*})})-f^{*}\leq\epsilon$. Thus, ${\mathbf{x}}^{(k^{*})}$ is an $\epsilon$-optimal and feasible solution to (1) and it follows that solving SOEC can be cast as a root-finding problem involving $H(r)$.
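To make the role of the sign of $r-f^{*}$ concrete, the following is a minimal sketch that evaluates $H(r)$ by brute-force grid search on a toy deterministic instance. The instance (minimize $x^{2}$ over $[-5,5]$ subject to $1-x\leq 0$, so that $f^{*}=1$), the grid, and all function names are illustrative assumptions and do not come from the paper; expectations are replaced by exact function values.

```python
"""Toy illustration of the level-set function H(r) = min_x P(r, x).

Assumed instance (hypothetical, not from the paper): minimize f0(x) = x**2
over X = [-5, 5] subject to f1(x) = 1 - x <= r1 with r1 = 0, so f* = 1.
"""

R1 = 0.0       # right-hand side of the single constraint
F_STAR = 1.0   # optimal value of the toy instance, attained at x = 1
GRID = [-5.0 + 0.001 * i for i in range(10001)]  # discretization of X


def f0(x):
    """Objective of the toy instance."""
    return x * x


def f1(x):
    """Constraint function of the toy instance (feasible iff f1(x) <= R1)."""
    return 1.0 - x


def P(r, x):
    """P(r, x) = max{f0(x) - r, f1(x) - r1}."""
    return max(f0(x) - r, f1(x) - R1)


def H(r):
    """Return (approximate H(r), grid minimizer) by exhaustive search."""
    best_x = min(GRID, key=lambda x: P(r, x))
    return P(r, best_x), best_x


def is_feasible(x):
    """Check the original constraint f1(x) <= r1."""
    return f1(x) - R1 <= 0.0


if __name__ == "__main__":
    for r in (0.5, 1.0, 2.0):
        h, x = H(r)
        print(f"r={r:.1f}  H(r)={h:+.4f}  x={x:.3f}  feasible={is_feasible(x)}")
```

On this instance the minimizer for a level below $f^{*}$ violates the constraint, while the minimizer for a level above $f^{*}$ satisfies it, which is the asymmetry that motivates approaching the root from above.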

Applying a root-finding algorithm to solve $H(r)=0$ requires the exact computation of $H(r)$ at each iteration, which is difficult due to the nontrivial stochastic optimization in (2). Hence, we consider an inexact root-finding method, henceforth stochastic feasible level set method (SFLS), extending what is done in Lin et al. (2018b) and Aravkin et al. (2019) in a deterministic setting. Level set methods require an oracle to compute an approximation $U(r)$ of $H(r)$ . This approximation is used to update $r$ . A key element that we develop as part of SFLS is the notion of a stochastic oracle, which we introduce next.

Definition 2.2 (Stochastic Oracle)

Given $r>f^{*}$, $\epsilon>0$, and $\delta\in(0,1)$, a stochastic oracle $\mathcal{A}(r,\epsilon,\delta)$ returns a value $U(r)$ and a vector $\hat{\mathbf{x}}\in\mathcal{X}$ that satisfy the inequalities $\mathcal{P}(r,\hat{\mathbf{x}})-H(r)\leq\epsilon$ and $|U(r)-H(r)|\leq\epsilon$ with a probability of at least $1-\delta$.

Lemma 2.3 clarifies the importance of the conditions underpinning the above definition to ensure a feasible solution to (1).

Lemma 2.3

Given $r>f^{*}$, $0<\epsilon\leq-\frac{\theta-1}{\theta+1}H(r)$, $\delta\in(0,1)$, and $\theta>1$, the vector $\hat{\mathbf{x}}\in\mathcal{X}$ returned by a stochastic oracle $\mathcal{A}(r,\epsilon,\delta)$ defines a feasible solution to (1) with probability of at least $1-\delta$.

This lemma states that a stochastic oracle can recover a high probability feasible solution provided the optimality tolerance $\epsilon$ is at most $-\frac{\theta-1}{\theta+1}H(r)$.
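The reasoning behind Lemma 2.3 can be traced in one chain of inequalities. The derivation below is a sketch assembled from Definition 2.2 and Lemma 2.1(c), not the paper's proof reproduced verbatim. On the event of probability at least $1-\delta$ on which the oracle guarantees hold, and using $H(r)<0$ for $r>f^{*}$:

```latex
\begin{align*}
\max_{i=1,\dots,m}\{f_{i}(\hat{\mathbf{x}})-r_{i}\}
  &\le \mathcal{P}(r,\hat{\mathbf{x}})
     && \text{(definition of } \mathcal{P}\text{)} \\
  &\le H(r)+\epsilon
     && \text{(oracle guarantee)} \\
  &\le H(r)-\frac{\theta-1}{\theta+1}\,H(r)
     && \text{(bound on } \epsilon\text{)} \\
  &= \frac{2}{\theta+1}\,H(r) \;<\; 0
     && \text{(Lemma 2.1(c))}.
\end{align*}
```

Every constraint is therefore satisfied strictly by $\hat{\mathbf{x}}$ on that event, and the same chain gives $f_{0}(\hat{\mathbf{x}})<r$.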

Algorithm 1 formalizes the steps of SFLS to find an approximate root to $H(r)=0$. Its inputs include a stochastic oracle $\mathcal{A}$; an initial level parameter value $r^{(0)}>f^{*}$, which exists because we can set $r^{(0)}=f_{0}(\tilde{\mathbf{x}})$ by Assumption 2 (strict feasibility); optimality and error tolerances $\epsilon_{\text{opt}}$ and $\epsilon_{\mathcal{A}}$, respectively; a probability $\delta$; and a parameter $\theta$ that defines a step length as $1/(2\theta)$. SFLS begins from the level set defined by $r^{(0)}$. At each iteration $k$ it executes lines 3 through 9. In line 3, SFLS computes a probability $\delta^{(k)}$ that is used in the stochastic oracle call of line 4 to obtain an approximation $U(r^{(k)})$ and a high probability feasible solution ${\mathbf{x}}^{(k)}$. The probability $\delta^{(k)}$ decreases with the iteration count $k$, that is, the probabilistic guarantee required of the stochastic oracle becomes more stringent to ensure the entire solution path is feasible with probability of at least $1-\delta$. Lines 5-7 model the termination condition, which involves checking whether the approximation $U(r^{(k)})$ is greater than or equal to $-\epsilon_{\text{opt}}$. If this condition holds, then the algorithm halts and returns the incumbent solution ${\mathbf{x}}^{(k)}$. Otherwise, $r^{(k)}$ is updated to $r^{(k+1)}$ in line 8 using $U(r^{(k)})$ and $\theta$. Line 9 increments the iteration counter. While SFLS belongs to the family of level set approaches, it differs from known deterministic level set methods (see, e.g., Lin et al. 2018b and Aravkin et al. 2019) in its update step, termination criterion, and stochastic oracle.
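The outer loop described above can be sketched in a few lines of Python. Two details are assumptions made here, since the text only says that $\delta^{(k)}$ decreases with $k$ and that the update uses $U(r^{(k)})$ and $\theta$: the schedule $\delta^{(k)}=\delta/2^{k+1}$ and the update $r^{(k+1)}=r^{(k)}+U(r^{(k)})/(2\theta)$. The function names and the oracle `toy_oracle` are hypothetical; the oracle is an exact stand-in for a one-dimensional toy problem, not the stochastic oracle developed in the paper.

```python
"""Sketch of the SFLS outer loop on a toy problem (illustrative only)."""


def sfls(oracle, r0, eps_opt, eps_A, delta, theta, max_iter=10_000):
    """Run the level-set loop; return (solution, final level, path)."""
    r = r0
    path = []  # (level, solution) pairs, i.e. the solution path
    for k in range(max_iter):
        delta_k = delta / 2 ** (k + 1)      # line 3: ASSUMED schedule
        U, x = oracle(r, eps_A, delta_k)    # line 4: oracle call
        path.append((r, x))
        if U >= -eps_opt:                   # lines 5-7: termination test
            return x, r, path
        r = r + U / (2.0 * theta)           # line 8: ASSUMED update form
    raise RuntimeError("no termination within max_iter iterations")


def toy_oracle(r, eps, delta):
    """Exact oracle for: minimize x subject to 1 - x <= 0 (so f* = 1).

    Here P(r, x) = max(x - r, 1 - x) is minimized at x = (r + 1) / 2 with
    H(r) = (1 - r) / 2, so eps and delta are ignored in this toy stand-in.
    """
    x_hat = (r + 1.0) / 2.0
    return (1.0 - r) / 2.0, x_hat


if __name__ == "__main__":
    x, r, path = sfls(toy_oracle, r0=3.0, eps_opt=0.005,
                      eps_A=1e-4, delta=0.05, theta=2.0)
    print(f"returned x={x:.5f} at level r={r:.5f} after {len(path)} oracle calls")
```

Because $U(r^{(k)})<0$ whenever the loop continues, each update lowers the level, and in this toy run every level stays above $f^{*}=1$, so every iterate on the path is feasible.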

We define the notion of an input tuple to ease the exposition of theoretical statements in the rest of the paper.

Definition 2.4 (Input tuple)

A tuple containing a subset of the elements $r,r^{(0)},\epsilon,\epsilon_{\mathcal{A}},\delta,\theta$, and $\gamma_{t}$ is an input tuple if its respective components satisfy $r>f^{*}$, $r^{(0)}>f^{*}$, $\epsilon>0$, $\epsilon_{\mathcal{A}}>0$, $\delta\in(0,1)$, $\theta>1$, and $\gamma_{t}=1/(M\sqrt{t+1})$, where $M>0$ is a constant that is formally defined in (11).

Theorem 2.5 provides the maximum number of calls to the stochastic oracle by Algorithm 1 to obtain a feasible and relative $\epsilon$ -optimal solution, which depends on a condition measure $\beta$ of SOEC (1) defined as

[TABLE]

It is easy to see that $\beta$ provides an assessment of the slope of $H(r)$ at $r=f^{*}$ . Intuitively, for an SOEC instance with a large $\beta$ (i.e., well conditioned case), a root-finding method will be able to move towards the root of $H(r)$ faster compared to an instance with a small $\beta$ (i.e., ill-conditioned case). See Figure 2.1 of Lin et al. (2018b) for a graphical illustration of this statement.

Theorem 2.5

Given an input tuple $(r^{(0)},\epsilon,\delta,\theta)$ , suppose $\epsilon_{\text{opt}}=-\frac{1}{\theta}H(r^{(0)})\epsilon$ and $\epsilon_{\mathcal{A}}=-\frac{\theta-1}{2\theta^{2}(\theta+1)}H(r^{(0)})\epsilon$ . Algorithm 1 generates a feasible solution at each iteration with a probability of at least $1-\delta$ . Moreover, it returns a relative $\epsilon$ -optimal and feasible solution with this probability in at most

[TABLE]

calls to oracle $\mathcal{A}$ .

The bound on the number of oracle calls increases with $\theta$ because both the step-length $1/(2\theta)$ and the optimality tolerance $\epsilon_{\text{opt}}$ decrease with $\theta$. The maximum number of oracle calls is also a decreasing function of both the condition measure $\beta$ and tolerance $\epsilon$, that is, SFLS requires fewer iterations for problems that are better conditioned and when $\epsilon_{\mathcal{A}}$ and $\epsilon_{\text{opt}}$ are larger. Here, both $\epsilon_{\mathcal{A}}$ and $\epsilon_{\text{opt}}$ require knowledge of $H(r^{(0)})$, which is difficult to compute exactly. We note that the dependence of $\epsilon_{\mathcal{A}}$ and $\epsilon_{\text{opt}}$ on $H(r^{(0)})$ is introduced here only to simplify the theorem and its proof, which helps readers understand the main idea behind our technique. In §5, we will show that SFLS has a similar complexity even if $H(r^{(0)})$ in $\epsilon_{\mathcal{A}}$ and $\epsilon_{\text{opt}}$ is replaced by an upper bound $\bar{U}$ with $H(r^{(0)})\leq\bar{U}<0$, where $\bar{U}$ can be computed (by Algorithm 4) at a low cost independent of $\epsilon$.
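As a small worked example of the tolerance formulas in Theorem 2.5, the sketch below evaluates $\epsilon_{\text{opt}}$ and $\epsilon_{\mathcal{A}}$ for a given negative value standing in for $H(r^{(0)})$ (or for an upper bound $\bar{U}$ on it). The function name `sfls_tolerances` and the numerical inputs are illustrative assumptions.

```python
def sfls_tolerances(h_value, eps, theta):
    """Tolerances from Theorem 2.5; h_value stands in for H(r0) or a bound U_bar < 0."""
    if not (h_value < 0.0 and eps > 0.0 and theta > 1.0):
        raise ValueError("need h_value < 0, eps > 0 and theta > 1")
    eps_opt = -h_value * eps / theta
    eps_oracle = -(theta - 1.0) / (2.0 * theta ** 2 * (theta + 1.0)) * h_value * eps
    return eps_opt, eps_oracle


# Illustrative numbers only: h_value = -0.625, eps = 0.1.
eps_opt_2, eps_oracle_2 = sfls_tolerances(-0.625, 0.1, theta=2.0)
eps_opt_5, eps_oracle_5 = sfls_tolerances(-0.625, 0.1, theta=5.0)
```

Comparing the two calls shows $\epsilon_{\text{opt}}$ shrinking as $\theta$ grows, which is one of the two effects behind the growth of the oracle-call bound in $\theta$.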

SFLS relies on the availability of a valid stochastic oracle $\mathcal{A}$. Standard subgradient methods cannot be used as oracles to solve (2) since computing a deterministic subgradient of $\mathcal{P}(r,{\mathbf{x}})$ requires exact evaluations of $f_{i}$ for $i=0,1,\dots,m$ (see Bertsekas 1999 or Danskin 2012, p.737), which is challenging due to the high-dimensional expectations in the definition of these functions. Alternatively, the expectation in each $f_{i}$ can be replaced by a direct SAA to obtain a sampled version $\hat{\mathcal{P}}(r,{\mathbf{x}})$ of $\mathcal{P}(r,{\mathbf{x}})$. This replacement is also problematic because subgradients of $\hat{\mathcal{P}}(r,{\mathbf{x}})$ are biased estimates of subgradients of $\mathcal{P}(r,{\mathbf{x}})$ due to the maximization in the definition of the latter function.

To avoid this issue, we reformulate (2) into the equivalent min-max (i.e., saddle-point) form

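$$H(r)=\min_{{\mathbf{x}}\in\mathcal{X}}\max_{{\mathbf{y}}\in\mathcal{Y}}\sum_{i=0}^{m}y_{i}\left(f_{i}({\mathbf{x}})-r_{i}\right),\qquad(4)$$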

where $r_{0}:=r$ and $\mathcal{Y}:=\left\{{\mathbf{y}}=(y_{0},\dots,y_{m})^{\top}\in\mathbb{R}^{m+1}|\sum_{i=0}^{m}y_{i}=1,y_{i}\geq 0\right\}$. Given ${\mathbf{x}}\in\mathcal{X}$, it is easy to check that ${\mathbf{y}}^{*}\in\arg\max_{{\mathbf{y}}\in\mathcal{Y}}\sum_{i=0}^{m}y_{i}(f_{i}({\mathbf{x}})-r_{i})$ can be chosen as a unit vector with 1 corresponding to an index $i^{*}\in\arg\max_{i=0,1,\ldots,m}\{f_{i}({\mathbf{x}})-r_{i}\}$ and zeros for the remaining indices. Let $\Xi:=\Xi_{0}\times\Xi_{1}\times\ldots\times\Xi_{m}$, ${\bm{\xi}}=(\xi_{0},\xi_{1},\dots,\xi_{m})^{\top}\in\Xi$, $\Phi({\mathbf{x}},{\mathbf{y}},{\bm{\xi}}):=\sum_{i=0}^{m}y_{i}(F_{i}({\mathbf{x}},\xi_{i})-r_{i}),$ and $\phi({\mathbf{x}},{\mathbf{y}}):=\mathbb{E}\left[\Phi({\mathbf{x}},{\mathbf{y}},{\bm{\xi}})\right]$, where to ease notation we suppress the dependence of $\phi$ and $\Phi$ on the level parameter $r$ since it is always equal to a fixed value when these functions are invoked. Therefore, (4) can be reformulated as

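$$H(r)=\min_{{\mathbf{x}}\in\mathcal{X}}\max_{{\mathbf{y}}\in\mathcal{Y}}\phi({\mathbf{x}},{\mathbf{y}}).\qquad(5)$$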

Let $\hat{\phi}({\mathbf{x}},{\mathbf{y}})$ be an SAA of $\phi({\mathbf{x}},{\mathbf{y}})$ . Subgradients of $\hat{\phi}({\mathbf{x}},{\mathbf{y}})$ provide an unbiased estimate of subgradients of $\phi({\mathbf{x}},{\mathbf{y}})$ because there is no nonlinear operator (e.g., maximization) acting on the expectation defining $\phi$ . The oracles that we discuss for SFLS in §§3-4 will thus solve (5).
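The effect of a nonlinear operator acting on a sample average can be illustrated with a short simulation. The sketch below is a value-level analogue of the subgradient statement rather than a reproduction of it: it uses two hypothetical functions whose true values are both zero, observed with standard normal noise, and compares the average of $\max_i \hat{f}_i$ with the average of the linear combination $\sum_i y_i\hat{f}_i$ for a fixed ${\mathbf{y}}$. The sample sizes and the name `saa_bias_demo` are assumptions.

```python
import numpy as np


def saa_bias_demo(n=10, reps=20000, seed=0):
    """Compare the bias of max-of-SAA with that of a fixed linear combination of SAAs."""
    rng = np.random.default_rng(seed)
    # Noisy observations of two functions whose true values are both 0.
    samples = rng.normal(0.0, 1.0, size=(reps, n, 2))
    saa = samples.mean(axis=1)                 # shape (reps, 2): one SAA pair per replication
    max_of_saa = saa.max(axis=1).mean()        # estimates E[max_i f_hat_i]; the true max is 0
    y = np.array([0.5, 0.5])                   # a fixed point of the simplex
    linear_of_saa = (saa @ y).mean()           # estimates E[sum_i y_i f_hat_i]; the true value is 0
    return max_of_saa, linear_of_saa


max_of_saa, linear_of_saa = saa_bias_demo()
```

The first quantity stays visibly above zero for small `n`, whereas the second is close to zero, which is the reason for working with $\phi$ instead of $\mathcal{P}$.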

3 Idealized Stochastic Oracle

In §3.1, we present stochastic mirror descent (SMD) in the form of a stochastic oracle. In §3.2, we establish that SMD is indeed a stochastic oracle that can be used in SFLS (i.e., Algorithm 1) and then highlight computational issues that prevent its use. The discussion here serves a dual role. First, it provides practical motivation and sets the stage for developing a tractable stochastic oracle in §4. Second, it provides basic concepts on primal-dual methods needed throughout the paper, also making the paper more accessible to readers potentially unfamiliar with such methods.

3.1 Stochastic Mirror Descent

Stochastic mirror descent (SMD) (Nemirovski et al. 2009) is a well-known primal-dual method for solving saddle-point problems such as (5). SMD updates primal and dual variables ${\mathbf{x}}$ and ${\mathbf{y}}$ of (5), respectively, by employing stochastic subgradients of $\phi({\mathbf{x}},{\mathbf{y}})$ and a projection operator. Let $F^{\prime}_{i}({\mathbf{x}},\xi_{i})\in\partial F_{i}({\mathbf{x}},\xi_{i})$ for $i=0,1,\dots,m$ , where $\partial$ is the subgradient operator. We denote the stochastic subgradient vector of $\phi({\mathbf{x}},{\mathbf{y}})$ by

[TABLE]

The projection employed by SMD relies on a distance function, known as Bregman divergence, that has as its argument ${\mathbf{z}}:=({\mathbf{x}},{\mathbf{y}})$ and operates over $\mathcal{Z}:=\mathcal{X}\times\mathcal{Y}$. The space $\mathcal{Z}$ is equipped with a convex and continuously differentiable distance generating function $\omega_{z}({\mathbf{z}})$ with modulus 1, and $\mathcal{Z}^{o}:=\{{\mathbf{z}}\in\mathcal{Z}|\partial\omega_{z}({\mathbf{z}})\neq\emptyset\}$ denotes the set of points at which the subdifferential of $\omega_{z}$ is nonempty. The Bregman divergence $V({\mathbf{z}}^{\prime},{\mathbf{z}}):\mathcal{Z}^{o}\times\mathcal{Z}\rightarrow\mathbb{R}_{+}$ expressed using $\omega_{z}$ is

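$$V({\mathbf{z}}^{\prime},{\mathbf{z}}):=\omega_{z}({\mathbf{z}})-\omega_{z}({\mathbf{z}}^{\prime})-\nabla\omega_{z}({\mathbf{z}}^{\prime})^{\top}({\mathbf{z}}-{\mathbf{z}}^{\prime}).$$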

The projection operator (or prox-mapping), for any ${\bm{\zeta}}\in\mathbb{R}^{d+m+1}$ and ${\mathbf{z}}^{\prime}\in\mathcal{Z}^{o}$, is defined as $P_{{\mathbf{z}}^{\prime}}({\bm{\zeta}}):=\arg\min_{{\mathbf{z}}\in\mathcal{Z}}\left\{{\bm{\zeta}}^{\top}({\mathbf{z}}-{\mathbf{z}}^{\prime})+V({\mathbf{z}}^{\prime},{\mathbf{z}})\right\}$.
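For two common distance generating functions the prox-mapping has a closed form, which the following sketch illustrates. It assumes, purely for the example, that $\mathcal{X}$ is a box so that the Euclidean prox reduces to clipping; the entropy prox on the simplex is the familiar multiplicative (exponentiated) update and requires a strictly positive ${\mathbf{y}}^{\prime}$. The function names are hypothetical.

```python
import numpy as np


def prox_euclidean_box(x_prev, zeta_x, lo, hi):
    """Prox-mapping for omega_x(x) = 0.5 * ||x||_2^2 when X is the box [lo, hi] (assumed)."""
    # argmin_x zeta^T (x - x') + 0.5 * ||x - x'||^2 over a box is a clipped gradient step.
    return np.clip(np.asarray(x_prev, float) - np.asarray(zeta_x, float), lo, hi)


def prox_entropy_simplex(y_prev, zeta_y):
    """Prox-mapping for omega_y(y) = sum_i y_i ln y_i on the probability simplex."""
    # argmin_y zeta^T y + KL(y || y') gives y_i proportional to y'_i * exp(-zeta_i).
    logits = np.log(np.asarray(y_prev, float)) - np.asarray(zeta_y, float)
    logits -= logits.max()          # shift for numerical stability; cancels after normalizing
    weights = np.exp(logits)
    return weights / weights.sum()


x_new = prox_euclidean_box([1.0, 2.5], [2.0, -1.0], lo=0.0, hi=3.0)
y_new = prox_entropy_simplex([0.5, 0.5], [np.log(2.0), 0.0])
```

With these inputs the primal step is clipped to the box boundary in both coordinates and the dual step moves mass toward the coordinate with the smaller entry of ${\bm{\zeta}}$.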

Algorithm 2 summarizes the steps of SMD presented in the form of a stochastic oracle. The inputs to this algorithm are a level parameter $r\in\mathbb{R}$ , an optimality tolerance $\epsilon_{\mathcal{A}}>0$ , a probability $\delta\in(0,1)$ , an iteration limit $W(\delta,\epsilon_{\mathcal{A}})$ (we specify this limit later in Proposition 3.1), and a step-length rule $\gamma_{t}$ for all $t\in\mathbb{Z}_{+}$ . Line 2 sets the initial solution ${\mathbf{z}}^{(0)}=({\mathbf{x}}^{(0)},{\mathbf{y}}^{(0)})$ . Algorithm 2 executes lines 4 and 5 for $W(\delta,\epsilon_{\mathcal{A}})$ iterations. At iteration $t$ , line 4 constructs a stochastic subgradient $G({\mathbf{x}}^{(t)},{\mathbf{y}}^{(t)},{\bm{\xi}}^{(t)})$ using a sample ${\bm{\xi}}^{(t)}$ of the random variables underlying the expectations in the objective and constraints of (1). Line 5 computes a step-length weighted average $\bar{\mathbf{z}}^{(t)}$ of past solutions. It also uses the stochastic subgradient computed in line 4 and a projection operator to find an updated solution ${\mathbf{z}}^{(t+1)}$ . After exiting the for loop, line 7 uses the averaged primal solution $\bar{\mathbf{x}}^{(t)}$ to compute an upper bound $\max_{{\mathbf{y}}\in\mathcal{Y}}\phi(\bar{\mathbf{x}}^{(t)},{\mathbf{y}})$ on $H(r)$ . The pair $(U(\bar{\mathbf{x}}^{(t)}),\bar{\mathbf{x}}^{(t)})$ is returned in line 8.
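The loop structure of Algorithm 2 can be sketched on a toy instance. Everything specific below is an assumption chosen for illustration: the one-dimensional problem (minimize $x$ subject to $1-x\leq 0$ on $[0,3]$, observed with additive standard normal noise), the use of a single step length for both the primal and dual blocks instead of the scaled $\omega_{z}$, the sign convention (descent in ${\mathbf{x}}$, ascent in ${\mathbf{y}}$), and the name `smd_toy`. The sketch returns only the averaged iterates and does not evaluate the upper bound $U$ of line 7.

```python
import numpy as np


def smd_toy(r=3.0, T=20000, c=2.0, seed=0):
    """Stochastic mirror descent on min_x max_y sum_i y_i (f_i(x) - r_i), toy instance."""
    rng = np.random.default_rng(seed)
    x, y = 1.5, np.array([0.5, 0.5])           # initial primal and dual points
    sum_gamma, x_bar, y_bar = 0.0, 0.0, np.zeros(2)
    for t in range(T):
        gamma = 1.0 / (c * np.sqrt(t + 1.0))   # step-length rule gamma_t = 1 / (c * sqrt(t + 1))
        xi = rng.normal(0.0, 1.0, size=2)      # one fresh sample per function
        # Sampled values F_0(x, xi_0) - r and F_1(x, xi_1) - r_1 with r_1 = 0.
        vals = np.array([x + xi[0] - r, 1.0 - x + xi[1]])
        g_x = y[0] * 1.0 + y[1] * (-1.0)       # sum_i y_i F_i'(x, xi_i)
        # Step-length weighted running average of the iterates visited so far.
        sum_gamma += gamma
        x_bar += gamma * (x - x_bar) / sum_gamma
        y_bar += gamma * (y - y_bar) / sum_gamma
        # Primal step: Euclidean prox on [0, 3]. Dual step: entropy prox on the simplex.
        x = min(max(x - gamma * g_x, 0.0), 3.0)
        weights = y * np.exp(gamma * vals)
        y = weights / weights.sum()
    return x_bar, y_bar


x_bar, y_bar = smd_toy()
```

For this instance the saddle point is at $x=2$ with $H(3)=-1$, so the averaged primal iterate can be compared against a known answer.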

It is worth noting that the update in line 5 relies on subgradients of an SAA $\hat{\phi}({\mathbf{x}},{\mathbf{y}})$ (with a single sample), which provides unbiased subgradients of $\phi({\mathbf{x}},{\mathbf{y}})$ , unlike the biased subgradients that arise when working with SAAs of $\mathcal{P}(r,{\mathbf{x}})$ in the primal problem (2). In other words, a key benefit of the primal-dual reformulation (4) is that its objective $\phi({\mathbf{x}},{\mathbf{y}})$ allows the computation of unbiased subgradients after using SAAs to replace exact expectations.

3.2 Validity of Stochastic Oracle and Computational Issues

We analyze below the validity of SMD as a stochastic oracle and also discuss its computational tractability. Our analysis, based on Nemirovski et al. (2009), requires specifying the distance generating function $\omega_{z}$ introduced in §3.1 and stating a standard assumption.

To define $\omega_{z}$, we equip $\mathcal{X}$ and $\mathcal{Y}$ with their own distance-generating functions $\omega_{x}:\mathcal{X}\rightarrow\mathbb{R}$ with modulus $\alpha_{x}$ with respect to norm $\|\cdot\|_{x}$ and $\omega_{y}:\mathcal{Y}\rightarrow\mathbb{R}$ with modulus $\alpha_{y}$ with respect to norm $\|\cdot\|_{y}$. This means that $\omega_{x}$ is $\alpha_{x}$-strongly convex, continuous on $\mathcal{X}$, and continuously differentiable on the set $\mathcal{X}^{o}:=\{{\mathbf{x}}\in\mathcal{X}|\partial\omega_{x}({\mathbf{x}})\neq\emptyset\}$ of points with a nonempty subdifferential. Similarly, $\omega_{y}$ is $\alpha_{y}$-strongly convex, continuous on $\mathcal{Y}$, and continuously differentiable on $\mathcal{Y}^{o}:=\{{\mathbf{y}}\in\mathcal{Y}|\partial\omega_{y}({\mathbf{y}})\neq\emptyset\}$. Typical choices for $\|\cdot\|_{x}$ and $\|\cdot\|_{y}$ are $\|\cdot\|_{2}$ and $\|\cdot\|_{1}$, respectively. In addition, it is common to set $\omega_{x}({\mathbf{x}})=\frac{1}{2}\|{\mathbf{x}}\|^{2}_{2}$ and $\omega_{y}({\mathbf{y}})=\sum_{i=0}^{m}y_{i}\ln y_{i}$. Defining the diameters of the sets $\mathcal{X}$ and $\mathcal{Y}$ as $D_{x}:=\sqrt{\max_{{\mathbf{x}}\in\mathcal{X}}\omega_{x}({\mathbf{x}})-\min_{{\mathbf{x}}\in\mathcal{X}}\omega_{x}({\mathbf{x}})}$ and $D_{y}:=\sqrt{\max_{{\mathbf{y}}\in\mathcal{Y}}\omega_{y}({\mathbf{y}})-\min_{{\mathbf{y}}\in\mathcal{Y}}\omega_{y}({\mathbf{y}})}$, the distance-generating function associated with $\mathcal{Z}$ is

[TABLE]

Next, the following standard assumption is needed to analyze SMD as well as other methods in the rest of the paper. Denote by $g({\mathbf{x}},{\mathbf{y}})$ the expectation of the $(d+m+1)$-dimensional vector $G({\mathbf{x}},{\mathbf{y}},{\bm{\xi}})$, that is, a deterministic subgradient. Moreover, let $\|\cdot\|_{*,x}$ and $\|\cdot\|_{*,y}$ represent the dual norms of $\|\cdot\|_{x}$ and $\|\cdot\|_{y}$, respectively.

Assumption

For any $({\mathbf{x}},{\mathbf{y}},{\bm{\xi}})\in\mathcal{X}\times\mathcal{Y}\times\Xi$, there exist $F^{\prime}_{i}({\mathbf{x}},\xi_{i})\in\partial F_{i}({\mathbf{x}},\xi_{i})$ for $i=0,1,\dots,m$ such that $g({\mathbf{x}},{\mathbf{y}})=\mathbb{E}\left[G({\mathbf{x}},{\mathbf{y}},{\bm{\xi}})\right]$ is well defined and satisfies

[TABLE]

where $\partial_{x}$ and $\partial_{y}$ represent the sub-differentials with respect to ${\mathbf{x}}$ and ${\mathbf{y}}$ , respectively. Moreover, there exist positive constants $M_{x}$ , $M_{y}$ and $Q$ such that

[TABLE]

for any ${\mathbf{x}}\in\mathcal{X}$ and ${\mathbf{y}}\in\mathcal{Y}$ , which indicate that $G_{x}$ and $G_{y}$ have a light-tailed distribution and their moments are bounded.

Proposition 3.1 presents the iteration complexity of SMD, which follows from results in Nemirovski et al. (2009), and in addition, establishes that SMD is a valid stochastic oracle, that is, it satisfies Definition 2.2. The proof of this proposition relies on establishing that the primal-dual gap $U(\bar{\mathbf{x}}^{(t)})-L(\bar{\mathbf{y}}^{(t)})$ is guaranteed to be less than a given $\epsilon_{\mathcal{A}}>0$ with a probability of at least $1-\delta$ for a given $\delta\in(0,1)$ , where $L(\bar{\mathbf{y}}^{(t)}):=\min_{{\mathbf{x}}\in\mathcal{X}}\phi({\mathbf{x}},\bar{\mathbf{y}}^{(t)})$ and $U(\bar{\mathbf{x}}^{(t)})$ is computed in Algorithm 2. We also require the following constants:

[TABLE]

Proposition 3.1

Given an input tuple $(r,\epsilon_{\mathcal{A}},\delta,\gamma_{t})$ , the SMD solution $(\bar{\mathbf{x}}^{(t)},\bar{\mathbf{y}}^{(t)})$ satisfies $U(\bar{\mathbf{x}}^{(t)})-L(\bar{\mathbf{y}}^{(t)})\leq\epsilon_{\mathcal{A}}$ with probability at least $1-\delta$ in at most

[TABLE]

gradient iterations. As a consequence, SMD is a valid stochastic oracle with $W\geq W(\delta,\epsilon_{\mathcal{A}})$ .

When solving (5), the dependence of the iteration complexity on $\epsilon_{\mathcal{A}}$ in Proposition 3.1 has an additional $\ln(1/\epsilon_{\mathcal{A}})$ term compared to the known SMD complexity dependence of $1/\epsilon_{\mathcal{A}}^{2}$ for solving an unconstrained version of this problem. Moreover, the analogous complexity dependence on $\delta$ inside logarithmic terms (see definition of $\Omega(\delta)$ ) in this proposition is comparable to the unconstrained case.

We note that SMD is a valid stochastic oracle, exhibits a favorable iteration complexity, and is based on unbiased subgradients of $\phi({\mathbf{x}},{\mathbf{y}})$ . Nevertheless, SMD is not directly implementable because the upper bound $U(\bar{\mathbf{x}}^{(t)})$ is challenging to compute exactly as the definition of $\phi({\mathbf{x}},{\mathbf{y}})$ embeds expectations. Replacing these expectations by an SAA leads to a biased estimate of the upper bound $U(\bar{\mathbf{x}}^{(t)})$ . This bias can be reduced by using a large number of samples but doing this would lead to an approach with high data complexity, which we would like to avoid. In other words, although our saddle-point formulation facilitates the computation of unbiased subgradients needed by SMD to obtain a near optimal and high probability feasible solution, its upper bound $U(\bar{\mathbf{x}}^{(t)})$ , which serves as the constant $U(r)$ returned by the oracle (see Definition 2.2), cannot be computed.

The aforementioned bound computation challenge is further exacerbated if one wishes to change the stopping criterion of Algorithm 2 (i.e., line 3) from a maximum iteration limit to a bound on the primal-dual gap $U(\bar{\mathbf{x}}^{(t)})-L(\bar{\mathbf{y}}^{(t)})$. In the latter case, implementing SMD would also entail the computation of the lower bound $L(\bar{\mathbf{y}}^{(t)})$, which suffers from analogous bias and data complexity issues when expectations in its definition are replaced by SAAs. In addition, the optimization problem over ${\mathbf{x}}$ in the definition of $L(\bar{\mathbf{y}}^{(t)})$ is in general a high-dimensional non-smooth convex optimization problem and solving such a problem multiple times is computationally burdensome. Therefore, it is a priori unclear how one should go about designing a computationally tractable oracle to overcome these issues and what the iteration complexity of such an oracle would be.

4 Tractable Stochastic Oracle

In this section, we design a computationally viable stochastic oracle by combining SMD and an online validation technique (Lan et al. 2012), and in particular, extending the latter technique originally proposed for minimization problems to handle min-max saddle point problems. This oracle overcomes the issues highlighted at the end of §3.2 by defining bounds that are (i) tractable to compute with low data complexity and (ii) do not suffer from the bias issue when replacing expectations in their definitions by SAAs, as was the case with the bounds $U(\bar{\mathbf{x}}^{(t)})$ and $L(\bar{\mathbf{y}}^{(t)})$ . We present our algorithm in §4.1 and prove that it is a stochastic oracle in §4.2, where we also analyze its complexity.

4.1 Online Validation Based Stochastic Mirror Descent

Algorithm 3 contains the steps of our proposed online validation based stochastic mirror descent (OVSMD) scheme, which differs from Algorithm 2 only in line 7, where the upper bound $U(\bar{\mathbf{x}}^{(t)})$ on $H(r)$ is replaced by $\hat{u}_{*}^{(t)}$ . The quantity $\hat{u}_{*}^{(t)}$ is an approximation of the following upper bound obtained using the online validation technique:

[TABLE]

This upper bound holds because

[TABLE]

where the first inequality is true because $g_{y}$ is a subgradient with respect to ${\mathbf{y}}$ of the function $\phi({\mathbf{x}},{\mathbf{y}})$ , which is concave in ${\mathbf{y}}$ , and the second inequality follows directly from the convexity of $\phi({\mathbf{x}},{\mathbf{y}})$ in ${\mathbf{x}}$ . Therefore, we have

[TABLE]

that is, $u^{(t)}_{*}$ is an upper bound on $H(r)$, albeit potentially weaker than $U(\bar{\mathbf{x}}^{(t)})$. Computing $u^{(t)}_{*}$ requires the exact evaluations of $\phi$, $g_{x}$ and $g_{y}$, which are not in general available because they involve expectations. In contrast, the term $\hat{u}^{(t)}_{*}$ computed in line 7 of Algorithm 3, which is a stochastic approximation of $u^{(t)}_{*}$, can be easily computed in an online manner by solving a simple linear optimization problem.

As discussed in §3.2, replacing the iteration limit based stopping criterion by one that approximates an optimality gap requires a lower bound on $H(r)$ . Following a similar argument to the upper bounding case above, we define the lower bound

[TABLE]

Since $\phi({\mathbf{x}},{\mathbf{y}})$ is convex in ${\mathbf{x}}$ , it follows that $l^{(t)}_{*}\leq L(\bar{\mathbf{y}}^{(t)})=\min_{{\mathbf{x}}\in\mathcal{X}}\phi({\mathbf{x}},\bar{\mathbf{y}}^{(t)})\leq H(r)$ . Although $l^{(t)}_{*}$ is in general a weaker lower bound than $L(\bar{\mathbf{y}}^{(t)})$ , the former bound is computed by solving a linear optimization problem as opposed to the potentially challenging non-smooth convex optimization problem defining the latter bound. Finally, we employ an online validation based approximation of $l^{(t)}_{*}$ to avoid computing expectations and obtain

[TABLE]

Despite the computational tractability of $\hat{u}^{(t)}_{*}$ and $\hat{l}_{*}^{(t)}$ , these are stochastic quantities and subject to noise. Hence they do not always provide valid bounds on $H(r)$ . In §4.2, we show that $\hat{l}^{(t)}_{*}$ and $\hat{u}^{(t)}_{*}$ are nevertheless sufficiently close to $H(r)$ with high probability after a finite number of iterations (see Theorem 4.2).
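To make the online computation of the two estimates concrete, the sketch below assumes that they take the usual online-validation form: step-length weighted averages of the sampled linearizations of $\Phi$, maximized over ${\mathbf{y}}\in\mathcal{Y}$ for the upper estimate and minimized over ${\mathbf{x}}\in\mathcal{X}$ for the lower estimate. Since the displayed definitions are not reproduced here, this form is an assumption, as are the box shape of $\mathcal{X}$ and the function and argument names.

```python
import numpy as np


def online_validation_bounds(gammas, vals, ys, grads_x, xs, lo, hi):
    """Assumed form of the stochastic lower and upper estimates of H(r).

    gammas:  (T,)      step lengths gamma_t
    vals:    (T, m+1)  sampled values F_i(x_t, xi_t) - r_i
    ys:      (T, m+1)  dual iterates y_t
    grads_x: (T, d)    sampled primal subgradients sum_i y_{t,i} F_i'(x_t, xi_t)
    xs:      (T, d)    primal iterates x_t
    lo, hi:  bounds of the box X (the box shape is an assumption)
    """
    w = np.asarray(gammas, float)
    w = w / w.sum()
    vals, ys = np.asarray(vals, float), np.asarray(ys, float)
    grads_x, xs = np.asarray(grads_x, float), np.asarray(xs, float)
    # Phi is linear in y, so its linearization in y is exact and the maximum of the
    # averaged linear function over the simplex is its largest coordinate.
    u_hat = (w @ vals).max()
    # Linearize in x around each iterate, average, then minimize the linear function over the box.
    phis = (ys * vals).sum(axis=1)
    const = w @ (phis - (grads_x * xs).sum(axis=1))
    g_bar = w @ grads_x
    l_hat = const + np.minimum(g_bar * lo, g_bar * hi).sum()
    return l_hat, u_hat


# Noise-free toy trajectory for min x s.t. 1 - x <= 0 on [0, 3] with r = 3, where H(r) = -1.
l_hat, u_hat = online_validation_bounds(
    gammas=[1.0, 1.0],
    vals=[[-2.0, 0.0], [0.0, -2.0]],   # (x - r, 1 - x) at x = 1 and x = 3
    ys=[[1.0, 0.0], [1.0, 0.0]],
    grads_x=[[1.0], [1.0]],
    xs=[[1.0], [3.0]],
    lo=np.array([0.0]), hi=np.array([3.0]),
)
```

Both quantities only need running sums over the trajectory, so they can be updated inside the main loop without storing the iterates; the arrays are used here only to keep the example short.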

4.2 Validity of Stochastic Oracle and Iteration Complexity

We establish here the validity of OVSMD (i.e., Algorithm 3) as a stochastic oracle and derive its iteration complexity. Proposition 4.1 contains the two main ingredients underlying the analysis of OVSMD. Part (i) of this proposition shows that for a given $\epsilon_{\mathcal{A}}>0$ the inequality $u^{(t)}_{*}-l^{(t)}_{*}\leq\epsilon_{\mathcal{A}}$ holds with high probability when $t$ is sufficiently large. In other words, the deterministic quantities $u^{(t)}_{*}$ and $l^{(t)}_{*}$ computed using the OVSMD solutions provide “good” deterministic estimates of the level set function $H(r)$. This is not directly useful since OVSMD can only compute stochastic approximations of these quantities, as already discussed in §4.1. Part (ii) of Proposition 4.1 establishes that $\hat{u}^{(t)}_{*}$ and $\hat{l}^{(t)}_{*}$ are respectively close stochastic approximations of $u^{(t)}_{*}$ and $l^{(t)}_{*}$ at convergence with high probability. It then follows that the quantities $\hat{u}^{(t)}_{*}$ and $\hat{l}^{(t)}_{*}$ are “good” stochastic estimates of the level set function, which, in particular, allows OVSMD to be used as a stochastic oracle.

Proposition 4.1

Given an input tuple $(r,\epsilon_{\mathcal{A}},\delta,\gamma_{t})$ , OVSMD computes $({\mathbf{x}}^{(t)},{\mathbf{y}}^{(t)})$ , $t=1,2,3,\ldots,$ such that:

(i)

The inequality $\text{Prob}\{u^{(t)}_{*}-l^{(t)}_{*}>\epsilon_{\mathcal{A}}\}\leq\delta$ holds in at most

[TABLE]

gradient iterations.

(ii)

The inequalities $\text{Prob}\{|\hat{l}^{(t)}_{*}-l^{(t)}_{*}|>\epsilon_{\mathcal{A}}\}\leq\delta$ and $\text{Prob}\{|\hat{u}^{(t)}_{*}-u^{(t)}_{*}|>\epsilon_{\mathcal{A}}\}\leq\delta$ hold in at most

[TABLE]

gradient iterations.

Leveraging Proposition 4.1, Theorem 4.2 shows that OVSMD is a valid stochastic oracle and also presents its iteration complexity.

Theorem 4.2

Given an input tuple $(r,\epsilon_{\mathcal{A}},\delta,\gamma_{t})$, OVSMD guarantees $\mathcal{P}(r,\bar{\mathbf{x}}^{(t)})-H(r)\leq\epsilon_{\mathcal{A}}$ and $|\hat{u}_{*}^{(t)}-H(r)|\leq\epsilon_{\mathcal{A}}$ with probability at least $1-\delta$ in at most

[TABLE]

gradient iterations. As a consequence, OVSMD is a valid stochastic oracle with $T\geq T(\delta,\epsilon_{\mathcal{A}})$ .

Despite OVSMD being a tractable oracle, the dependence of its iteration complexity on both $\epsilon_{\mathcal{A}}$ and $\delta$ is identical to the analogous dependence seen with the idealized SMD oracle analyzed in Proposition 3.1. Moreover, in terms of $\epsilon_{\mathcal{A}}$, the complexity of OVSMD is only a $\ln(1/\epsilon_{\mathcal{A}})$ factor worse than the known complexity of SMD in the unconstrained case, where feasibility is not a concern.

5 SFLS with OVSMD as its Stochastic Oracle

In this section, we provide theoretical support for the use of OVSMD as SFLS’s stochastic oracle in §5.1 and then discuss implementation guidelines in §5.2.

5.1 Theoretical Analysis

Theorems 2.5 and 4.2 can be used to derive the (gradient) iteration complexity of SFLS when using OVSMD as the stochastic oracle. We state this complexity in Corollary 5.1.

Corollary 5.1

Given an input tuple $(r^{(0)},\epsilon,\delta,\gamma_{t},\theta)$ , let $\epsilon_{\text{opt}}=-\frac{1}{\theta}H(r^{(0)})\epsilon$ and $\epsilon_{\mathcal{A}}=-\frac{\theta-1}{2\theta^{2}(\theta+1)}H(r^{(0)})\epsilon$ . Moreover, suppose OVSMD with $T=T(\delta,\epsilon_{\mathcal{A}})$ is chosen as the stochastic oracle $\mathcal{A}$ . Then SFLS returns a relative $\epsilon$ -optimal and feasible solution with probability of at least $1-\delta$ using at most

[TABLE]

OVSMD calls and

[TABLE]

gradient iterations.

This complexity result is somewhat idealistic because the inputs to SFLS, namely $\epsilon_{\text{opt}}$ and $\epsilon_{\mathcal{A}}$, require knowledge of $H(r^{(0)})$, which is difficult to compute exactly. A possible resolution is to compute an upper bound on $H(r^{(0)})$, denoted by $\bar{U}$, such that $H(r^{(0)})\leq\bar{U}<0$. If $|\bar{U}|$ is much smaller than $|H(r^{(0)})|$, then the optimality tolerance $\epsilon_{\mathcal{A}}$ will be substantially more stringent and thus lead to a larger complexity than the iteration bound in Corollary 5.1. Therefore, to obtain a complete theoretical assessment of the computational complexity of SFLS with OVSMD, it is important to incorporate the cost of finding a $\bar{U}$ that is comparable to $H(r^{(0)})$ (i.e., $|\bar{U}|=\Omega(|H(r^{(0)})|)$).

Fortunately, OVSMD can itself be used to compute the desired $\bar{U}$. We discuss the intuition behind its use for this purpose and then formally state the result. Recall that $H(r^{(0)})<0$ since $r^{(0)}>f^{*}$. We consider obtaining an upper bound $\bar{U}$ by solving (2) with $r=r^{(0)}$ and a small enough optimality gap. By Theorem 4.2, OVSMD with $r=r^{(0)}$ can guarantee $H(r^{(0)})\leq\hat{u}_{*}^{(t)}+\epsilon_{\mathcal{A}}$ with high probability. This suggests setting $\bar{U}=\hat{u}_{*}^{(t)}+\epsilon_{\mathcal{A}}$. However, it is a priori unclear how small $\epsilon_{\mathcal{A}}$ should be in order to ensure $\bar{U}<0$ and $|\bar{U}|=\Omega(|H(r^{(0)})|)$. Therefore, we run OVSMD multiple times, starting from a tolerance $\alpha^{(0)}=\bar{\alpha}$, geometrically reducing this tolerance after each run, and stopping this procedure once $\bar{U}=\hat{u}_{*}^{(h)}+\alpha^{(h)}<0$ and ${(\hat{u}_{*}^{(h)}-\alpha^{(h)})}/{(\hat{u}_{*}^{(h)}+\alpha^{(h)})}\leq\theta$ hold. We can then use Theorem 4.2 and the condition $\hat{u}_{*}^{(h)}+\alpha^{(h)}<0$ to show that ${|H(r^{(0)})|}/{|\bar{U}|}\leq{(\hat{u}_{*}^{(h)}-\alpha^{(h)})}/{(\hat{u}_{*}^{(h)}+\alpha^{(h)})}\leq\theta$, which implies $|\bar{U}|=\Omega(|H(r^{(0)})|)$. We formalize the aforementioned approach in Algorithm 4.
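The restart scheme just described can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 4: the halving factor `shrink=0.5`, the per-run probability schedule $\delta/2^{h+1}$, the function name `estimate_u_bar`, and the deterministic stand-in oracle (which returns $H(r^{(0)})+\alpha/2$ for an assumed $H(r^{(0)})=-1$) are all hypothetical.

```python
def estimate_u_bar(oracle, r0, alpha0, theta, delta, shrink=0.5, max_runs=60):
    """Run the oracle with geometrically shrinking tolerances until U_bar is usable.

    `oracle(r, alpha, delta_h)` returns (u_hat, x) with |u_hat - H(r)| <= alpha
    (with high probability in the stochastic setting).
    """
    alpha = alpha0
    for h in range(max_runs):
        delta_h = delta / 2 ** (h + 1)         # assumed probability schedule
        u_hat, _ = oracle(r0, alpha, delta_h)
        u_bar = u_hat + alpha                  # candidate upper bound on H(r0)
        # Stop once U_bar < 0 and the ratio test certifies |H(r0)| / |U_bar| <= theta.
        if u_bar < 0.0 and (u_hat - alpha) / u_bar <= theta:
            return u_bar, h
        alpha *= shrink                        # geometric reduction of the tolerance
    raise RuntimeError("no valid U_bar found within max_runs oracle runs")


# Deterministic stand-in oracle with H(r0) = -1 and an error of alpha / 2.
u_bar, runs = estimate_u_bar(lambda r, alpha, d: (-1.0 + 0.5 * alpha, None),
                             r0=3.0, alpha0=4.0, theta=2.0, delta=0.05)
```

In this run the tolerance is halved four times before both stopping conditions hold, and the returned value satisfies $H(r^{(0)})\leq\bar{U}<0$ with $|H(r^{(0)})|/|\bar{U}|=1.6\leq\theta$.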

Theorem 5.2 establishes the complexity of employing Algorithm 4 to compute $\bar{U}$ and subsequently running SFLS leveraging this computation.

Theorem 5.2

Given an input tuple $(r^{(0)},\epsilon,\delta,\gamma_{t},\theta)$ , suppose we compute $\bar{U}$ using Algorithm 4 and then execute SFLS to find a relative $\epsilon$ -optimal and feasible solution with a probability of at least $1-\delta$ using $\epsilon_{\text{opt}}=-\frac{1}{\theta}\bar{U}\epsilon$ , $\epsilon_{\mathcal{A}}=-\frac{\theta-1}{2\theta^{2}(\theta+1)}\bar{U}\epsilon$ , and OVSMD with $T=T(\delta,\epsilon_{\mathcal{A}})$ as the stochastic oracle $\mathcal{A}$ . This procedure requires in total at most

[TABLE]

OVSMD calls and

[TABLE]

gradient iterations.

Theorem 5.2 provides a realistic theoretical assessment of the computational burden of solving SOECs using SFLS. Interestingly, it shows that running Algorithm 4 to compute $\bar{U}$ before executing SFLS and replacing the unknown term $H(r^{(0)})$ in the definitions of $\epsilon_{\mathcal{A}}$ and $\epsilon_{\text{opt}}$ with the computed $\bar{U}$ value does not change the overall big- $\mathcal{O}$ oracle and gradient iteration complexities in Corollary 5.1, except for logarithmic terms.

The complexity of SFLS (combined with OVSMD) in Theorem 5.2 is comparable in terms of its dependence on $\epsilon$ and $\delta$ to the complexity of the algorithm in Yu et al. (2017), which does not ensure feasibility. This suggests that our procedure is efficient at ensuring feasibility. The cost of ensuring feasibility, however, appears in the dependence of the SFLS iteration complexity on the condition measure $\beta$ . Such dependence is absent in approaches that do not ensure feasibility.

Another relevant comparison is with the deterministic feasible level set approach (DFLS) of Lin et al. (2018b) and its variant in Lin et al. (2018a), which are both applicable to solve deterministic constrained convex optimization problems. The complexity of DFLS-based methods depends on the number of data points that define expectations and thus leads to large data complexity, and in particular, becomes infinite when expectations are defined over continuous random variables. In contrast, the complexity of SFLS in Theorem 5.2 does not depend on the number of data points. In addition, compared to DFLS, the iteration complexity of SFLS has only additional logarithmic factors involving $\epsilon$ and $\delta$, which is encouraging, as the stochastic level set algorithm (i.e., Algorithm 1) and OVSMD oracle need to contend with several challenges that arise due to the presence of expectations in SOECs.

In summary, our theoretical analysis of SFLS and comparison with known complexities of state-of-the-art approaches suggest that SFLS is effective in terms of iteration complexity at computing a high probability feasible solution path for SOECs, a much broader and more challenging class of problems than deterministic constrained convex programs. Moreover, a fully stochastic approach such as SFLS is theoretically necessary to achieve low data complexity in this context.

5.2 Implementation Guidelines

As is common with first-order methods, the implementation of SFLS requires parameter tuning. A direct implementation of SFLS in a manner consistent with Theorem 5.2 requires selecting $r^{(0)}$, $\epsilon$, $\delta$, $\theta$ and $\gamma_{t}$; estimating constants $M$ and $Q$ (needed to define $T=T(\delta,\epsilon_{\mathcal{A}})$ in OVSMD); and then computing $\bar{U}$. While these parameters can be estimated or approximated, we suggest a simpler implementation strategy that largely side-steps such tuning. Firstly, we avoid stopping SFLS by pre-specifying an optimality tolerance $\epsilon_{\text{opt}}$ and instead stop it based on an outer iteration limit. This is possible because the SFLS outer iterations only affect the suboptimality of the incumbent feasible solution, that is, being a feasible level set method, SFLS can return feasible and implementable solutions when terminated after any number of outer iterations. Secondly, instead of choosing the number of inner iterations in OVSMD as $T=T(\delta,\epsilon_{\mathcal{A}})$ based on pre-specified $\delta$ and $\epsilon_{\mathcal{A}}$, we directly specify $T$. According to (16), $T(\delta,\epsilon_{\mathcal{A}})$ is strictly monotonically decreasing in $\epsilon_{\mathcal{A}}$ and thus in $\epsilon$ so that a relative $\epsilon$-optimal and feasible solution with $\epsilon=\tilde{\mathcal{O}}(\frac{1}{\sqrt{T}})$ can be guaranteed. Corollary 5.3 establishes the convergence of SFLS in this implementation.

Corollary 5.3

Suppose we have an input tuple $(r^{(0)},\gamma_{t},\theta)$ and the iteration limit in OVSMD is $T$ . Given $\delta\in(0,1)$ , SFLS finds a relative $\epsilon$ -optimal and feasible solution with $\epsilon\leq\mathcal{O}\left(\frac{\theta^{4}\ln(1/\delta)\ln(T)\ln\left(T/\beta\right)}{\beta\sqrt{T}}\right)$ and with a probability of at least $1-\delta$ using at most $\mathcal{O}\left(\frac{\theta^{2}}{\beta}\ln\left(\frac{T}{\beta}\right)\right)$ OVSMD calls and $\mathcal{O}\left(\frac{\theta^{2}T}{\beta}\ln\left(\frac{T}{\beta}\right)\right)$ gradient iterations.

Overall, following the aforementioned strategy only requires the choice of $T,\theta$ , $r^{(0)}$ , and $\gamma_{t}$ – a significant reduction in implementation burden.

For choosing $\theta$ and $T$ , we consider a discrete set of values and tune the algorithm, that is, we test the performance of SFLS for a few iterations or data passes for each value, and select the one that leads to the largest decrease in suboptimality. Selecting $r^{(0)}$ is easy when an initial feasible solution $\tilde{{\mathbf{x}}}$ is available because we have $\mathbb{E}\left[F_{0}(\tilde{{\mathbf{x}}},\xi_{0})\right]>f^{*}$ . In this case, we estimate $\mathbb{E}\left[F_{0}(\tilde{{\mathbf{x}}},\xi_{0})\right]$ using an SAA and then set $r^{(0)}$ to a larger value to account for approximation error and ensure we have $r^{(0)}>f^{*}$ . If a feasible solution is not readily available, we can find one by applying a minor modification of Algorithm 4 to solve

[TABLE]

which does not include the term in (4) corresponding to $i=0$, that is, $f_{0}-r$. Finally, the step length can be specified as $\gamma_{t}=1/(c\sqrt{t+1})$ for a given constant $c>0$, which is tuned. While $c$ is chosen as $M$ in our theoretical analysis to simplify proofs, analogous results hold for a generic constant $c>0$. We omit these general results for the sake of brevity as they do not change the dependence of our iteration bounds on $\epsilon$, $\beta$, and $\delta$.

6 Numerical Experiments

In this section, we evaluate the numerical performance of SFLS on three diverse SOEC applications: (i) approximate linear programs for solving Markov decision processes, (ii) multi-class Neyman-Pearson classification, and (iii) learning with fairness constraints. SOECs in the first application contain expectations of continuous random variables while those in the second and third applications involve discrete random variables. Our first algorithmic benchmark is the stochastic subgradient method YNW of Yu et al. (2017) as it is the only first order approach (we are aware of) that can handle SOECs with multiple constraints. In addition, we also compare against the deterministic feasible level-set method (DFLS) of Lin et al. (2018b) because it ensures a feasible solution path. Specifically, comparing SFLS and DFLS allows us to evaluate the benefits of the reduced data complexity in our stochastic approach. In §6.1, we describe our computational setup and then report the performance of the algorithms on the applications in §§6.2-6.4.

6.1 Computational Setup

We implemented SFLS, DFLS, and YNW in Matlab running on a 64-bit Microsoft Windows 10 machine with a 2.70 GHz Intel Core i7-6820HQ CPU and 8GB of memory. We set $\omega_{x}({\mathbf{x}})=\frac{1}{2}\|{\mathbf{x}}\|_{2}^{2}$ and $\omega_{y}({\mathbf{y}})=\sum_{i=0}^{m}y_{i}\ln y_{i}$ in all three algorithms. We followed the guidelines in §5.2 when implementing SFLS and thus had to choose only $r^{(0)}$ , $\theta$ , and $\gamma_{t}$ . We based $r^{(0)}$ on the solution $\tilde{\mathbf{x}}$ . We tuned $\theta$ over the discrete set $\{1.1,2,5\}$ and $T$ over the discrete set $\{50,100,200,300\}$ . We selected $\gamma_{t}=1/(c\sqrt{t+1})$ and tuned $c$ over the set of possible values $\{0.05,0.1,1,2,5\}$ . We employed a mini-batch technique to construct the stochastic gradients in SFLS and YNW.

Similar to SFLS, DFLS solves the subproblem $\min_{{\mathbf{x}}\in\mathcal{X}}\mathcal{P}(r^{(k)},{\mathbf{x}})$ approximately in the $k$ th outer iteration and uses the returned solution ${\mathbf{x}}^{(k)}$ to update $r^{(k)}$ as $r^{(k+1)}\leftarrow r^{(k)}+\mathcal{P}(r^{(k)},{\mathbf{x}}^{(k)})/2$ . Following Lin et al. (2018b), we use the standard subgradient descent method to solve this subproblem, and the parameters $r^{(0)}$ and $\gamma_{t}$ and the inner iteration limit $T$ in DFLS are tuned in the same way as in SFLS, as described above. To apply DFLS, we constructed a deterministic version of each SOEC using SAAs of expectations. We found, consistent with Lin et al. (2018b), that using SAAs in lieu of expectations over continuous random variables in the perishable inventory control problem (first application) did not sufficiently represent the original problem even when using a large number of samples. We thus omitted DFLS as a benchmark for this application. This was not an issue for the remaining two applications because expectations are defined over discrete random variables. To avoid the quality of SAAs confounding our performance evaluation, we chose instances for these two applications such that expectations can be evaluated exactly, albeit requiring more time.
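To make the level-set update concrete, the following toy sketch applies $r^{(k+1)}\leftarrow r^{(k)}+\mathcal{P}(r^{(k)},{\mathbf{x}}^{(k)})/2$ to a one-dimensional problem of our own choosing (it is not one of the instances in this paper): minimize $x$ subject to $1-x\leq 0$ over $[0,3]$ , so that $f^{*}=1$ . The subproblem is solved in closed form here instead of by subgradient descent, which is an assumption made only to keep the example short.

```python
def level_function(r, x):
    """P(r, x) = max{f_0(x) - r, f_1(x)} with f_0(x) = x and f_1(x) = 1 - x."""
    return max(x - r, 1.0 - x)


def solve_subproblem(r, lo=0.0, hi=3.0):
    """Minimize P(r, .) over [lo, hi]; the two linear pieces cross at x = (1 + r) / 2."""
    return min(max((1.0 + r) / 2.0, lo), hi)


def level_set_method(r0, num_outer):
    """Run the update r <- r + P(r, x) / 2 and record (r, x) at each outer step."""
    r, path = r0, []
    for _ in range(num_outer):
        x = solve_subproblem(r)
        path.append((r, x))
        r = r + level_function(r, x) / 2.0
    return path


path = level_set_method(r0=3.0, num_outer=30)
```

In this toy run every level $r^{(k)}$ stays above $f^{*}=1$ , every recorded $x$ satisfies the constraint, and the gap $r^{(k)}-f^{*}$ shrinks by a factor of $3/4$ per outer iteration.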

We followed the guidance in Yu et al. (2017) to set up YNW. Specifically, we chose the control parameters $V$ and $\alpha$ as $V=\sqrt{T}$ and $\alpha=T$ , respectively, as a function of the total number of iterations $T$ , where $V$ is the weight of the gradient of the objective function and $\alpha$ is the weight of the proximal term in the updating equation of ${\mathbf{x}}$ in YNW. Similar to SFLS, we used a mini-batch technique to construct the stochastic gradients and evaluate the objective values.

6.2 Approximate Linear Programming for Markov Decision Processes

Approximate linear programs (ALPs) address the well-known curse of dimensionality associated with directly solving large-scale Markov decision processes (MDPs; Puterman 1994) by computing a value function approximation. We illustrate how our SFLS method can be applied to tackle ALPs, and thus large-scale MDPs, by considering a challenging perishable inventory control problem with partial backlogging and lead time. We begin by presenting the MDP for this problem and refer the reader to Lin et al. (2019) for its derivation and detailed application context.

Consider the management of orders for a single product with a finite life time of $I$ periods and an order lead time of $J$ periods, that is, the product takes $J$ periods to be delivered from when it is ordered and $I$ periods to perish from receipt. The state space of the MDP is represented by the vector

[TABLE]

where $q_{j}$ , $1\leq j\leq J-1$ , denotes the order quantities that will be received $j$ periods from now, and $z_{i}$ , $0\leq i\leq I-1$ , the on-hand inventory with $i$ periods of lifetime remaining. The order quantity $a$ is at most $\bar{a}$ and belongs to the interval $[0,\bar{a}]$ , which implies $z_{i}\in[0,\bar{a}]$ for $i=1,\ldots,I-1$ and $q_{j}\in[0,\bar{a}]$ for $j=1,\ldots,J-1$ . The element $z_{0}$ of the state is bounded below by $l_{s}<0$ to allow limited or partial backlogging, that is, any units backlogged beyond $|l_{s}|$ are lost sales. To ease exposition, we write ${\mathbf{s}}\in\mathcal{S}$ and $a\in\mathcal{A}$ to capture the state and action domains, respectively, and use ${\mathbf{s}}^{0}$ to represent the initial state. Assuming orders are served on a first-come-first-serve basis, the MDP state transitions as

[TABLE]

where $G$ represents stochastic demand with distribution $P_{G}$ . Moreover, the cost associated with ordering $a$ at state ${\mathbf{s}}$ is

[TABLE]

where the per unit lost sale, disposal, purchasing, holding, and backlogging costs are $c_{l}$ , $c_{d}$ , $c_{p}$ , $c_{h}$ , and $c_{b}$ , respectively; $\mathbb{E}$ is taken over $G$ ; and $\gamma\in(0,1)$ is a discount factor. The infinite horizon (discounted cost) MDP formulated using the aforementioned components can be solved using the fixed point equations

[TABLE]

ALPs approximate the high-dimensional MDP value function $V({\mathbf{s}})$ (Schweitzer and Seidmann 1985, de Farias and Van Roy 2003) using a linear combination of basis functions. We construct the ALP value function approximation using an intercept $\tau$ and $B$ basis functions $\phi_{b}:\mathcal{S}\mapsto\mathbb{R}$ , $b=1,\ldots,B$ , that is, $V({\mathbf{s}})\approx\tau+\sum_{b=1}^{B}\theta_{b}\phi_{b}({\mathbf{s}})$ , where $\theta:=(\theta_{1},\ldots,\theta_{B})\in\mathbb{R}^{B}$ is the basis function weight vector. It is common to require that the pair $(\tau,\theta)$ belongs to a compact set $\mathcal{X}$ . The VFA weights are computed by solving

[TABLE]

The feasibility of the ALP constraints is important because it ensures that the objective function of a feasible solution provides a lower bound on the optimal policy value, which can be used to assess the suboptimality of heuristic policies (see, e.g., Proposition 4 in Adelman and Mersereau 2008). Thus, in principle, methods to solve ALP would benefit from emphasizing feasibility as we do in SFLS.

Since the linear program above is semi-infinite, constraint sampling is a popular strategy to approach its solution and obtain a high-probability feasible solution (de Farias and Van Roy 2004). Specifically, suppose we sample $m$ state-action pairs $({\mathbf{s}}_{i},a_{i}),i=1,\ldots,m$ . The ALP with constraints corresponding to these samples takes the form of (1):

[TABLE]

We solve this linear program in our experiments.

Following Lin et al. (2019), we constructed instances with $I=2$ and $J=2$ , chose $P_{G}$ to be a truncated normal in the interval $[0,10]$ with mean $5$ and standard deviation $2$ , and fixed $c_{p}$ , $c_{l}$ , $\bar{a}$ , $l_{s}$ , $\gamma$ , and ${\mathbf{s}}^{0}$ equal to $20$ , $100$ , $10$ , $-10$ , $0.95$ , and $(5,0,0)$ , respectively. We experimented with three instances based on the triple $(c_{h},c_{d},c_{b})$ being equal to $(2,10,10)$ , $(5,10,8)$ , and $(2,5,10)$ . We employed eighteen basis functions ( $B=18$ ): $z_{0}$ , $z_{1}$ , $q_{1}$ , and $\{(z_{0}-\nu)_{+},(z_{0}+z_{1}-2\nu)_{+},(z_{0}+z_{1}+q_{1}-3\nu)_{+},(2\nu-z_{0}-z_{1}-q_{1})_{+},(\nu-z_{1}-q_{1})_{+}|\nu\in\{\mathbb{E}[G],G^{0.25},G^{0.5}\}\}$ , where $G^{0.25}$ and $G^{0.5}$ are the 25th and 50th percentiles of the demand distribution. The domain for the basis function weights $\mathcal{X}$ was taken to be the box $[0,3000]\times[-5,5]^{B}$ . We chose $m$ as 500.
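The basis functions listed above can be evaluated as in the following sketch. Two assumptions are made for illustration: the mean $5$ and standard deviation $2$ are taken to parameterize the untruncated normal, and the state is ordered as $(z_{0},z_{1},q_{1})$ . The helper names are hypothetical.

```python
from statistics import NormalDist


def truncated_normal_quantile(p, mean=5.0, sd=2.0, lo=0.0, hi=10.0):
    """Quantile of a normal(mean, sd) truncated to [lo, hi], via the inverse CDF."""
    base = NormalDist(mean, sd)
    f_lo, f_hi = base.cdf(lo), base.cdf(hi)
    return base.inv_cdf(f_lo + p * (f_hi - f_lo))


def pos(v):
    """Positive part (v)_+ = max(v, 0)."""
    return max(v, 0.0)


def basis_features(z0, z1, q1, nus):
    """Evaluate z0, z1, q1 and the five hinge-type functions for each nu."""
    features = [z0, z1, q1]
    for nu in nus:
        features += [
            pos(z0 - nu),
            pos(z0 + z1 - 2 * nu),
            pos(z0 + z1 + q1 - 3 * nu),
            pos(2 * nu - z0 - z1 - q1),
            pos(nu - z1 - q1),
        ]
    return features


# The truncation is symmetric about 5, so E[G] = 5 under the stated assumption.
nus = [5.0, truncated_normal_quantile(0.25), truncated_normal_quantile(0.5)]
phi = basis_features(5.0, 0.0, 0.0, nus)  # features at the initial state (5, 0, 0)
```

The value function approximation at a state is then $\tau$ plus the inner product of the weight vector with this feature list.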

In all methods, we use the initial solution $\tilde{\mathbf{x}}=(\tilde{\tau},\tilde{\theta})$ with $\tilde{\tau}=\min_{i=1,\dots,m}\frac{c(s_{i},a_{i})}{1-\gamma}$ and $\tilde{\theta}=\mathbf{0}$ which is feasible for (6.2). Our SFLS implementation uses $r^{(0)}=f_{0}(\tilde{\mathbf{x}})>f^{*}$ , $\theta=1.1$ , and the step length rule $\gamma_{t}=5/\sqrt{t+1}$ . We do not report results for DFLS because, as alluded to in §6.1, obtaining a good deterministic approximation using SAAs is non-trivial for the perishable inventory control problem. We use a mini-batch technique with a batch size of $100$ to construct stochastic estimates of the gradients and function values of $f_{i}$ , $i=0,\ldots,m$ in both SFLS and YNW. In SFLS, we choose the numbers of inner (i.e. $T$ ) and outer iterations to be $200$ and $100$ , respectively, which leads to $20,000$ stochastic gradient steps (inner iterations) in total. Hence, we choose the total number of iterations in YNW as $T=20,000$ so that both methods evaluate the same number of stochastic gradients in total, which leads to similar runtimes (about 1100 seconds).

Figure 1 displays the performance of SFLS and YNW. The $y$ -axes of the top subfigures report the optimality gap $f_{0}({\mathbf{x}})-f^{*}$ while these axes in the bottom subfigures show the feasibility of solutions by plotting $\max_{i=1,2,\dots,m}\{f_{i}({\mathbf{x}})-r_{i}\}$ . Here, $f_{i}({\mathbf{x}})$ for $i=1,2,\dots,m$ are calculated by approximating the expectations in their definitions in (6.2) with $10,000$ samples of demand $G$ . The optimal value $f^{*}$ is approximated by the objective value found by a separate run of SFLS with sufficient iterations ( $400$ outer and $500$ inner iterations). We track these measures as a function of the number of iterations performed by each algorithm on the $x$ -axis. To indicate the values of $f_{0}({\mathbf{x}})-f^{*}$ and $\max_{i=1,2,\dots,m}\{f_{i}({\mathbf{x}})-r_{i}\}$ corresponding to the high probability feasible solutions maintained at each SFLS (outer) iteration we use line markers in Figure 1. The YNW curves have no line markers as there are no outer iterations ensuring feasibility. SFLS finds a feasible solution quickly and maintains a relatively large constraint slack but YNW does not always ensure feasibility. SFLS also reduces the suboptimality of solutions faster, suggesting that SFLS is able to balance optimality and feasibility well on these instances.

6.3 Multi-class Neyman-Pearson classification

Another application that gives rise to (1) is Neyman-Pearson classification. In multi-class classification, there exist $m$ classes of data, where $\psi_{i}$ , $i=1,2,\dots,m$ , denotes a random variable defined using the distribution of data points associated with the $i$ -th class. To classify a data point $\psi_{i}$ to one of the $m$ classes, we rely on the same number of linear models ${\mathbf{x}}_{i}$ , $i=1,2,\dots,m$ . The predicted class for $\psi$ is $\argmax_{i=1,2,\dots,m}{\mathbf{x}}_{i}^{\top}\psi$ . High classification accuracy in this scheme requires ${\mathbf{x}}_{i}^{\top}\psi_{i}-{\mathbf{x}}_{l}^{\top}\psi_{i}$ with $i\neq l$ to be large and positive (Weston and Watkins 1998, Crammer and Singer 2002), that is, the classifiers have discriminatory power. Minimizing the expected loss $\mathbb{E}\left[\phi({\mathbf{x}}_{i}^{\top}\psi_{i}-{\mathbf{x}}_{l}^{\top}\psi_{i})\right]$ is one approach to promote this goal, where $\phi$ is a non-increasing convex loss function and $\mathbb{E}$ is expectation taken over $\psi_{i}$ .

Suppose misclassifying $\psi_{i}$ has a cost that depends on $i$ but not on the predicted class. We propose a model that prioritizes classes with relatively higher misclassification costs using constraints and simultaneously trains the set of $m$ linear models by solving

[TABLE]

where it is assumed (without loss of generality) that class $1$ has the highest misclassification cost and the value of $r_{i}$ is chosen to capture the misclassification cost of class $i$ . Here $\lambda$ is a regularization parameter. This formulation can be easily extended to handle the case where the misclassification cost depends on both the true and predicted classes. Indeed, (18) is of the form (1). Infeasible solutions may result in large misclassification costs for some classes, which is undesirable, and creates a need for methods that emphasize feasibility.

We created test instances using the multi-class classification LIBSVM datasets connect-4, covtype, and news20 from Chang and Lin (2019). We selected these instances as their size still allows us to run DFLS in the manner discussed in §6.1. We summarize in Table 1 the number of classes, the number of data points in each class, and the number of features in these datasets. We chose the loss function in (18) to be the hinge loss $\phi(z)=(1-z)_{+}$ . Let $\psi_{i}$ follow the empirical distribution over the dataset of class $i$ for $i=1,2,\dots,m$ , which implies that all the expectations in (18) become finite-sample averages over data classes. We set the parameters $\lambda=5$ and $r_{i}=m-1$ for $i=2,\dots,m$ .

In all methods, the solution $\tilde{\mathbf{x}}=\mathbf{0}$ is used as the initial solution and it is feasible for (18). To apply SFLS and DFLS, we chose $T=100$ , $r^{(0)}=m$ , and $\theta=1.1$ across all datasets. Note that $r^{(0)}=m>m-1=f_{0}(\mathbf{0})\geq f^{*}$ for (18). In DFLS, we solve subproblems via the standard subgradient descent method. In SFLS and DFLS, we choose step size $\gamma_{t}=0.05/\sqrt{t+1}$ for connect-4 and covtype and choose $\gamma_{t}=1/\sqrt{t+1}$ for news20. Both SFLS and YNW employed a mini-batch size of $1000$ to construct the stochastic gradients and the objective values. We chose the number of iterations in YNW so that its total number of data passes is $200$ for connect-4 and news20 and $100$ for covtype. We then terminated SFLS and DFLS when the total number of data passes they performed exceeded that of YNW.
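As a sanity check on the feasibility of $\tilde{\mathbf{x}}=\mathbf{0}$ , the sketch below evaluates a per-class hinge loss on a toy sample. It assumes that the loss for class $i$ is the sum over $l\neq i$ of the empirical average of $\phi({\mathbf{x}}_{i}^{\top}\psi_{i}-{\mathbf{x}}_{l}^{\top}\psi_{i})$ , a form that is consistent with $f_{0}(\mathbf{0})=m-1$ above, and it omits the regularization term. Classes are indexed from zero in the code, and all data values are made up.

```python
def hinge(z):
    """Hinge loss phi(z) = (1 - z)_+."""
    return max(1.0 - z, 0.0)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def class_loss(models, samples_i, i):
    """Average over class-i samples of sum_{l != i} hinge(x_i^T psi - x_l^T psi)."""
    total = 0.0
    for psi in samples_i:
        score_i = dot(models[i], psi)
        total += sum(
            hinge(score_i - dot(models[l], psi))
            for l in range(len(models))
            if l != i
        )
    return total / len(samples_i)


def predict(models, psi):
    """Predicted class: argmax_i x_i^T psi."""
    scores = [dot(x, psi) for x in models]
    return max(range(len(models)), key=scores.__getitem__)


m, d = 3, 2
zero_models = [[0.0] * d for _ in range(m)]
toy_samples = [[1.0, -2.0], [0.5, 0.3]]  # made-up feature vectors
loss_at_zero = class_loss(zero_models, toy_samples, i=1)
```

With all models equal to zero, every pairwise margin is zero and each hinge term equals one, so each class loss equals $m-1$ and the constraints with $r_{i}=m-1$ hold with equality.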

Figure 2 displays the performance of each method. The $y$ -axes of the first row report the term $f_{0}({\mathbf{x}})-f^{*}$ , that is, they focus on optimality, while the $y$ -axes in the second row show the feasibility of solutions by plotting $\max_{i=1,2,\dots,m}\{f_{i}({\mathbf{x}})-r_{i}\}$ . We track these measures as a function of the number of equivalent data passes performed by each algorithm on the $x$ -axis, where a data pass involves going over a number of data points equal to the size of the training data. This is possible since the expectations in our instances are over discrete random variables. Tracking data passes allows us to assess algorithms in terms of data complexity. Similar to Figure 1, we use line markers to indicate the values of $f_{0}({\mathbf{x}})-f^{*}$ and $\max\limits_{i=1,2,\dots,m}\{f_{i}({\mathbf{x}})-r_{i}\}$ corresponding to the solutions maintained at each SFLS outer iteration, while YNW has no line marker since it does not maintain feasibility. Since DFLS needs two data passes in each inner iteration, it can only perform one or two outer iterations with the number of data passes in Figure 2. Hence, for a better visualization, we use line markers to also indicate the inner iterations of DFLS instead of only outer iterations. In this figure, $f^{*}$ is approximated by the objective value returned by DFLS after a sufficient number of data passes (i.e., at least $5000$ data passes with $2T$ inner iterations).

On the connect-4 data set, SFLS maintains feasibility and reduces the optimality gap quite rapidly after a few data passes. Interestingly, despite being provided with an initial feasible solution, YNW decreases the optimality gap at the beginning by moving to a highly infeasible solution. The performance of both methods on the covtype data is comparable. On the news20 data set, SFLS provides feasible solutions with smaller optimality gaps sooner than the benchmark method. The comparison of SFLS and YNW highlights the advantage of SFLS in terms of feasibility. Specifically, efficient methods that do not emphasize feasibility could lead to highly infeasible solutions if terminated prematurely (e.g., the connect-4 dataset).

DFLS also maintains a feasible solution path on all the datasets, as expected. However, its optimality gap reduces at a much slower rate with the number of data passes compared to SFLS because it uses deterministic subgradients based on the entire data set. These results thus underscore the importance of developing methods, such as SFLS, with low data complexity to balance optimality and feasibility.

6.4 Learning with Fairness Constraints

We consider learning a classifier with fairness constraints. Other examples include training predictive models with constraints on coverage rates, churn rates, and stability. Please see Goh et al. (2016) for further motivation and a non-convex formulation. Here we provide a convex formulation for these problems, which can be viewed as a tractable relaxation of the version in Goh et al. (2016) that admits the SOEC structure (1).

Suppose $({\mathbf{a}},b)$ is a data point from a distribution $\mathcal{D}$ , where ${\mathbf{a}}$ is a feature vector and $b\in\{1,-1\}$ is the class label. Let $\mathcal{D}_{M}$ and $\mathcal{D}_{F}$ denote two different distributions of features (that are not necessarily labeled), which may represent male and female individuals. The goal is to train a classifier ${\mathbf{a}}^{\top}{\mathbf{x}}$ that minimizes classification loss. The correct classification of data vector ${\mathbf{a}}$ implies that $b{\mathbf{a}}^{\top}{\mathbf{x}}>0$ . One can train such a classifier subject to fairness constraints by solving

[TABLE]

where $\lambda$ is a regularization parameter, $\kappa\in(0,1]$ is a constant, $\phi$ is a non-increasing loss function,

[TABLE]

and $\sigma({\mathbf{a}}^{\top}{\mathbf{x}})\in[0,1]$ represents the probability of the (random) classifier ${\mathbf{x}}$ predicting ${\mathbf{a}}$ as positive. Therefore, $\mathbb{E}_{{\mathbf{a}}\sim\mathcal{D}_{M}}[\sigma({\mathbf{a}}^{\top}{\mathbf{x}})]$ and $\mathbb{E}_{{\mathbf{a}}\sim\mathcal{D}_{F}}[\sigma({\mathbf{a}}^{\top}{\mathbf{x}})]$ represent the percentages of instances in $\mathcal{D}_{M}$ and $\mathcal{D}_{F}$ predicted as positive, respectively. The first constraint guarantees that the percentage of the positively predicted instances in $\mathcal{D}_{F}$ is at least a $\kappa$ fraction of that in $\mathcal{D}_{M}$ . The second constraint has a similar interpretation. An analogous model was considered in Goh et al. (2016) but it involves non-convex constraints.

Observing that $\sigma({\mathbf{a}}^{\top}{\mathbf{x}})=1-\sigma(-{\mathbf{a}}^{\top}{\mathbf{x}})$ , we can reformulate the first constraint as $\mathbb{E}_{{\mathbf{a}}\sim\mathcal{D}_{M}}[\sigma({\mathbf{a}}^{\top}{\mathbf{x}})]+\mathbb{E}_{{\mathbf{a}}\sim\mathcal{D}_{F}}[\sigma(-{\mathbf{a}}^{\top}{\mathbf{x}})]/\kappa\leq 1/\kappa$ and approximate $\sigma$ by $\max\{0,0.5+z\}=(0.5+z)_{+}$ so that we obtain a convex constraint $\mathbb{E}_{{\mathbf{a}}\sim\mathcal{D}_{M}}[({\mathbf{a}}^{\top}{\mathbf{x}}+0.5)_{+}]+\mathbb{E}_{{\mathbf{a}}\sim\mathcal{D}_{F}}[(-{\mathbf{a}}^{\top}{\mathbf{x}}+0.5)_{+}]/\kappa\leq 1/\kappa$ . Applying an analogous convex approximation to the second constraint, we obtain the following convex formulation for training a classifier subject to fairness constraints:

[TABLE]

The left hand side of the first constraint will be large if the classifier ${\mathbf{x}}$ is not “fair”, that is, it makes ${\mathbf{a}}^{\top}{\mathbf{x}}$ very positive for most of ${\mathbf{a}}$ from $\mathcal{D}_{M}$ but very negative for most of ${\mathbf{a}}$ from $\mathcal{D}_{F}$ . Similarly, the left hand side of the second constraint will be large if the model ${\mathbf{x}}$ makes ${\mathbf{a}}^{\top}{\mathbf{x}}$ very positive for most of ${\mathbf{a}}$ from $\mathcal{D}_{F}$ but very negative for most of ${\mathbf{a}}$ from $\mathcal{D}_{M}$ . Choosing an appropriate $\kappa$ ensures that the obtained model is fair to both $\mathcal{D}_{M}$ and $\mathcal{D}_{F}$ . Indeed, a solution that violates constraints in this formulation translates to a classifier that discriminates against one of the two classes.
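The convexified first constraint, $\mathbb{E}_{{\mathbf{a}}\sim\mathcal{D}_{M}}[({\mathbf{a}}^{\top}{\mathbf{x}}+0.5)_{+}]+\mathbb{E}_{{\mathbf{a}}\sim\mathcal{D}_{F}}[(-{\mathbf{a}}^{\top}{\mathbf{x}}+0.5)_{+}]/\kappa\leq 1/\kappa$ , can be evaluated on empirical samples as in the sketch below. The feature vectors and the one-sided model are made up for illustration, and the helper names are hypothetical.

```python
def pos(v):
    """Positive part (v)_+ = max(v, 0)."""
    return max(v, 0.0)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def first_constraint_lhs(x, feats_M, feats_F, kappa):
    """E_M[(a^T x + 0.5)_+] + E_F[(-a^T x + 0.5)_+] / kappa with empirical means."""
    mean_M = sum(pos(dot(a, x) + 0.5) for a in feats_M) / len(feats_M)
    mean_F = sum(pos(-dot(a, x) + 0.5) for a in feats_F) / len(feats_F)
    return mean_M + mean_F / kappa


kappa = 0.95
feats_M = [[1.0, 0.0]]  # made-up sample standing in for D_M
feats_F = [[0.0, 1.0]]  # made-up sample standing in for D_F

lhs_zero = first_constraint_lhs([0.0, 0.0], feats_M, feats_F, kappa)
# A one-sided model: scores the D_M sample positive and the D_F sample negative.
lhs_one_sided = first_constraint_lhs([3.0, -3.0], feats_M, feats_F, kappa)
```

At ${\mathbf{x}}=\mathbf{0}$ the left hand side equals $0.5+0.5/\kappa$ , which is at most $1/\kappa$ for any $\kappa\in(0,1]$ , whereas the one-sided model above exceeds the bound.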

For testing, we considered the “a9a” dataset, also used by Goh et al. (2016) and another dataset dubbed “LoanStats” from Lending Club (2019). We chose $\lambda=5$ , $\kappa=0.95$ , and $\phi(z)=(1-z)_{+}$ in each case. The distributions $\mathcal{D}$ , $\mathcal{D}_{M}$ , and $\mathcal{D}_{F}$ were defined as empirical distributions based on each dataset as described below. The goal in the a9a dataset is to predict people making more than 50,000 USD. Following Goh et al. (2016), we used the 32,561 training instances ( $\mathcal{D}$ ) and the 16,281 testing instances in the dataset to construct the objective function and constraints, respectively. Since we need male and female subsets to construct constraints, we further split the testing data into 14,720 male instances ( $\mathcal{D}_{M}$ ) and 1,561 female instances ( $\mathcal{D}_{F}$ ). The LoanStats dataset contains information of $128,375$ loans issued in the fourth quarter of 2018 and the goal is to predict if a loan will be approved or rejected. After creating dummy variables, each loan is represented by a feature vector of 250 dimensions. We randomly partitioned the dataset into a set of $63,890$ loans ( $\mathcal{D}$ ) used to construct the objective function and a set of $64,485$ loans used to build the constraints. We further split the second set based on whether the feature “homeOwnership” equals “Mortgage” ( $\mathcal{D}_{M}$ ) or some other value ( $\mathcal{D}_{F}$ ) to obtain 31,966 and 32,519 loans in two subsets, respectively.

All methods are initialized at $\tilde{\mathbf{x}}=\mathbf{0}$ , which is feasible for (18). In SFLS and DFLS, we chose $r^{(0)}=1$ , $\gamma_{t}=0.1/\sqrt{t+1}$ and $\theta=1.1$ across all datasets. Note that $r^{(0)}=1=f_{0}(\mathbf{0})\geq f^{*}$ . In SFLS, we chose $T=300$ and $T=200$ for a9a and LoanStats datasets, respectively. In DFLS, we chose $T=100$ and $T=50$ for a9a and LoanStats datasets, respectively. Both SFLS and YNW employed a mini-batch size of $500$ and $1000$ for a9a and LoanStats datasets, respectively. Similar to §6.3, we chose the number of iterations in YNW so that its total number of data passes is 300. We then terminated SFLS and DFLS when the total number of data passes they performed exceeded $300$ .

Figure 3 displays the performance of SFLS, YNW, and DFLS as a function of data passes. The interpretation of the axes and line markers in this figure is analogous to that in Figure 2. In this figure, $f^{*}$ is approximated by the objective value returned by DFLS after a sufficient number of data passes (i.e., at least $5000$ data passes with $2T$ inner iterations). On the a9a dataset, SFLS maintains a feasible solution path, as expected, while the YNW solutions are initially infeasible and become feasible with more data passes. YNW reduces the optimality gap more rapidly at the beginning while SFLS catches up quickly. The objective function value of YNW cannot be interpreted as an optimality gap when its solutions are infeasible since the corresponding objective function value can be superoptimal. This feature is clearly visible on the LoanStats data. Here most of the YNW solutions are infeasible and superoptimal, that is, $f_{0}({\mathbf{x}})-f^{*}$ is non-positive. The SFLS solution path continues to be feasible and suboptimal on this dataset, with its suboptimality decreasing consistently after each outer iteration. DFLS also produces a feasible path but does not effectively reduce the optimality gap because its data complexity is high, that is, it requires a large number of data passes to achieve a small optimality gap. Similar to §6.3, we once again find that the low data complexity of SFLS is critical to balance optimality and feasibility when solving an SOEC.

7 Conclusion

We consider constrained optimization models where both the objective function and multiple constraints contain expectations of random convex functions. These models, referred to as stochastic optimization problems with expectation constraints (SOECs), arise in several machine learning, engineering, and business applications. We develop a stochastic feasible level-set method (SFLS) to solve SOECs, propose a tractable oracle to be used with SFLS, and analyze related iteration complexities. SFLS’s total iteration complexity is comparable to stochastic subgradient methods in terms of $\epsilon$ but depends on a condition number – the cost of requiring feasibility. We evaluate the performance of SFLS across three applications involving approximate linear programming, multi-class classification, and learning classifiers with fairness constraints. We find that SFLS exhibits key advantages over existing methods. First, it ensures a feasible solution path with high probability while an existing state-of-the-art stochastic subgradient method can return highly infeasible solutions when terminated before conservative termination criteria are met. Infeasibilities may void the use of a solution in practice, especially if constraints model implementation requirements. Thus, the ability of SFLS to compute feasible solutions before convergence is practically relevant. Second, SFLS computes feasible solutions with small optimality gaps using only a few data passes owing to its low data complexity, which is a desirable property when expectations are defined using large datasets that are expensive to scan. In contrast to SFLS, a recent deterministic feasible level set method exhibits high data complexity and large optimality gaps. Our theoretical and numerical findings bode well for the use of SFLS to solve SOECs and motivate further research into stochastic first order methods that emphasize feasibility.

8 Proofs of Theoretical Results

In this section, we provide the proofs of all technical results in the paper.

Proof 8.1

Proof of Lemma 2.3: Since $r>f^{*}$ , it follows from Lemma 2.1(c) that $H(r)\leq 0$ . Therefore, since $\theta\geq 1$ we have $\epsilon\leq-\frac{\theta-1}{\theta+1}H(r)\leq-H(r)$ . Moreover, by Definition 2.2, we have $\mathcal{P}(r,\hat{\mathbf{x}})\leq H(r)+\epsilon$ with probability of at least $1-\delta$ , which implies that $\hat{\mathbf{x}}$ is a feasible solution to (2) since $\mathcal{P}(r,\hat{\mathbf{x}})\leq H(r)+\epsilon\leq H(r)-H(r)\leq 0$ . \halmos

Proof of Theorem 2.5 depends on the following lemma.

Lemma 8.2

Given an input tuple $(r,\epsilon,\delta,\theta)$ , a stochastic oracle $\mathcal{A}(r,\epsilon,\delta)$ with $0<\epsilon\leq-\frac{\theta-1}{\theta+1}H(r)$ returns $U(r)$ and $\hat{\mathbf{x}}$ such that $\theta U(r)\leq H(r)\leq\mathcal{P}(r,\hat{\mathbf{x}})\leq U(r)/\theta$ with probability of at least $1-\delta$ .

Proof 8.3

Proof. The inequality $H(r)\leq\mathcal{P}(r,\hat{\mathbf{x}})$ holds by definition of $H(r)$ . By definition of stochastic oracle (Definition 2.2) and the property of $\epsilon$ , it follows that $\mathcal{P}(r,\hat{\mathbf{x}})\leq H(r)+\epsilon\leq\frac{2}{\theta+1}H(r)$ , $H(r)\leq U(r)+\epsilon\leq U(r)-\frac{\theta-1}{\theta+1}H(r)$ , and $U(r)\leq H(r)+\epsilon\leq\frac{2}{\theta+1}H(r)$ hold with probability of at least $1-\delta$ . Since $r>f^{*}$ , Lemma 2.1(c) implies that $H(r)\leq 0$ . Therefore, using the inequality $U(r)\leq\frac{2}{\theta+1}H(r)$ we get $U(r)\leq 0$ and $\theta U(r)\leq\frac{\theta+1}{2}U(r)\leq H(r)$ since $\theta>1$ . Finally, combining the inequalities $\mathcal{P}(r,\hat{\mathbf{x}})\leq\frac{2}{\theta+1}H(r)$ and $H(r)\leq U(r)-\frac{\theta-1}{\theta+1}H(r)$ (or equivalently $H(r)\leq\frac{\theta+1}{2\theta}U(r)$ ), we get $\mathcal{P}(r,\hat{x})\leq\frac{2}{\theta+1}\cdot\frac{\theta+1}{2\theta}U(r)=U(r)/\theta$ . \halmos

In the proof of Theorem 2.5 we need the following property of the condition measure $\beta$ . In particular, it can be easily verified from the convexity of $H(r)$ and $H(r)-\delta\leq H(r+\delta)\leq H(r)$ for any $\delta\geq 0$ (Lemma 2.3.5 in Nesterov 2004) that $\frac{H(r)}{r-f^{*}}$ is monotonically increasing in $r$ on $(f^{*},r^{(0)}]$ and

[TABLE]

Proof 8.4

Proof of Theorem 2.5: We first show that Algorithm 1 generates a feasible solution at each iteration with high probability. Let $K$ be the largest value of $k$ such that $r^{(k)}>f^{*}$ and the following inequality holds:

[TABLE]

Notice that $K\geq 0$ since $0<\epsilon\leq 1\leq 2\theta^{2}$ and $H(r^{(0)})\leq 0$ . It follows from Lemma 8.2 that with a probability of at least $1-\delta^{(k)}$ we have,

[TABLE]

Since $r^{(k+1)}=r^{(k)}+U(r^{(k)})/(2\theta)$ , we have

[TABLE]

and

[TABLE]

with a probability of at least $1-\delta^{(k)}$ , where the last inequalities in both (23) and (24) follow from (20). Inequality (23) and the condition $r^{(k)}>f^{*}$ imply that $r^{(k+1)}>f^{*}$ . Applying this argument recursively and using the fact that $\sum_{k=0}^{\infty}\delta^{(k)}=\delta$ , we have that (23), (24) and $r^{(k+1)}>f^{*}$ hold for $k=0,1,\ldots,K$ . Therefore, since $\epsilon_{\mathcal{A}}\leq-\frac{\theta-1}{\theta+1}H(r^{(k)})\leq-H(r^{(k)})$ for $k=0,1,\ldots,K$ , Lemma 2.3 implies the solution ${\mathbf{x}}^{(k)}$ generated at iteration $k=0,1,\ldots,K$ is feasible to (1) with a probability of at least $1-\delta$ . We next show that (21) holds with a high probability until Algorithm 1 terminates. By the definition of $K$ , we know that (21) is violated when $k=K+1$ , i.e. $-\frac{\theta-1}{2\theta^{2}(\theta+1)}H(r^{(0)})\epsilon>-\frac{\theta-1}{\theta+1}H(r^{(K+1)})$ . Since $r^{(k+1)}\leq r^{(k)}$ and $\frac{H(r)}{r-f^{*}}$ is monotonically increasing, we can show that

[TABLE]

where the last inequality holds by (23). Using the definition of $\epsilon_{\text{opt}}$ , (25), and (22) for $k=K$ (specifically, $H(r^{(K)})\leq U(r^{(K)})/\theta$ ), we have

[TABLE]

which indicates that Algorithm 1 must stop before $k=K+1$ . Therefore, SFLS generates a feasible solution with a probability of at least $1-\delta$ at each iteration before termination.

We now proceed to establish that the terminal solution of SFLS is a relative $\epsilon$ -optimal solution. By definition of $\mathcal{P}(r^{(k)},{\mathbf{x}}^{(k)})$ and (22) it follows that $f_{0}({\mathbf{x}}^{(k)})-r^{(k)}\leq\mathcal{P}(r^{(k)},{\mathbf{x}}^{(k)})\leq H(r^{(k)})/\theta^{2}\leq 0$ for all $k$ . Hence,

[TABLE]

Combining (26) and $r^{(k)}-f^{*}\leq(r^{(0)}-f^{*})H(r^{(k)})/H(r^{(0)})$ derived from (20) stipulates that with a probability of at least $1-\delta$ :

[TABLE]

where we used (22) in the second inequality. Hence, at termination of Algorithm 1 we get $\frac{f_{0}({\mathbf{x}}^{(k)})-f^{*}}{r^{(0)}-f^{*}}\leq\epsilon$ since the algorithm stops when $\theta U(r^{(k)})\geq H(r^{(0)})\epsilon$ .

Finally we show that $K:=\dfrac{2\theta^{2}}{\beta}\ln\left(\dfrac{\theta^{2}}{\beta\epsilon}\right)$ . By recursively applying inequality (24) we get

[TABLE]

with probability of at least $1-\delta$ , which implies $r^{(K)}-f^{*}\leq-\frac{H(r^{(0)})\epsilon}{\theta^{2}}$ for the choice of $K$ . Hence, we have $-U(r^{(K)})\leq-\theta H(r^{(K)})\leq\theta(r^{(K)}-f^{*})\leq-\epsilon H(r^{(0)})/\theta$ where the first inequality follows by (22), the second by (20), and the third by (27). This indicates that the stopping criterion of Algorithm 1 holds with a probability of at least $1-\delta$ when $k=K$ and SFLS requires at most $K$ calls to oracle $\mathcal{A}$ . \halmos
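For a sense of scale, the bound $K=\dfrac{2\theta^{2}}{\beta}\ln\left(\dfrac{\theta^{2}}{\beta\epsilon}\right)$ on the number of oracle calls can be evaluated numerically. In the sketch below the value of $\beta$ is purely illustrative, and rounding up to an integer is our own convenience rather than part of the result.

```python
import math


def outer_iteration_bound(theta, beta, eps):
    """Evaluate K = (2 theta^2 / beta) * ln(theta^2 / (beta * eps)), rounded up."""
    return math.ceil(2.0 * theta**2 / beta * math.log(theta**2 / (beta * eps)))


# theta = 1.1 as in the experiments; beta = 0.5 and eps = 0.01 are illustrative.
K_example = outer_iteration_bound(theta=1.1, beta=0.5, eps=0.01)
```

The bound grows like $1/\beta$ up to a logarithmic factor, so a poorly conditioned instance (small $\beta$ ) requires more outer iterations.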

Proof 8.5

Proof of Proposition 3.1: The proof of the first part directly follows from Proposition 3.2 in Nemirovski et al. (2009). We only show that SMD is a valid oracle. It is straightforward to see that the inequality $U(\bar{x}^{(t)})-L(\bar{y}^{(t)})\leq\epsilon_{\mathcal{A}}$ implies $\mathcal{P}(r,\bar{x}^{(t)})-H(r)\leq U(\bar{x}^{(t)})-H(r)\leq U(\bar{x}^{(t)})-L(\bar{y}^{(t)})\leq\epsilon_{\mathcal{A}}$ , where the first inequality holds since $U(\bar{x}^{(t)})$ is an upper bound on $\mathcal{P}(r,\bar{x}^{(t)})$ and the second since $L(\bar{y}^{(t)})$ is a lower bound on $H(r)$ . This indicates that the conditions provided in Definition 2.2 are satisfied. \halmos

To show part (i) of Proposition 4.1, we use the known Lemmas 8.6 and 8.7 and prove Lemmas 8.8 and 8.10. To prove part (ii) of this proposition we need Lemma 8.12. Before stating these lemmas, we present some required notation and representations. We denote the diameter of $\mathcal{Z}$ with respect to $\omega_{z}$ by

[TABLE]

In addition, for any ${\bm{\zeta}}_{x}\in\mathbb{R}^{d}$ , ${\bm{\zeta}}_{y}\in\mathbb{R}^{m+1}$ , ${\mathbf{x}}^{\prime}\in\mathcal{X}^{o}$ , ${\mathbf{y}}^{\prime}\in\mathcal{Y}^{o}$ , and ${\mathbf{z}}^{\prime}=({\mathbf{x}}^{\prime},{\mathbf{y}}^{\prime})\in\mathcal{Z}^{o}$ , it is easy to verify for ${\bm{\zeta}}=({\bm{\zeta}}_{x},{\bm{\zeta}}_{y})$ that

[TABLE]

where $P_{{\mathbf{x}}^{\prime}}^{x}({\bm{\zeta}}_{x}):=\argmin_{{\mathbf{x}}\in\mathcal{X}}\{{\bm{\zeta}}_{x}^{\top}({\mathbf{x}}-{\mathbf{x}}^{\prime})+V_{x}({\mathbf{x}}^{\prime},{\mathbf{x}})\}$ and $P_{{\mathbf{y}}^{\prime}}^{y}({\bm{\zeta}}_{y}):=\argmin_{{\mathbf{y}}\in\mathcal{Y}}\{{\bm{\zeta}}_{y}^{\top}({\mathbf{y}}-{\mathbf{y}}^{\prime})+V_{y}({\mathbf{y}}^{\prime},{\mathbf{y}})\}$ .
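As a concrete instance of the prox-mapping $P^{x}_{{\mathbf{x}}^{\prime}}$ defined above, the sketch below assumes a specific setup that the text does not fix: $\mathcal{X}$ is the probability simplex and $\omega_{x}$ is the negative entropy, so that $V_{x}({\mathbf{x}}^{\prime},{\mathbf{x}})$ is the Kullback-Leibler divergence and the minimizer has the closed form $x_{i}\propto x^{\prime}_{i}\exp(-\zeta_{i})$. All function names are hypothetical.

```python
import math


def kl_divergence(x, x_prev):
    """V(x_prev, x) for the entropy distance-generating function on the simplex."""
    return sum(xi * math.log(xi / pi) for xi, pi in zip(x, x_prev) if xi > 0)


def prox_objective(x, x_prev, zeta):
    """zeta^T (x - x_prev) + V(x_prev, x), the function the prox-mapping minimizes."""
    linear = sum(z * (xi - pi) for z, xi, pi in zip(zeta, x, x_prev))
    return linear + kl_divergence(x, x_prev)


def entropy_prox(x_prev, zeta):
    """Closed-form minimizer over the simplex: x_i proportional to x_prev_i * exp(-zeta_i).

    x_prev must have strictly positive entries (a point of the relative interior).
    """
    logits = [math.log(pi) - z for pi, z in zip(x_prev, zeta)]
    shift = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    x_prev = [0.2, 0.3, 0.5]
    zeta = [1.0, -0.5, 0.0]
    x_new = entropy_prox(x_prev, zeta)
    print(x_new, prox_objective(x_new, x_prev, zeta))
```

With a Euclidean distance-generating function the same mapping would instead reduce to a projected gradient step; the entropy choice is used here only because its minimizer is available in closed form.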

Lemma 8.6 (Equation (2.37) and Lemma 6.1 in Nemirovski et al. 2009)

1. Let ${\bm{\zeta}}_{x}^{(t)}\in\mathbb{R}^{d}$ , $t=0,1,2,\ldots$ be a set of random variables, ${\mathbf{v}}^{(0)}\in\mathcal{X}^{o}$ and ${\mathbf{v}}^{(t+1)}=P^{x}_{{\mathbf{v}}^{(t)}}({\bm{\zeta}}_{x}^{(t)})$ for $t=0,1,2,\ldots$ . For any ${\mathbf{v}}\in\mathcal{X}$ and $t\geq 1$ , we have

[TABLE]

2. Let ${\bm{\zeta}}_{y}^{(t)}\in\mathbb{R}^{m+1}$ , $t=0,1,2,\ldots$ be a set of random variables, ${\mathbf{v}}^{(0)}\in\mathcal{Y}^{o}$ and ${\mathbf{v}}^{(t+1)}=P^{y}_{{\mathbf{v}}^{(t)}}({\bm{\zeta}}_{y}^{(t)})$ for $t=0,1,2,\ldots$ . For any ${\mathbf{v}}\in\mathcal{Y}$ and $t\geq 1$ , we have

[TABLE]

3. Let ${\bm{\zeta}}^{(t)}\in\mathbb{R}^{d+m+1}$ , $t=0,1,2,\ldots$ be a set of random variables, ${\mathbf{v}}^{(0)}\in\mathcal{Z}^{o}$ and ${\mathbf{v}}^{(t+1)}=P_{{\mathbf{v}}^{(t)}}({\bm{\zeta}}^{(t)})$ for $t=0,1,2,\ldots$ . For any ${\mathbf{v}}\in\mathcal{Z}$ and $t\geq 1$ , we have

[TABLE]

Lemma 8.7 (Lemma 2 in Lan et al. 2012)

Let ${\bm{\xi}}^{(t)}$ and $\sigma_{t}>0$ for $t=0,1,2,\ldots$ be respectively a sequence of i.i.d. random variables and deterministic numbers; ${\bm{\xi}}^{[t]}=({\bm{\xi}}^{(0)},{\bm{\xi}}^{(1)},\dots,{\bm{\xi}}^{(t)})$ ; $\mathbb{E}_{t}$ the conditional expectation conditioning on ${\bm{\xi}}^{[t-1]}$ for $t\geq 1$ ; and $\psi_{t}({\bm{\xi}}^{[t]})$ be a measurable function of ${\bm{\xi}}^{[t]}$ such that either

Case A: $\mathbb{E}_{t}\left[\psi_{t}\left({\bm{\xi}}^{[t]}\right)\right]=0$ and $\mathbb{E}_{t}\left[\exp\left(\psi_{t}\left({\bm{\xi}}^{[t]}\right)^{2}/\sigma_{t}^{2}\right)\right]\leq\exp(1)$ , or
Case B: $\mathbb{E}_{t}\left[\exp\left(\left|\psi_{t}\left({\bm{\xi}}^{[t]}\right)\right|/\sigma_{t}\right)\right]\leq\exp(1)$ ,

almost surely for all $t$ . Then for any $\Omega>0$ , we have the following:

In case A:

[TABLE]

In case B:

[TABLE]

where $\sigma^{[t]}=(\sigma_{0},\sigma_{1},\dots,\sigma_{t})^{\top}$ .

Lemma 8.8 shows that the stochastic subgradient $G(\cdot,\cdot,\cdot)$ has a light-tailed distribution and bounds the Bregman distances. Define

[TABLE]

Lemma 8.8

The following inequalities hold:

[TABLE]

Moreover, when ${\mathbf{z}}^{\prime}=({\mathbf{x}}^{\prime},{\mathbf{y}}^{\prime}):=\argmin_{{\mathbf{z}}\in\mathcal{Z}}\omega_{z}({\mathbf{z}})$ , we have

[TABLE]

Proof 8.9

Proof. Applying Jensen’s inequality and using the definitions of $\|\cdot\|_{*,z}$ , $M$ , and the inequalities (8) and (9), we have

[TABLE]

Using (36) and Jensen’s inequality, it follows that

[TABLE]

Hence, we have

[TABLE]

where the first inequality follows from the definition of $\Delta_{t}$ and the inequality $\|a+b\|^{2}\leq 2a^{2}+2b^{2}$ for any $a,b\in\mathbb{R}$ , the second from (37), the third from Jensen’s inequality for concave functions, and the fourth by inequalities (8) and (9). Following a similar argument, we can also show that $\mathbb{E}_{t}\left[\exp\left(\|\Delta_{t}^{x}\|_{*,x}^{2}/(2M_{x})^{2}\right)\right]\leq\exp(1)$ and $\mathbb{E}_{t}\left[\exp\left(\|\Delta_{t}^{y}\|_{*,y}^{2}/(2M_{y})^{2}\right)\right]\leq\exp(1)$ . Finally, inequalities (33), (34), and (35) follow because $\omega_{x}$ , $\omega_{y}$ and $\omega_{z}$ are strongly convex with moduli $\alpha_{x}$ , $\alpha_{y}$ and $1$ , respectively. \halmos

Lemma 8.10

Let $\nu_{s,t}:=\dfrac{\gamma_{s}}{\sum_{s^{\prime}=0}^{t}\gamma_{s^{\prime}}}$ . Given $\Omega>0$ , Algorithm 3 computes $({\mathbf{x}}^{(t)},{\mathbf{y}}^{(t)})$ , $t=1,2,3,\ldots,$ such that

[TABLE]

Proof 8.11

Proof. Since ${\mathbf{z}}^{(0)}\in\argmin_{{\mathbf{z}}\in\mathcal{Z}}\omega_{z}({\mathbf{z}})$ and ${\mathbf{z}}^{(t+1)}=P_{{\mathbf{z}}^{(t)}}(\gamma_{t}G({\mathbf{x}}^{(t)},{\mathbf{y}}^{(t)},{\bm{\xi}}^{(t)}))$ in Algorithm 3, by Lemma 8.6 we have, for any ${\mathbf{z}}\in\mathcal{Z}$ ,

[TABLE]

where the second inequality follows by (35). In addition, by definition of $\Delta_{t}$ , for any ${\mathbf{z}}\in\mathcal{Z}$ we have

[TABLE]

Applying the preceding relation to (41) and reorganizing terms lead to

[TABLE]

Maximizing both sides of the above inequality over ${\mathbf{z}}\in\mathcal{Z}$ implies

[TABLE]

Let ${\mathbf{v}}^{(0)}={\mathbf{z}}^{(0)}$ and ${\mathbf{v}}^{(t+1)}=P_{{\mathbf{v}}^{(t)}}(-\gamma_{t}\Delta_{t})$ for $t=0,1,2,\ldots$ . From Lemma 8.6 it follows that for any ${\mathbf{z}}\in\mathcal{Z}$ ,

[TABLE]

Rewriting ${\mathbf{z}}-{\mathbf{z}}^{(s)}={\mathbf{v}}^{(s)}-{\mathbf{z}}^{(s)}+{\mathbf{z}}-{\mathbf{v}}^{(s)}$ and applying (43) to (42) yield

[TABLE]

We next find a probabilistic bound for the right hand side of the above inequality.

Bound on $\frac{\sum_{s=0}^{t}\gamma_{s}({\mathbf{v}}^{(s)}-{\mathbf{z}}^{(s)})^{\top}\Delta_{s}}{\sum_{s=0}^{t}\gamma_{s}}$ : By our choice of ${\mathbf{z}}^{(0)}$ , i.e., ${\mathbf{z}}^{(0)}=\argmin_{{\mathbf{z}}\in\mathcal{Z}}\omega_{z}({\mathbf{z}})$ , and (35), for any $s=0,1,\ldots,t$ we have

[TABLE]

Define $\psi_{s}:=\nu_{s,t}({\mathbf{v}}^{(s)}-{\mathbf{z}}^{(s)})^{\top}\Delta_{s}$ and $\sigma_{s}:=4\sqrt{2}M\nu_{s,t}$ . Because ${\bm{\xi}}^{(s)}$ is independent of ${\mathbf{v}}^{(s)}$ and ${\mathbf{z}}^{(s)}$ , we have $\mathbb{E}_{s}[\psi_{s}]=0$ . In addition, it can be verified that $\psi_{s}^{2}\leq\nu_{s,t}^{2}\|{\mathbf{v}}^{(s)}-{\mathbf{z}}^{(s)}\|_{z}^{2}\|\Delta_{s}\|_{*,z}^{2}\leq 8\nu_{s,t}^{2}\|\Delta_{s}\|_{*,z}^{2},$ where the second inequality holds by (45). Using this inequality and (38), we get $\mathbb{E}_{s}\left[\exp\left(\psi_{s}^{2}/\sigma_{s}^{2}\right)\right]\leq\mathbb{E}_{s}\left[\exp\left(\|\Delta_{s}\|_{*,z}^{2}/(2M)^{2}\right)\right]\leq\exp(1)$ . Hence, it follows from Case A in Lemma 8.7 that

[TABLE]
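The constant in $\sigma_{s}=4\sqrt{2}M\nu_{s,t}$ used for this first bound can be checked with one line of arithmetic. Since $\sigma_{s}^{2}=32M^{2}\nu_{s,t}^{2}$ and $\psi_{s}^{2}\leq 8\nu_{s,t}^{2}\|\Delta_{s}\|_{*,z}^{2}$ , we have

$$\frac{\psi_{s}^{2}}{\sigma_{s}^{2}}\leq\frac{8\nu_{s,t}^{2}\|\Delta_{s}\|_{*,z}^{2}}{32M^{2}\nu_{s,t}^{2}}=\frac{\|\Delta_{s}\|_{*,z}^{2}}{(2M)^{2}},$$

which is exactly the exponent controlled by (38), so the light-tail condition of Case A holds with this $\sigma_{s}$ .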

Bound on $\frac{\sum_{s=0}^{t}\gamma_{s}^{2}\left(\left\|G({\mathbf{x}}^{(s)},{\mathbf{y}}^{(s)},{\bm{\xi}}^{(s)})\right\|_{*,z}^{2}+\left\|\Delta_{s}\right\|_{*,z}^{2}\right)}{\sum_{s=0}^{t}\gamma_{s}}$ : Define $\psi_{s}:=\gamma_{s}\nu_{s,t}\big{(}\left\|G({\mathbf{x}}^{(s)},{\mathbf{y}}^{(s)},{\bm{\xi}}^{(s)})\right\|_{*,z}^{2}+\left\|\Delta_{s}\right\|_{*,z}^{2}\big{)}$ and $\sigma_{s}:=5M^{2}\gamma_{s}\nu_{s,t}$ . We then have

[TABLE]

where the first inequality is from Jensen’s inequality and the second inequality is from (36) and (38). Hence, from Case B in Lemma 8.7 it follows that

[TABLE]

The conclusion is hence obtained by upper bounding the right-hand side of (44) using the union bound of (46) and (47). \halmos

Lemma 8.12

Let $\nu_{s,t}:=\dfrac{\gamma_{s}}{\sum_{s^{\prime}=0}^{t}\gamma_{s^{\prime}}}$ . Given $\Omega>0$ , Algorithm 3 guarantees that

[TABLE]

and

[TABLE]

Proof 8.13

Proof. Since the proofs of (48) and (49) are very similar, we will only prove (48). Let

[TABLE]

and

[TABLE]

Define $\delta_{t}:=\Phi({\mathbf{x}}^{(t)},{\mathbf{y}}^{(t)},{\bm{\xi}}^{(t)})-\phi({\mathbf{x}}^{(t)},{\mathbf{y}}^{(t)})$ . Using this definition and those of $l^{(t)}_{*}=\min_{{\mathbf{x}}\in\mathcal{X}}l^{(t)}({\mathbf{x}})$ , $\hat{l}^{(t)}_{*}=\min_{{\mathbf{x}}\in\mathcal{X}}\hat{l}^{(t)}({\mathbf{x}})$ , and $\Delta_{t}$ we have

[TABLE]

By (28) and line 5 of Algorithm 2, we have ${\mathbf{x}}^{(t+1)}=P^{x}_{{\mathbf{x}}^{(t)}}(2D_{x}^{2}\gamma_{t}G_{x}({\mathbf{x}}^{(t)},{\mathbf{y}}^{(t)},{\bm{\xi}}^{(t)}))$ . Let ${\mathbf{w}}^{(0)}={\mathbf{v}}^{(0)}={\mathbf{x}}^{(0)}$ , ${\mathbf{w}}^{(t+1)}:=P^{x}_{{\mathbf{w}}^{(t)}}(-2D_{x}^{2}\gamma_{t}\Delta_{t}^{x})$ and ${\mathbf{v}}^{(t+1)}:=P^{x}_{{\mathbf{v}}^{(t)}}(2D_{x}^{2}\gamma_{t}\Delta_{t}^{x})$ for $t=0,1,2,\ldots$ . From Lemma 8.6 and (33) it follows that

[TABLE]

Writing ${\mathbf{x}}-{\mathbf{x}}^{(s)}={\mathbf{x}}-{\mathbf{w}}^{(s)}+{\mathbf{w}}^{(s)}-{\mathbf{x}}^{(s)}$ and ${\mathbf{x}}^{(s)}-{\mathbf{x}}={\mathbf{v}}^{(s)}-{\mathbf{x}}+{\mathbf{x}}^{(s)}-{\mathbf{v}}^{(s)}$ , these two inequalities imply

[TABLE]

Hence,

[TABLE]

Applying (51) in (50), we get

[TABLE]

We next find a probabilistic bound for the right hand side of the above inequality.

Bounds on $\left|\dfrac{\sum_{s=0}^{t}\gamma_{s}({\mathbf{w}}^{(s)}-{\mathbf{x}}^{(s)})^{\top}\Delta_{s}^{x}}{\sum_{s=0}^{t}\gamma_{s}}\right|$ and $\left|\dfrac{\sum_{s=0}^{t}\gamma_{s}({\mathbf{x}}^{(s)}-{\mathbf{v}}^{(s)})^{\top}\Delta_{s}^{x}}{\sum_{s=0}^{t}\gamma_{s}}\right|$ : The inequality (33) indicates that

[TABLE]

Define $\psi_{s}:=\nu_{s,t}({\mathbf{w}}^{(s)}-{\mathbf{x}}^{(s)})^{\top}\Delta_{s}^{x}$ and $\sigma_{s}:=\dfrac{4\sqrt{2}D_{x}M_{x}\nu_{s,t}}{\sqrt{\alpha_{x}}}$ . Since ${\bm{\xi}}^{(s)}$ is independent of ${\mathbf{w}}^{(s)}$ and ${\mathbf{x}}^{(s)}$ , we have $\mathbb{E}_{s}[\psi_{s}]=0$ . Furthermore,

[TABLE]

where the second inequality follows from (53). Using the definition of $\delta_{s}$ , (54), and (31)-(32), it follows that $\mathbb{E}_{s}[\exp(\psi_{s}^{2}/\sigma_{s}^{2})]\leq\exp(1)$ . Hence, by Case A in Lemma 8.7 and the union bound, we get

[TABLE]

With a similar argument, we can also show

[TABLE]

Bound on $\dfrac{\sum_{s=0}^{t}\gamma_{s}^{2}\|\Delta_{s}^{x}\|_{*,x}^{2}}{\sum_{s=0}^{t}\gamma_{s}}$ : Define $\psi_{s}:=\gamma_{s}\nu_{s,t}\|\Delta_{s}^{x}\|_{*,x}^{2}$ and $\sigma_{s}:=4M_{x}^{2}\gamma_{s}\nu_{s,t}$ . Using (31)-(32), it is easy to verify that $\mathbb{E}_{s}\left[\exp\left(|\psi_{s}|/\sigma_{s}\right)\right]\leq\exp(1)$ . Hence, from Case B in Lemma 8.7 we have

[TABLE]

Bound on $\left|\dfrac{\sum_{s=0}^{t}\gamma_{s}\delta_{s}}{\sum_{s=0}^{t}\gamma_{s}}\right|$ : From the definition of $\delta_{s}$ and (10), it follows that

[TABLE]

Hence, by Case A in Lemma 8.7 and the union bound, we get

[TABLE]

The conclusion can then be obtained by upper bounding the right-hand side of (52) using the union bound of (55), (56), (57), and (58). \halmos

Proof 8.14

Proof of Proposition 4.1: (i) The definition of $\Omega(\delta)$ in (12) guarantees $\exp(-\Omega(\delta)^{2}/3)+\exp(-\Omega(\delta)^{2}/12)+\exp(-3\Omega(\delta)/4)\leq\delta$ . Recall that $\nu_{s,t}=\dfrac{\gamma_{s}}{\sum_{s^{\prime}=0}^{t}\gamma_{s^{\prime}}}$ . With $\gamma_{s}=\dfrac{1}{M\sqrt{s+1}}$ , it is straightforward to verify the following inequalities:

[TABLE]

Applying these four inequalities to bound the terms in (39), we get

[TABLE]
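The step sizes $\gamma_{s}=\frac{1}{M\sqrt{s+1}}$ and the averaging weights $\nu_{s,t}$ can be explored numerically. The sketch below computes both and checks two standard integral-comparison bounds, $\sum_{s=0}^{t}\gamma_{s}\geq\frac{2\sqrt{t+2}-2}{M}$ and $\sum_{s=0}^{t}\gamma_{s}^{2}\leq\frac{1+\ln(t+1)}{M^{2}}$ . These two bounds are consistent with the ratio $\frac{1+\ln(t+1)}{2\sqrt{t+2}-2}$ that appears in this proof, but whether they coincide with the displayed inequalities (59)-(62) is an assumption. The function names and the value of $M$ are illustrative.

```python
import math


def step_sizes(M, t):
    """gamma_s = 1 / (M * sqrt(s + 1)) for s = 0, ..., t."""
    return [1.0 / (M * math.sqrt(s + 1)) for s in range(t + 1)]


def averaging_weights(gammas):
    """nu_{s,t} = gamma_s / sum_{s'} gamma_{s'}; the weights sum to one."""
    total = sum(gammas)
    return [g / total for g in gammas]


def check_bounds(M, t):
    """Return (sum gamma, lower bound, sum gamma^2, upper bound)."""
    gammas = step_sizes(M, t)
    lower = (2.0 * math.sqrt(t + 2) - 2.0) / M      # integral of x^(-1/2) on [1, t+2]
    upper = (1.0 + math.log(t + 1)) / M**2          # harmonic sum bound
    return sum(gammas), lower, sum(g * g for g in gammas), upper


if __name__ == "__main__":
    M = 2.0  # illustrative value only
    for t in (1, 10, 1000):
        s1, lo, s2, up = check_bounds(M, t)
        nu = averaging_weights(step_sizes(M, t))
        print(t, s1 >= lo, s2 <= up, abs(sum(nu) - 1.0) < 1e-9)
```

Because $\gamma_{s}$ decreases in $s$, the weights $\nu_{s,t}$ put more mass on early iterates; the ratio of the two sums is what drives the $\ln(t)/\sqrt{t}$ rate.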

Given $\epsilon_{\mathcal{A}}>0$ , let $\epsilon^{\prime}:=\epsilon_{\mathcal{A}}/\left(10\Omega(\delta)M+4.5M\right)$ . When $t\geq\max\left\{6,\left(\dfrac{8\ln(4/\epsilon^{\prime})}{\epsilon^{\prime}}\right)^{2}-2\right\}$ , we have $\dfrac{1+\ln(t+1)}{2\sqrt{t+2}-2}\leq\dfrac{2\ln(t+2)}{\sqrt{t+2}}$ and $\dfrac{2\ln(t+2)}{\sqrt{t+2}}$ is monotonically decreasing in $t$ . Hence

[TABLE]

Using the above inequality in (63) we get $\text{Prob}\left\{u^{(t)}_{*}-l^{(t)}_{*}>\left(10\Omega(\delta)M+4.5M\right)\epsilon^{\prime}=\epsilon_{\mathcal{A}}\right\}\leq\delta$ which completes the proof.
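The iteration threshold used in part (i) can be sanity-checked numerically. The sketch below evaluates $t\geq\max\{6,(8\ln(4/\epsilon^{\prime})/\epsilon^{\prime})^{2}-2\}$ for a few values of $\epsilon^{\prime}$ and confirms, for those values, both the stated comparison $\frac{1+\ln(t+1)}{2\sqrt{t+2}-2}\leq\frac{2\ln(t+2)}{\sqrt{t+2}}$ and that the right-hand side has dropped below $\epsilon^{\prime}$ at the threshold. That the latter is what the proof needs is our reading of the argument; the function names are hypothetical.

```python
import math


def min_iterations(eps_prime):
    """Smallest integer t with t >= max(6, (8 ln(4/eps')/eps')^2 - 2)."""
    if not 0.0 < eps_prime < 4.0:
        raise ValueError("this sketch assumes 0 < eps' < 4 so that ln(4/eps') > 0")
    bound = (8.0 * math.log(4.0 / eps_prime) / eps_prime) ** 2 - 2.0
    return max(6, math.ceil(bound))


def rate(t):
    """2 ln(t+2) / sqrt(t+2), decreasing in t for t >= 6."""
    return 2.0 * math.log(t + 2) / math.sqrt(t + 2)


def ratio(t):
    """(1 + ln(t+1)) / (2 sqrt(t+2) - 2)."""
    return (1.0 + math.log(t + 1)) / (2.0 * math.sqrt(t + 2) - 2.0)


if __name__ == "__main__":
    for eps_prime in (1.0, 0.5, 0.1, 0.01):
        t = min_iterations(eps_prime)
        print(eps_prime, t, ratio(t) <= rate(t) <= eps_prime)
```

The threshold scales like $\ln^{2}(1/\epsilon^{\prime})/\epsilon^{\prime 2}$, which is the usual $\tilde{\mathcal{O}}(1/\epsilon^{2})$ sample cost of stochastic mirror descent.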

(ii) We only prove this part for the lower bounds, as the proof for the upper bounds is similar. The choice of $\Omega(\delta)$ guarantees $6\exp(-\Omega(\delta)^{2}/3)+\exp(-\Omega(\delta)^{2}/12)+\exp(-3\Omega(\delta)/4)\leq\delta$ . Recall that $\nu_{s,t}=\dfrac{\gamma_{s}}{\sum_{s^{\prime}=0}^{t}\gamma_{s^{\prime}}}$ . Since $\gamma_{s}=\dfrac{1}{M\sqrt{s+1}}$ , the inequalities (59), (60), (61), and (62) hold. Applying these four inequalities to (48) yields

[TABLE]

where the first inequality follows from $M^{2}\geq\dfrac{2D_{x}^{2}M_{x}^{2}}{\alpha_{x}}$ by (11).

Let $\epsilon^{\prime}:=\epsilon_{\mathcal{A}}/\left(\Omega(\delta)Q+8\Omega(\delta)M+2.5M\right)$ . When $t\geq\max\left\{6,\left(\dfrac{8\ln(4/\epsilon^{\prime})}{\epsilon^{\prime}}\right)^{2}-2\right\}$ , the inequality (64) holds which can be applied to (65) to show that

[TABLE]

\halmos

Proof 8.15

Proof of Theorem 4.2. We begin by establishing that the following inequalities hold with high probability after at most $T(\delta,\epsilon)$ iterations:

[TABLE]

Given $\Omega(\delta)$ , parts (i) and (ii) of Proposition 4.1 imply that when

[TABLE]

we have $\text{Prob}\left\{u^{(t)}_{*}-l^{(t)}_{*}>\dfrac{\epsilon_{\mathcal{A}}}{2}\right\}\leq\delta/3$ , $\text{Prob}\left\{\left|\hat{l}^{(t)}_{*}-l^{(t)}_{*}\right|>\dfrac{\epsilon_{\mathcal{A}}}{2}\right\}\leq\delta/3$ , and $\text{Prob}\Big{\{}\Big{|}\hat{u}^{(t)}_{*}-u^{(t)}_{*}\Big{|}>\dfrac{\epsilon_{\mathcal{A}}}{2}\Big{\}}\leq\delta/3$ . Hence, using union bounds we get

[TABLE]

To complete the proof we show that (66) implies $\mathcal{P}(r,\bar{\mathbf{x}}^{(t)})-H(r)\leq\epsilon_{\mathcal{A}}$ and $\left|\hat{u}_{*}^{(t)}-H(r)\right|\leq\epsilon_{\mathcal{A}}$ . First note that we have

[TABLE]

where the first inequality follows from (13), the second from (66), and the third holds since $l^{(t)}_{*}$ is a lower bound on $H(r)$ . Using (66) and $u^{(t)}_{*}\leq H(r)+\dfrac{\epsilon_{\mathcal{A}}}{2}$ , we get

[TABLE]

In addition,

[TABLE]

where the first inequality holds by (66) and the second since $u^{(t)}_{*}$ is an upper bound on $H(r)$ . The inequalities (67)-(69) complete the proof. \halmos

Proof 8.16

Proof of Corollary 5.1: The proof directly follows from Theorems 2.5 and 4.2 and the definition of $\beta$ . \halmos

Lemma 8.17 below shows the number of iterations required by Algorithm 4 to find the upper bound $\bar{U}$ on $H(r^{(0)})$ .

Lemma 8.17

Given an input tuple $(r^{(0)},\bar{\alpha},\delta,\gamma_{t},\theta)$ , Algorithm 4 terminates with probability of at least $1-\delta$ after at most

[TABLE]

OVSMD calls and

[TABLE]

gradient iterations. In addition, $H(r^{(0)})\leq\bar{U}<0$ and ${|H(r^{(0)})|}/{|\bar{U}|}\leq\theta$ hold at termination.

Note that we use the $\tilde{\mathcal{O}}$ complexity notation, which omits logarithmic terms, to simplify the expression for the gradient iteration complexity.

Proof 8.18

Proof. We first prove that Algorithm 4 terminates with a probability of at least $1-\delta$ . Consider the $h$ th iteration of this algorithm. Given $\hat{u}_{*}^{(h)}$ returned by OVSMD, Theorem 4.2 guarantees with a probability of at least $1-\delta^{(h)}$ that

[TABLE]

Since $\sum_{h=0}^{\infty}\delta^{(h)}=\delta$ , using union bound it is clear that (70) holds for $h=0,1,2,\dots,$ with a probability of at least $1-\delta$ . In addition, (70) implies that $\hat{u}_{*}^{(h)}+\alpha^{(h)}\leq H(r^{(0)})+2\alpha^{(h)}\leq 0$ when $\alpha^{(h)}\leq-H(r^{(0)})/2$ . Furthermore, when $\alpha^{(h)}\leq-\frac{\theta-1}{2\theta}H(r^{(0)})$ (which also indicates that $\alpha^{(h)}\leq-\frac{H(r^{(0)})}{2}$ since $\theta>1$ and $H(r^{(0)})\leq 0$ ), we have

[TABLE]

where the first inequality follows from the inequality $-H(r^{(0)})\leq-\hat{u}_{*}^{(h)}+\alpha^{(h)}$ and the fact that the function $x/(x-2\alpha^{(h)})$ is a decreasing function in $x$ . (71) indicates that as soon as $\alpha^{(h)}\leq-\frac{\theta-1}{2\theta}H(r^{(0)})$ , the stopping criteria of Algorithm 4 hold and the algorithm terminates with a probability of at least $1-\delta$ . Since $\alpha^{(h)}=\alpha^{(0)}/2^{h}=\bar{\alpha}/2^{h}$ and $\beta=\Omega(|H(r^{(0)})|)$ , the inequality $\alpha^{(h)}\leq-\frac{\theta-1}{2\theta}H(r^{(0)})$ can be guaranteed in at most $J:=\log_{2}\left(\dfrac{2\theta\bar{\alpha}}{(\theta-1)|H(r^{(0)})|}\right)=\mathcal{O}\left(\ln\left(\frac{\theta}{(\theta-1)\beta}\right)\right)$ iterations. Furthermore, the inequalities (70) and (71) imply that at termination $\bar{U}=\hat{u}_{*}^{(J)}+\alpha^{(J)}<0$ and $\dfrac{|H(r^{(0)})|}{|\bar{U}|}\leq\dfrac{\hat{u}_{*}^{(J)}-\alpha^{(J)}}{\hat{u}_{*}^{(J)}+\alpha^{(J)}}\leq\theta$ .
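The halving argument above can be illustrated with a small simulation. The sketch below is not Algorithm 4 itself: the stopping test ($\bar{U}<0$ and $(\hat{u}-\alpha)/(\hat{u}+\alpha)\leq\theta$) is reconstructed from the final inequalities of this proof, and the OVSMD call is replaced by a mock oracle that returns $H(r^{(0)})$ perturbed by a fixed fraction of $\alpha^{(h)}$, so that $|\hat{u}-H(r^{(0)})|\leq\alpha^{(h)}$ always holds. All names are hypothetical.

```python
import math


def find_upper_bound(H, alpha_bar, theta, err_sign=1.0, max_rounds=200):
    """Halve the tolerance alpha until an upper bound U_bar on H is certified.

    H        : true value H(r^(0)) < 0, used only to fake the oracle output.
    err_sign : number in [-1, 1]; the mock oracle returns H + err_sign * alpha.
    Returns (h, U_bar), the round index at termination and the certified bound.
    """
    for h in range(max_rounds):
        alpha = alpha_bar / 2**h
        u_hat = H + err_sign * alpha      # mock oracle with |u_hat - H| <= alpha
        U_bar = u_hat + alpha             # candidate upper bound on H
        # Assumed stopping test, read off from the proof's last inequalities.
        if U_bar < 0 and (u_hat - alpha) / U_bar <= theta:
            return h, U_bar
    raise RuntimeError("no termination within max_rounds")


def round_bound(H, alpha_bar, theta):
    """J = log2(2*theta*alpha_bar / ((theta-1)*|H|)), rounded up and clipped at 0."""
    J = math.log2(2.0 * theta * alpha_bar / ((theta - 1.0) * abs(H)))
    return max(0, math.ceil(J))


if __name__ == "__main__":
    H, alpha_bar, theta = -1.0, 10.0, 2.0   # illustrative values
    for err_sign in (-1.0, 0.0, 1.0):
        h, U_bar = find_upper_bound(H, alpha_bar, theta, err_sign)
        print(err_sign, h, U_bar, h <= round_bound(H, alpha_bar, theta))
```

In every run the certified $\bar{U}$ satisfies $H(r^{(0)})\leq\bar{U}<0$ and $|H(r^{(0)})|/|\bar{U}|\leq\theta$, and termination occurs no later than round $\lceil J\rceil$, matching the conclusion of the argument.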

We next compute the total number of gradient iterations taken by Algorithm 4. Notice that by Theorem 4.2, the $h$ -th call of OVSMD requires at most $T(\delta^{(h)},\alpha^{(h)})$ iterations. Therefore, the total number of iterations can be computed as

[TABLE]

where we used $\Omega(\delta^{(h)})=\mathcal{O}\left(h\log\left(\dfrac{1}{\delta}\right)\right)$ and $\alpha^{(h)}=\frac{\bar{\alpha}}{2^{h}}$ in the second inequality, $J=\mathcal{O}\left(\log_{2}\left(\dfrac{\bar{\alpha}}{|H(r^{(0)})|}\right)\right)$ in the third and fourth equations, and $|H(r^{(0)})|=\Theta\left(\beta\right)$ . \halmos

Proof 8.19

Proof of Theorem 5.2: The proof of this theorem is a direct result of Corollary 5.1 and Lemma 8.17. In particular, it is straightforward to see that the total number of OVSMD calls is

[TABLE]

In addition, the total number of gradient iterations can be computed as

[TABLE]

\halmos

Proof 8.20

Proof of Corollary 5.3: Let $\delta^{(k)}=\dfrac{\delta}{2^{k}}$ for $k\geq 0$ as defined in SFLS. With a little abuse of notation, we use $\Omega(n)$ to represent a quantity whose order of magnitude is at least $n$ . According to Theorem 4.2, for any $\delta\in(0,1)$ and $K\geq 0$ , there exists $\epsilon_{\mathcal{A}}$ satisfying $\Omega\left(\frac{\ln(1/\delta^{K})}{\sqrt{T}}\right)\leq\epsilon_{\mathcal{A}}\leq\mathcal{O}\left(\frac{\ln(1/\delta^{K})\ln(T)}{\sqrt{T}}\right)$ such that OVSMD is a valid stochastic oracle $\mathcal{A}\left(r^{(k)},\epsilon_{\mathcal{A}},\delta^{(k)}\right)$ for iteration $k=0,1,\dots,K$ of SFLS. Let $\epsilon=-\frac{2\theta^{2}(\theta+1)}{(\theta-1)H(r^{(0)})}\epsilon_{\mathcal{A}}$ such that $\Omega\left(\frac{K\theta^{2}\ln(1/\delta)}{\sqrt{T}}\right)\leq\epsilon\leq\mathcal{O}\left(\frac{K\theta^{2}\ln(1/\delta)\ln(T)}{\sqrt{T}}\right)$ . Hence, there exists $K=\mathcal{O}\left(\frac{\theta^{2}}{\beta}\ln\left(\frac{T}{\beta}\right)\right)$ such that $K\geq\frac{2\theta^{2}}{\beta}\ln\left(\dfrac{\theta^{2}}{\beta\epsilon}\right)$ . With such $K$ and $\epsilon$ , according to Theorem 2.5, SFLS generates a feasible solution at iteration $k=0,1,\dots,K$ and finds a relative $\epsilon$ -optimal and feasible solution with $\epsilon\leq\mathcal{O}\left(\frac{\theta^{4}\ln(1/\delta)\ln(T)\ln\left(T/\beta\right)}{\beta\sqrt{T}}\right)$ with a probability of at least $1-\delta$ in at most $K$ outer iterations (calls of OVSMD), which corresponds to $KT=\mathcal{O}\left(\frac{\theta^{2}T}{\beta}\ln\left(\frac{T}{\beta}\right)\right)$ gradient iterations. \halmos

Bibliography

Abdelaziz FB (2012) Solution approaches for the multiobjective stochastic programming. European Journal of Operational Research 216(1):1–16.
Abdelaziz FB, Aouni B, El Fayedh R (2007) Multi-objective stochastic programming for portfolio selection. European Journal of Operational Research 177(3):1811–1823.
Adelman D, Mersereau A (2008) Relaxations of weakly coupled stochastic dynamic programs. Operations Research 56(3):712–727.
Adelman D, Mersereau AJ (2013) Dynamic capacity allocation to customers who remember past service. Management Science 59(3):592–612.
Allen-Zhu Z (2017) Katyusha: The first direct acceleration of stochastic gradient methods. Proceedings of the 49th Annual ACM Symposium on Theory of Computing, STOC ’17.
Aravkin AY, Burke JV, Drusvyatskiy D, Friedlander MP, Roy S (2019) Level-set methods for convex optimization. Mathematical Programming 174(1-2):359–390.
Azaron A, Brown K, Tarim S, Modarres M (2008) A multi-objective stochastic programming approach for supply chain design considering risk. International Journal of Production Economics 116(1):129–138.
Bach FR, Moulines E (2013) Non-strongly-convex smooth stochastic approximation with convergence rate o(1/n). Advances in Neural Information Processing Systems (NIPS), 773–781.
