Sequential Quadratic Optimization for Stochastic Optimization with Deterministic Nonlinear Inequality and Equality Constraints

Frank E. Curtis; Daniel P. Robinson; Baoyu Zhou

arXiv:2302.14790·math.OC·March 1, 2023·SIAM J. Optim.

TL;DR

This paper introduces a sequential quadratic optimization algorithm designed for stochastic problems with nonlinear constraints, where only stochastic gradient estimates are available for the objective, and provides convergence guarantees under mild assumptions.

Contribution

It proposes a novel algorithm for stochastic constrained optimization that works with only stochastic gradient estimates and proves its convergence under loose assumptions.

Findings

01. The algorithm can outperform an alternative approach that relies on computing more accurate gradient estimates.

02. Convergence guarantees in expectation are established under assumptions that include unbiased gradient estimates.

03. Numerical experiments demonstrate practical effectiveness.

Abstract

A sequential quadratic optimization algorithm for minimizing an objective function defined by an expectation subject to nonlinear inequality and equality constraints is proposed, analyzed, and tested. The context of interest is when it is tractable to evaluate constraint function and derivative values in each iteration, but it is intractable to evaluate the objective function or its derivatives in any iteration, and instead an algorithm can only make use of stochastic objective gradient estimates. Under loose assumptions, including that the gradient estimates are unbiased, the algorithm is proved to possess convergence guarantees in expectation. The results of numerical experiments are presented to demonstrate that the proposed algorithm can outperform an alternative approach that relies on the ability to compute more accurate gradient estimates.

Equations

$$\min_{x\in\mathbb{R}^{n}}\ f(x)\ \ \text{s.t.}\ \ c(x)=0\ \ \text{and}\ \ x\geq 0,\ \ \text{with}\ \ f(x)=\mathbb{E}_{\omega}[F(x,\omega)]$$

$$f(x)\geq f_{\inf},\ \ \|\nabla f(x)\|_{2}\leq\kappa_{\nabla f},\ \ \|c(x)\|_{2}\leq\kappa_{c},\ \ \text{and}\ \ \|\nabla c(x)\|_{2}\leq\kappa_{\nabla c}$$

$$\|\nabla f(x)-\nabla f(\overline{x})\|_{2}\leq L\|x-\overline{x}\|_{2}\ \ \text{and}\ \ \|\nabla c(x)^{T}-\nabla c(\overline{x})^{T}\|_{2}\leq\Gamma\|x-\overline{x}\|_{2}$$

$$\nabla f(x)+\nabla c(x)y-z=0,\ \ c(x)=0,\ \ 0\leq x\perp z\geq 0$$

$$0\leq x\perp\nabla c(x)c(x)\geq 0$$

$$\mathbb{E}_{k}[G_{k}]=\nabla f(X_{k})\ \ \text{and}\ \ \mathbb{E}_{k}[\|G_{k}-\nabla f(X_{k})\|_{2}^{2}]\leq\rho_{k}$$

$$\min_{u\in\mathbb{R}^{n},\,w\in\mathbb{R}^{m}}\ \tfrac{1}{2}\|c_{k}+\nabla c(x_{k})^{T}\nabla c(x_{k})w\|_{2}^{2}+\tfrac{1}{2}\mu_{k}\|u\|_{2}^{2}\ \ \text{s.t.}\ \ \nabla c(x_{k})^{T}u=0\ \ \text{and}\ \ x_{k}+u+\nabla c(x_{k})w\geq 0$$

$$\min_{d\in\mathbb{R}^{n}}\ g_{k}^{T}d+\tfrac{1}{2}d^{T}H_{k}d\ \ \text{s.t.}\ \ \nabla c(x_{k})^{T}d=\nabla c(x_{k})^{T}v_{k}\ \ \text{and}\ \ x_{k}+d\geq 0$$

$$\Delta l(x_{k},\tau_{k},g_{k},d_{k}):=l(x_{k},\tau_{k},g_{k},0)-l(x_{k},\tau_{k},g_{k},d_{k})=-\tau_{k}g_{k}^{T}d_{k}+\|c_{k}\|_{2}-\|c_{k}+\nabla c(x_{k})^{T}d_{k}\|_{2}$$

$$\tau_{k}^{\rm trial}\leftarrow\begin{cases}\infty & \text{if } g_{k}^{T}d_{k}+\tfrac{1}{2}d_{k}^{T}H_{k}d_{k}\leq 0\\[4pt] \dfrac{(1-\sigma)(\|c_{k}\|_{2}-\|c_{k}+\nabla c(x_{k})^{T}d_{k}\|_{2})}{g_{k}^{T}d_{k}+\tfrac{1}{2}d_{k}^{T}H_{k}d_{k}} & \text{otherwise}\end{cases}$$

$$\tau_{k}\leftarrow\begin{cases}\tau_{k-1} & \text{if } \tau_{k-1}\leq\tau_{k}^{\rm trial}\\ \min\{(1-\epsilon_{\tau})\tau_{k-1},\tau_{k}^{\rm trial}\} & \text{otherwise}\end{cases}$$

$$\xi_{k}^{\rm trial}\leftarrow\frac{\Delta l(x_{k},\tau_{k},g_{k},d_{k})}{\tau_{k}\|d_{k}\|_{2}^{2}},\ \ \text{then}\ \ \xi_{k}\leftarrow\begin{cases}\xi_{k-1} & \text{if } \xi_{k-1}\leq\xi_{k}^{\rm trial}\\ \min\{(1-\epsilon_{\xi})\xi_{k-1},\xi_{k}^{\rm trial}\} & \text{otherwise}\end{cases}$$

$$\alpha_{k}^{\min}\leftarrow\frac{2(1-\eta)\beta_{k}\xi_{k}\tau_{k}}{\tau_{k}L+\Gamma}\in(0,1]\ \ \text{for all}\ \ k\in\mathbb{N}$$

$$\varphi_{k}(\alpha)=\cdots+\alpha(\|c_{k}\|_{2}-\|c_{k}+\nabla c(x_{k})^{T}d_{k}\|_{2})+\tfrac{1}{2}(\tau_{k}L+\Gamma)\alpha^{2}\|d_{k}\|_{2}^{2}$$

$$\alpha_{k}^{\varphi}\ \ \text{and}\ \ \alpha_{k}^{\max}$$

$$\min_{u\in\mathbb{R}^{n},\,w\in\mathbb{R}^{m}}\ \tfrac{1}{2}\|c(x)+\nabla c(x)^{T}\nabla c(x)w\|_{2}^{2}+\tfrac{1}{2}\mu\|u\|_{2}^{2}\ \ \text{s.t.}\ \ \nabla c(x)^{T}u=0\ \ \text{and}\ \ x+u+\nabla c(x)w\geq 0$$

$$\nabla c(x)^{T}\nabla c(x)c(x)+\nabla c(x)^{T}\nabla c(x)\nabla c(x)^{T}\nabla c(x)w-\nabla c(x)^{T}\delta=0,\ \ \mu u+\nabla c(x)\gamma-\delta=0,\ \ \nabla c(x)^{T}u=0,\ \ \text{and}\ \ 0\leq\delta\perp x+u+\nabla c(x)w\geq 0$$

$$\nabla c(x)^{T}\nabla c(x)c(x)-\nabla c(x)^{T}\delta=0,\ \ \nabla c(x)\gamma-\delta=0,\ \ \text{and}\ \ 0\leq\delta\perp x\geq 0$$

$$\tfrac{1}{2}\|c(x)\|_{2}^{2}\geq\tfrac{1}{2}\|c(x)+\nabla c(x)^{T}\nabla c(x)w\|_{2}^{2}+\tfrac{1}{2}\mu\|u\|_{2}^{2}>\tfrac{1}{2}\|c(x)+\nabla c(x)^{T}\nabla c(x)w\|_{2}^{2}$$

$$\begin{bmatrix} j(x)j(x)^{T}j(x)j(x)^{T} & 0 & 0 & -j(x)_{A_{*}}\\ 0 & \mu I & j(x)^{T} & -I_{A_{*}}^{T}\\ 0 & j(x) & 0 & 0\\ j(x)_{A_{*}}^{T} & I_{A_{*}} & 0 & 0\end{bmatrix}\begin{bmatrix} w\\ u\\ \gamma\\ \delta_{A_{*}}\end{bmatrix}=\begin{bmatrix}-j(x)j(x)^{T}c(x)\\ 0\\ 0\\ -x_{A_{*}}\end{bmatrix}$$

$$\|c_{k}\|_{2}-\|c_{k}+\nabla c(x_{k})^{T}v_{k}\|_{2}\geq\kappa_{v,2}\|v_{k}\|_{2}^{2}\ \ \text{for all}\ \ k\in S_{\lambda}$$

$$\|c_{k}\|_{2}^{2}-\|c_{k}+j_{k}v_{k}\|_{2}^{2}=(\|c_{k}\|_{2}+\|c_{k}+j_{k}v_{k}\|_{2})(\|c_{k}\|_{2}-\|c_{k}+j_{k}v_{k}\|_{2})\leq\cdots$$

$$\min_{\alpha\in[0,1]}\ \tfrac{1}{2}\|c_{k}+\alpha j_{k}j_{k}^{T}w_{k}\|_{2}^{2}+\tfrac{1}{2}\mu_{k}\|\alpha u_{k}\|_{2}^{2}$$

$$\|c_{k}\|_{2}^{2}-\|c_{k}+j_{k}v_{k}\|_{2}^{2}=-2c_{k}^{T}j_{k}j_{k}^{T}w_{k}-\|j_{k}j_{k}^{T}w_{k}\|_{2}^{2}\geq\|j_{k}j_{k}^{T}w_{k}\|_{2}^{2}+2\mu_{k}\|u_{k}\|_{2}^{2}$$

$$\|c_{k}\|_{2}-\|c_{k}+j_{k}v_{k}\|_{2}\geq(2\kappa_{c})^{-1}(\|j_{k}j_{k}^{T}w_{k}\|_{2}^{2}+2\mu_{k}\|u_{k}\|_{2}^{2})\geq(2\kappa_{c})^{-1}(\lambda^{2}\|w_{k}\|_{2}^{2}+2\mu_{k}\|u_{k}\|_{2}^{2})$$

Taxonomy

Topics: Advanced Multi-Objective Optimization Algorithms · Stochastic Gradient Optimization Techniques · Advanced Optimization Algorithms Research

Full text

Affiliations: Department of Industrial and Systems Engineering, Lehigh University; Booth School of Business, The University of Chicago

1 Introduction

We propose a sequential quadratic optimization (commonly known as SQP) algorithm for minimizing an objective function defined by an expectation subject to nonlinear inequality and equality constraints. Such optimization problems arise in a plethora of application areas, including, but not limited to, machine learning [28], network optimization [7], resource allocation [26], portfolio optimization [37], risk-averse partial-differential-equation-constrained optimization [27], maximum-likelihood estimation [25], and multi-stage optimization [40].

The design and analysis of deterministic algorithms for solving continuous optimization problems involving inequality and equality constraints has been a well-studied topic for decades. Numerous types of such algorithms, such as penalty methods, interior-point methods, and SQP methods, have been designed to solve such problems. Penalty methods are based on the idea of using unconstrained optimization algorithms to minimize a weighted sum—determined by a penalty parameter—of the objective and a measure of constraint violation; e.g., see [11, 19, 45] for algorithms that make use of nondifferentiable (exact) penalty functions and see [15, 14, 21, 46] for algorithms that make use of differentiable (exact) penalty functions. While they are able to offer convergence guarantees from remote starting points, the numerical performance of penalty methods often suffers from ill-conditioning of the penalty functions and/or sensitivity of the algorithm’s performance on the particular scheme employed for updating the penalty parameter [34]. Interior-point methods [16] are designed to use barrier functions to guide the algorithm along a central path through the interior of the feasible region (or, at least, the interior of a set defined by bounds on a subset of the variables) to a solution [9, 10, 29, 30, 43, 44]. Such algorithms have been shown to be very effective in practice, which is why many state-of-the-art software packages for continuous nonlinear optimization are built on interior-point methods; see, e.g., [10, 42]. Overall, both penalty and interior-point methods involve the use of additional objective terms to handle the presence of inequality constraints.

Alternatively, in this paper, we present, analyze, and demonstrate the numerical performance of an SQP method for solving continuous nonlinear optimization problems. The SQP paradigm is based on the idea of, at each iterate, solving a subproblem (or subproblems) defined based on a local linearization of the constraint function and a local quadratic approximation of the objective. Unlike in the deterministic setting, for which numerous SQP algorithms have been proposed (see, e.g., [18, 20, 24, 34]), there have been few stochastic algorithms proposed for the setting of solving optimization problems with nonlinear constraints. That said, in the past few years, a couple of classes of stochastic SQP methods have been designed for optimization subject to nonlinear equality constraints. For example, the article [1] proposes an SQP algorithm that uses stochastic objective gradient estimates for solving such problems that employs an adaptive step size policy based on Lipschitz constants (or estimates of them). For an alternative setting in which one is willing to compute objective value estimates as well, and to refine objective function and gradient estimates within a given iteration until probabilistic conditions of accuracy are satisfied, the article [32] proposes a line-search stochastic SQP method. There have subsequently been multiple extensions of the methods in [1] and [32], as well as work on different, but related algorithmic strategies—still for the setting of only nonlinear equality constraints. There has been work on relaxing constraint qualifications [3], allowing matrix-free and inexact solves of the arising linear systems [12], using a trust-region methodology [17], incorporating noisy (potentially biased) function and gradient estimates [5, 35], employing variance-reduction strategies [2, 4], considering sketch-and-project techniques [33], and analyzing the worst-case complexity (see [13]) of the method proposed in [1].

Unlike the setting of equality constraints only, to our knowledge there has been very little work on the design and analysis of stochastic algorithms for optimization subject to nonlinear (nonconvex) inequality and equality constraints. Three exceptions are: the active-set line-search SQP algorithms proposed in [31] and (very recently) in [38] and the momentum-based augmented Lagrangian method (a penalty method) proposed in [41]. We expect that our proposed SQP algorithm will perform well in comparison to a stochastic-gradient-based penalty method. We demonstrate with numerical experiments that our approach can outperform the algorithm proposed in [31]. We remark in passing that interior-point methods often outperform SQP methods in the deterministic setting, but as far as we are aware there exists no interior-point method designed for the stochastic setting that we consider.

1.1 Contributions

In this paper, we build on the algorithmic strategy and analysis in [1] to propose and analyze an adaptive stochastic SQP algorithm for solving nonlinear optimization problems subject to (deterministic) inequality and equality constraints. This work involves significant advancements beyond that in [1] that are necessary since, unlike in the setting of only having equality constraints, the presence of inequality constraints automatically guarantees that, at a given iterate, the search direction computed in a stochastic SQP method will be a biased estimate of the “true” search direction, i.e., the one that would be computed if the actual gradient of the objective function were available. This necessitates a distinct change in the design of the algorithm as well as distinct alterations to the convergence analysis, since the analysis in [1] relies heavily on the search directions being (conditionally) unbiased estimators of their “true” counterparts. The algorithm from the literature that can be seen as the nearest alternative approach is the algorithm in [31]. However, there are substantial differences between the algorithm and analysis in [31] and those presented in this paper. Like in [32] for the equality-only case, the algorithm in [31] is designed for the setting in which one is willing to refine function and gradient estimates within an iteration until probabilistic conditions of accuracy are satisfied, and in this manner the analysis of that algorithm offers guarantees that are relatively closer to those offered for a deterministic algorithm. By contrast, the algorithm in this paper, like the algorithm in [1], is designed to allow the stochastic gradient estimates to be potentially much less accurate, and in such a context we are satisfied with offering convergence guarantees in expectation. 
We compare the numerical performance of our proposed algorithm with that in [31] to demonstrate that there are settings in which our proposed approach has advantages in practice.

1.2 Notation

We use $\mathbb{R}$ to denote the set of real numbers, ${\overline{\mathbb{R}\mkern-2.0mu}\mkern 2.0mu}$ to denote the set of extended-real numbers (i.e., ${\overline{\mathbb{R}\mkern-2.0mu}\mkern 2.0mu}:=\mathbb{R}\cup\{-\infty,\infty\}$ ), and $\mathbb{R}_{\geq a}$ (resp., $\mathbb{R}_{>a}$ ) to denote the set of real numbers greater than or equal to (resp., greater than) $a\in\mathbb{R}$ . We append a superscript to such a set to denote the space of vectors or matrices whose elements are restricted to the indicated set; e.g., we use $\mathbb{R}^{n}$ to denote the set of $n$ -dimensional real vectors and $\mathbb{R}^{m\times n}$ to denote the set of $m$ -by- $n$ -dimensional real matrices. We use $\mathbb{N}:=\{1,2,\dots\}$ to denote the set of positive integers and, given $n\in\mathbb{N}$ , we use $[n]:=\{1,\dots,n\}$ to denote the set of positive integers less than or equal to $n$ . Given $(a,b)\in\mathbb{R}^{n}\times\mathbb{R}^{n}$ , we write $a\perp b$ to mean, with $a_{i}$ and $b_{i}$ denoting the $i$ th elements of $a$ and $b$ , respectively, that $a_{i}=0$ and/or $b_{i}=0$ for all $i\in[n]$ . Given real symmetric matrices $A\in\mathbb{R}^{n\times n}$ and $B\in\mathbb{R}^{n\times n}$ , we write $A\succeq B$ (resp., $A\succ B$ ) to indicate that $A-B$ is positive semidefinite (resp., positive definite). Given $H\in\mathbb{R}^{n\times n}$ with $H\succ 0$ and $a\in\mathbb{R}^{n}$ , we denote the norm $\|a\|_{H}:=\sqrt{a^{T}Ha}$ .

Our problem of interest is defined with respect to a variable $x\in\mathbb{R}^{n}$ and the algorithm that we propose and analyze is iterative, meaning that, in any run, it generates an iterate sequence that we denote as $\{x_{k}\}$ with $x_{k}\in\mathbb{R}^{n}$ for all generated $k\in\mathbb{N}$ , i.e., $\{x_{k}\}\subset\mathbb{R}^{n}$ . (We use such notation throughout the paper when the elements of sequence are contained within a given set. We say “for all generated $k\in\mathbb{N}$ ” since our proposed algorithm might terminate finitely. Whether a subscript is being used to indicate the element of a vector or the index number of a sequence is always made clear by the context. The $i$ th element of an iterate $x_{k}$ is denoted $[x_{k}]_{i}$ .) We use subscripts similarly to denote other quantities corresponding to each iteration of the algorithm; e.g., we introduce a merit parameter denoted as $\tau\in\mathbb{R}_{>0}$ whose value in iteration $k\in\mathbb{N}$ is denoted as $\tau_{k}\in\mathbb{R}_{>0}$ , and corresponding to a constraint function $c$ (see problem (1) below) we denote its value at $x_{k}$ as $c_{k}:=c(x_{k})$ .

The iteration-dependent quantities mentioned in the previous paragraph—and additional ones introduced in the description of our algorithm—represent realizations of the random variables in a stochastic process generated by the algorithm. Specifically, the behavior of our algorithm is dictated by prescribed initial conditions and a sequence of stochastic objective gradient estimators that we denote by $\{G_{k}\}$ . After proving preliminary results that hold for every run of the algorithm, we present our ultimate convergence theory for our algorithm in terms of a filtration defined in terms of $\sigma$ -algebras dependent on the initial conditions of the algorithm and $\{G_{k}\}$ .

1.3 Organization

Our problem of interest and preliminary assumptions about its objective and constraint functions, as well as about user-defined quantities in our proposed algorithm, are stated in Section 2. A description of our proposed algorithm is provided in Section 3. Convergence-in-expectation of the algorithm is proved under reasonable assumptions in Section 4. The results of numerical experiments are presented in Section 5 and concluding remarks are given in Section 6.

2 Setting

We formulate our problem of interest as

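$$\min_{x\in\mathbb{R}^{n}}\ f(x)\ \ \text{s.t.}\ \ c(x)=0\ \ \text{and}\ \ x\geq 0,\ \ \text{with}\ \ f(x)=\mathbb{E}_{\omega}[F(x,\omega)], \tag{1}$$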

where $f:\mathbb{R}^{n}\to\mathbb{R}$ and $c:\mathbb{R}^{n}\to\mathbb{R}^{m}$ are continuously differentiable, $\omega$ is a random variable with associated probability space $(\Omega,{\cal F},\mathbb{P})$ , $F:\mathbb{R}^{n}\times\Omega\to\mathbb{R}$ , and $\mathbb{E}_{\omega}[\cdot]$ denotes expectation taken with respect to the distribution of $\omega$ . Our algorithm and analysis extend easily to the setting in which the nonnegativity constraint in (1) is generalized to $l\leq x\leq u$ for some $(l,u)\in{\overline{\mathbb{R}\mkern-2.0mu}\mkern 2.0mu}{}^{n}\times{\overline{\mathbb{R}\mkern-2.0mu}\mkern 2.0mu}{}^{n}$ with $l_{i}\leq u_{i}$ for all $i\in[n]$ ; we merely consider nonnegativity in (1) for the sake of notational simplicity. It is also worth mentioning that any smooth constrained optimization problem can be reformulated as (1) (or at least as such a problem with generalized bound constraints); e.g., inequality constraints $c_{\cal I}(x)\leq 0$ , where $c_{{\cal I}}:\mathbb{R}^{n}\to\mathbb{R}^{m_{{\cal I}}}$ is continuously differentiable, can be reformulated to fit into the form of (1) through the incorporation of slack variables, say $s\in\mathbb{R}^{m_{{\cal I}}}$ , to have the constraints $c_{{\cal I}}(x)+s_{{\cal I}}=0$ and $s_{{\cal I}}\geq 0$ .
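The slack-variable reformulation mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not part of the paper's algorithm: the helper name `add_slacks`, the example constraint, and the stacked variable $z=(x,s)$ are our own assumptions.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def add_slacks(c_ineq: Callable[[Sequence[float]], Vector], n: int):
    """Turn c_I(x) <= 0 into the equalities c_I(x) + s = 0 with s >= 0.

    The returned function acts on the stacked variable z = (x, s),
    where x = z[:n] and s = z[n:] holds one slack per inequality.
    """
    def c_eq(z: Sequence[float]) -> Vector:
        x, s = z[:n], z[n:]
        return [ci + si for ci, si in zip(c_ineq(x), s)]
    return c_eq


# Hypothetical example: the single inequality x_1 + x_2 - 1 <= 0.
def c_ineq(x: Sequence[float]) -> Vector:
    return [x[0] + x[1] - 1.0]


c_eq = add_slacks(c_ineq, n=2)

# x = (0.25, 0.25) satisfies the inequality strictly; the slack s = 0.5 >= 0
# makes the equality constraint hold exactly.
residual = c_eq([0.25, 0.25, 0.5])
```

Only the slack components need the bound $s\geq 0$; under the generalized bounds $l\leq x\leq u$ described above, the original variables keep whatever (possibly infinite) bounds they already had.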

We make the following assumption throughout the remainder of the paper pertaining to the functions in problem (1) and our proposed algorithm. As seen in the following section, our algorithm seeks feasibility and stationarity with respect to (1) by generating an iterate sequence that stays feasible with respect to the bound constraints, meaning that, in any run of the algorithm, $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ for all generated $k\in\mathbb{N}$ .

Assumption 1.

Let ${\cal X}\subset\mathbb{R}^{n}$ be an open convex set that almost-surely contains the iterate sequence $\{x_{k}\}\subset\mathbb{R}^{n}_{\geq 0}$ generated in any realization of a run of the algorithm. The objective function $f:\mathbb{R}^{n}\to\mathbb{R}$ is continuously differentiable and bounded below over ${\cal X}$ and the objective gradient function $\nabla f:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is Lipschitz continuous and bounded in norm over ${\cal X}$ . Similarly, for all $i\in[m]$ , the constraint function $c_{i}:\mathbb{R}^{n}\to\mathbb{R}$ is continuously differentiable and bounded over ${\cal X}$ and the constraint gradient function $\nabla c_{i}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is Lipschitz continuous and bounded in norm over ${\cal X}$ . Finally, the constraint Jacobian $\nabla c^{T}:\mathbb{R}^{n}\to\mathbb{R}^{m\times n}$ has full row rank over ${\cal X}$ .

Under Assumption 1, there exist $f_{\inf}\in\mathbb{R}$ and a tuple of positive constants $(\kappa_{\nabla f},\kappa_{c},\kappa_{\nabla c},L,\Gamma)\in\mathbb{R}_{>0}\times\mathbb{R}_{>0}\times\mathbb{R}_{>0}\times\mathbb{R}_{>0}\times\mathbb{R}_{>0}$ such that for all $x\in{\cal X}$ one has

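$$f(x)\geq f_{\inf},\ \ \|\nabla f(x)\|_{2}\leq\kappa_{\nabla f},\ \ \|c(x)\|_{2}\leq\kappa_{c},\ \ \text{and}\ \ \|\nabla c(x)\|_{2}\leq\kappa_{\nabla c}, \tag{2}$$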

and for all $(x,\mkern 1.5mu\overline{\mkern-1.5mux})\in{\cal X}\times{\cal X}$ one has

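$$\|\nabla f(x)-\nabla f(\overline{x})\|_{2}\leq L\|x-\overline{x}\|_{2}\ \ \text{and}\ \ \|\nabla c(x)^{T}-\nabla c(\overline{x})^{T}\|_{2}\leq\Gamma\|x-\overline{x}\|_{2}. \tag{3}$$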

In addition, due to the continuous differentiability of the objective and constraint functions and the full row rank of the constraint Jacobians, it follows that at any (local) minimizer of (1), call it $x\in\mathbb{R}^{n}$ , there exist $y\in\mathbb{R}^{m}$ and $z\in\mathbb{R}^{n}$ such that the following Karush-Kuhn-Tucker (KKT) conditions are satisfied:

$$\nabla f(x)+\nabla c(x)y-z=0,\ \ c(x)=0,\ \ 0\leq x\perp z\geq 0. \tag{4}$$

We refer to any $x\in\mathbb{R}^{n}$ such that there exists $(y,z)\in\mathbb{R}^{m}\times\mathbb{R}^{n}$ satisfying (4) as a first-order stationary point (or KKT point) with respect to (1).
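A numerical check of first-order stationarity can be sketched as follows, reading the KKT conditions (4) as $\nabla f(x)+\nabla c(x)y-z=0$, $c(x)=0$, and $0\leq x\perp z\geq 0$. The function name `kkt_residual`, the choice of a max-type residual, and the test problem are our own assumptions for illustration; the paper does not prescribe this measure.

```python
from typing import List, Sequence


def kkt_residual(grad_f: Sequence[float],
                 grad_c: Sequence[Sequence[float]],
                 c: Sequence[float],
                 x: Sequence[float],
                 y: Sequence[float],
                 z: Sequence[float]) -> float:
    """Largest violation of the KKT conditions at (x, y, z).

    grad_c is the n-by-m matrix whose j-th column is the gradient of c_j,
    stored as a list of n rows. A return value of zero means (x, y, z)
    satisfies stationarity, feasibility, and complementarity exactly.
    """
    n, m = len(x), len(c)
    stationarity = [
        grad_f[i] + sum(grad_c[i][j] * y[j] for j in range(m)) - z[i]
        for i in range(n)
    ]
    terms: List[float] = [abs(v) for v in stationarity]
    terms += [abs(v) for v in c]                       # c(x) = 0
    terms += [max(-xi, 0.0) for xi in x]               # x >= 0
    terms += [max(-zi, 0.0) for zi in z]               # z >= 0
    terms += [abs(xi * zi) for xi, zi in zip(x, z)]    # x_i * z_i = 0
    return max(terms)


# Hypothetical problem: f(x) = x_1^2 + x_2^2, c(x) = x_1 + x_2 - 1, x >= 0.
grad_c = [[1.0], [1.0]]

# At x = (0.5, 0.5): grad f = (1, 1), c = 0, and y = -1, z = 0 satisfy (4).
res_opt = kkt_residual([1.0, 1.0], grad_c, [0.0], [0.5, 0.5], [-1.0], [0.0, 0.0])

# At the feasible point x = (1, 0) with y = 0, z = 0, stationarity fails.
res_bad = kkt_residual([2.0, 0.0], grad_c, [0.0], [1.0, 0.0], [0.0], [0.0, 0.0])
```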

Since our algorithm generates iterates that are feasible with respect to the bound constraints, but not necessarily with respect to the equality constraints, we need to account for the possible existence of points that are infeasible for (1), but are stationary with respect to the minimization of a constraint violation measure over $\mathbb{R}^{n}_{\geq 0}$ . We refer to a point that is infeasible for (1) as an infeasible stationary point if it is stationary with respect to the minimization of $\tfrac{1}{2}\|c(x)\|_{2}^{2}$ subject to $x\in\mathbb{R}^{n}_{\geq 0}$ , meaning

$$0\leq x\perp\nabla c(x)c(x)\geq 0. \tag{5}$$
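To make the notion of an infeasible stationary point concrete, the sketch below evaluates a residual for the condition $0\leq x\perp\nabla c(x)c(x)\geq 0$, using the fact that $\nabla c(x)c(x)$ is the gradient of $\tfrac{1}{2}\|c(x)\|_{2}^{2}$. The componentwise `min` residual and the example constraint are our own assumptions, chosen because $\min\{a,b\}=0$ holds exactly when $a\geq 0$, $b\geq 0$, and $ab=0$.

```python
from typing import Sequence


def infeasible_stationarity_residual(x: Sequence[float],
                                     c: Sequence[float],
                                     grad_c: Sequence[Sequence[float]]) -> float:
    """Residual of 0 <= x  perp  grad_c(x) c(x) >= 0.

    grad_c is the n-by-m matrix of constraint gradients (list of n rows).
    """
    n, m = len(x), len(c)
    # Gradient of 0.5 * ||c(x)||_2^2 with respect to x.
    g = [sum(grad_c[i][j] * c[j] for j in range(m)) for i in range(n)]
    return max(abs(min(xi, gi)) for xi, gi in zip(x, g))


# Hypothetical constraint c(x) = x_1 + x_2 + 1, which has no root with x >= 0.
grad_c = [[1.0], [1.0]]

# x = (0, 0) is infeasible (c = 1) yet stationary for the violation measure.
res_origin = infeasible_stationarity_residual([0.0, 0.0], [1.0], grad_c)

# x = (1, 0) is not stationary: decreasing x_1 reduces the violation.
res_other = infeasible_stationarity_residual([1.0, 0.0], [2.0], grad_c)
```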

Each iteration of our algorithm requires a stochastic estimate of the gradient of the objective at the current iterate. In a given run at iteration $k\in\mathbb{N}$ , the realization of the iterate and gradient estimate is $(x_{k},g_{k})$ , which later in our analysis we denote as a realization of the pair of random variables $(X_{k},G_{k})$ . (See Section 4.3 for a complete description of a stochastic process that we analyze.) With respect to the gradient estimators, we make Assumption 2 below. For the prescribed (i.e., not random) sequence $\{\rho_{k}\}\subset\mathbb{R}_{>0}$ referenced in the assumption, we state precise conditions that it must satisfy in Section 4.3. In the assumption and throughout the remainder of the paper, we use $\mathbb{E}_{k}[\cdot]$ to denote expectation taken with respect to the distribution of $\omega$ conditioned on a trace $\sigma$ -algebra of an event ${\cal E}$ , denoted by ${\cal F}_{k}$ ; see Section 4.3.

Assumption 2.

For a prescribed $\{\rho_{k}\}\subset\mathbb{R}_{>0}$ , one finds for all $k\in\mathbb{N}$ that

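$$\mathbb{E}_{k}[G_{k}]=\nabla f(X_{k})\ \ \text{and}\ \ \mathbb{E}_{k}[\|G_{k}-\nabla f(X_{k})\|_{2}^{2}]\leq\rho_{k}. \tag{6}$$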

One might relax the latter condition in (6) and obtain guarantees that are similar to those that we prove; see, e.g., [36]. We employ (6) for simplicity, since it is sufficient for demonstrating the guarantees that our algorithmic approach can offer.
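One common way to obtain an estimator with the two properties in Assumption 2 (unbiasedness and a bounded second moment of the error) is mini-batch sampling for a finite-sum objective. The sketch below is an assumption-laden illustration, not the paper's setup: the scalar least-squares samples, the names `minibatch_grad` and `variance_bound`, and sampling with replacement are all our own choices.

```python
import random
from typing import Sequence, Tuple

Data = Sequence[Tuple[float, float]]  # hypothetical samples (a_i, b_i)


def sample_grad(x: float, a: float, b: float) -> float:
    """Derivative of F(x, (a, b)) = 0.5 * (a * x - b)^2 with respect to x."""
    return a * (a * x - b)


def full_grad(x: float, data: Data) -> float:
    """Gradient of f(x) = average of F(x, omega) over all samples."""
    return sum(sample_grad(x, a, b) for a, b in data) / len(data)


def minibatch_grad(x: float, data: Data, batch_size: int, rng: random.Random) -> float:
    """Average of batch_size i.i.d. uniform draws (with replacement)."""
    batch = [rng.choice(data) for _ in range(batch_size)]
    return sum(sample_grad(x, a, b) for a, b in batch) / batch_size


def variance_bound(x: float, data: Data, batch_size: int) -> float:
    """Exact E[(G - grad f(x))^2] for the estimator above: sigma^2 / batch_size."""
    mean = full_grad(x, data)
    sigma2 = sum((sample_grad(x, a, b) - mean) ** 2 for a, b in data) / len(data)
    return sigma2 / batch_size


data = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
rng = random.Random(0)
trials = [minibatch_grad(1.0, data, 4, rng) for _ in range(20000)]
empirical_mean = sum(trials) / len(trials)
```

Under these assumptions, increasing the batch size shrinks the bound in proportion to `1 / batch_size`, which is one way a prescribed sequence $\{\rho_{k}\}$ could be met.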

Each iteration of our algorithm also makes use of a symmetric and positive-definite (SPD) matrix, denoted as $H_{k}\in\mathbb{R}^{n\times n}$ for iteration $k\in\mathbb{N}$ , to define a quadratic term in the subproblem that is solved for computing the search direction. For simplicity, we assume that the sequence $\{H_{k}\}$ is prescribed, e.g., one may consider $H_{k}=I$ for all $k\in\mathbb{N}$ . More generally, one could consider a more sophisticated scheme such as setting, for all $k\in\mathbb{N}$ , the matrix $H_{k}$ as a stochastic estimate of the Hessian of the objective function and/or a Lagrangian function as long as it is sufficiently positive definite and bounded and the choice is made to be conditionally uncorrelated with the stochastic gradient estimate. However, since considering such a loose requirement would only obfuscate our analysis without adding significant value, we assume for simplicity that $\{H_{k}\}$ is prescribed and merely satisfies the following.

Assumption 3.

There exists $(\kappa_{H},\zeta)\in\mathbb{R}_{>0}\times\mathbb{R}_{>0}$ with $\kappa_{H}\geq\zeta$ such that, for all $k\in\mathbb{N}$ , the SPD matrix $H_{k}\in\mathbb{R}^{n\times n}$ has $\kappa_{H}I\succeq H_{k}\succeq\zeta I$ .

Observe from Assumption 3 that we are not assuming that accurate second-order information is being used by the algorithm. Hence, our convergence guarantees are of the type that may be expected for a first-order-type algorithm, although in situations when it is computationally tractable, one might find better performance if $H_{k}$ incorporates some (approximate) second-order derivative information.

3 Algorithm

In this section, we present our proposed algorithm. We state the algorithm in terms of a particular realization of it (e.g., denoting the iterate for each $k\in\mathbb{N}$ as $x_{k}$ ), although our subsequent analysis of it (starting in Section 4.3) will be written in terms of the stochastic process that the algorithm defines.

Each iteration of our algorithm proceeds as follows. First, given the current iterate $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , the algorithm computes a direction whose purpose is to determine the progress that can be made in terms of reducing a measure of violation of a linearization of the equality constraints subject to the bound constraints. This is done in a manner that regularizes the component of the direction that lies in the null space of the constraint Jacobian. Specifically, the iteration commences by computing a direction $v_{k}:=u_{k}+\nabla c(x_{k})w_{k}\in\mathbb{R}^{n}$ , where $u_{k}\in{\rm Null}(\nabla c(x_{k})^{T})$ and $\nabla c(x_{k})w_{k}\in\operatorname{Range}(\nabla c(x_{k}))$ , by solving the quadratic optimization subproblem

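$$\min_{u\in\mathbb{R}^{n},\,w\in\mathbb{R}^{m}}\ \tfrac{1}{2}\|c_{k}+\nabla c(x_{k})^{T}\nabla c(x_{k})w\|_{2}^{2}+\tfrac{1}{2}\mu_{k}\|u\|_{2}^{2}\ \ \text{s.t.}\ \ \nabla c(x_{k})^{T}u=0\ \ \text{and}\ \ x_{k}+u+\nabla c(x_{k})w\geq 0, \tag{7}$$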

where $\mu_{k}\in\mathbb{R}_{>0}$ is a user-prescribed parameter. Observe that since $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , this subproblem is always feasible, and by construction it is convex. Generally, the solution of (7) might not be unique, but in our setting it is unique since $\nabla c(x_{k})^{T}$ has full row rank. In our analysis, we show that the solution of subproblem (7) is given by $(u_{k},w_{k})=(0,0)$ if and only if the current iterate $x_{k}$ is stationary with respect to the minimization of $\tfrac{1}{2}\|c(x)\|_{2}^{2}$ over $x\in\mathbb{R}^{n}_{\geq 0}$ . This means, e.g., that if $c_{k}\neq 0$ , but the solution of (7) is $(u_{k},w_{k})=(0,0)$ —which, by the Fundamental Theorem of Linear Algebra, occurs if and only if $v_{k}=u_{k}+\nabla c(x_{k})w_{k}=0$ —then it is reasonable to terminate since $x_{k}$ is an infeasible stationary point (see (5)), as in our algorithm.
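As a small worked example (our own illustration, not taken from the paper), assume subproblem (7) minimizes $\tfrac{1}{2}\|c_{k}+\nabla c(x_{k})^{T}\nabla c(x_{k})w\|_{2}^{2}+\tfrac{1}{2}\mu_{k}\|u\|_{2}^{2}$ over $(u,w)$ subject to $\nabla c(x_{k})^{T}u=0$ and $x_{k}+u+\nabla c(x_{k})w\geq 0$, as the description above suggests. Take $n=2$, $m=1$, and the hypothetical constraint $c(x)=x_{1}+x_{2}-1$.

```latex
\begin{aligned}
&\nabla c(x) = (1,1)^{T},\quad u = t\,(1,-1)^{T}\ \text{for } t\in\mathbb{R},\quad
  v = u + \nabla c(x_k)\,w = (w+t,\ w-t)^{T}. \\[4pt]
&\text{Case } x_k=(0,0)^{T},\ c_k=-1: \\
&\quad \min_{t,w}\ \tfrac12(2w-1)^2 + \mu_k t^2
  \ \ \text{s.t.}\ \ w+t\geq 0,\ \ w-t\geq 0 \\
&\quad \Longrightarrow\ (t,w)=(0,\tfrac12),\quad v_k=(\tfrac12,\tfrac12)^{T},\quad
  c_k+\nabla c(x_k)^{T}v_k = 0. \\[4pt]
&\text{Case } x_k=(0,2)^{T},\ c_k=1: \\
&\quad \min_{t,w}\ \tfrac12(2w+1)^2 + \mu_k t^2
  \ \ \text{s.t.}\ \ w+t\geq 0,\ \ 2+w-t\geq 0 \\
&\quad \Longrightarrow\ w=-\tfrac{1}{2+\mu_k},\quad t=\tfrac{1}{2+\mu_k},\quad
  v_k=\bigl(0,\ -\tfrac{2}{2+\mu_k}\bigr)^{T},\quad
  c_k+\nabla c(x_k)^{T}v_k = \tfrac{\mu_k}{2+\mu_k}.
\end{aligned}
```

In the first case the bounds are inactive and the linearized constraint is satisfied exactly. In the second case the bound on the first variable is active (the unconstrained minimizer $w=-\tfrac12$, $t=0$ would give $x_{k}+v=(-\tfrac12,\tfrac32)^{T}$), so the solution lies on $w+t=0$, and the weight $\mu_{k}$ trades linearized feasibility against the size of the null-space component $u$.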

After computing $v_{k}\in\mathbb{R}^{n}$ by solving (7) and generating a stochastic objective gradient estimate $g_{k}\in\mathbb{R}^{n}$ (see Assumption 2), the algorithm next computes a search direction $d_{k}\in\mathbb{R}^{n}$ by solving the quadratic optimization subproblem

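$$\min_{d\in\mathbb{R}^{n}}\ g_{k}^{T}d+\tfrac{1}{2}d^{T}H_{k}d\ \ \text{s.t.}\ \ \nabla c(x_{k})^{T}d=\nabla c(x_{k})^{T}v_{k}\ \ \text{and}\ \ x_{k}+d\geq 0. \tag{8}$$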

By construction, this subproblem is feasible, and under Assumption 3 it is convex. The search direction $d_{k}$ is designed to achieve the same progress toward linearized feasibility within the nonnegative orthant that is achieved by $v_{k}$ , then within the null space of $\nabla c(x_{k})^{T}$ and the nonnegative orthant aims to minimize a (stochastically estimated) local quadratic approximation of the objective function at $x_{k}$ .

The remainder of the $k$ th iteration proceeds in a similar manner as in [1, 12]. In particular, with the $\ell_{2}$ -norm merit function in mind, namely, $\phi:\mathbb{R}^{n}\times\mathbb{R}_{>0}\to\mathbb{R}$ defined by $\phi(x,\tau)=\tau f(x)+\|c(x)\|_{2}$ , the algorithm next sets a value for the merit parameter $\tau_{k}\in\mathbb{R}_{>0}$ . This is done by considering a local model of this merit function, namely, $l:\mathbb{R}^{n}\times\mathbb{R}_{>0}\times\mathbb{R}^{n}\times\mathbb{R}^{n}\to\mathbb{R}$ defined by $l(x,\tau,g,d)=\tau(f(x)+g^{T}d)+\|c(x)+\nabla c(x)^{T}d\|_{2}$ , and in particular the reduction in this model defined for all $k\in\mathbb{N}$ by

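$$\Delta l(x_{k},\tau_{k},g_{k},d_{k}):=l(x_{k},\tau_{k},g_{k},0)-l(x_{k},\tau_{k},g_{k},d_{k})=-\tau_{k}g_{k}^{T}d_{k}+\|c_{k}\|_{2}-\|c_{k}+\nabla c(x_{k})^{T}d_{k}\|_{2} \tag{9}$$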

and setting $\tau_{k}$ such that this reduction is sufficiently large. Specifically, if $d_{k}\neq 0$ , then with user-prescribed $(\epsilon_{\tau},\sigma)\in(0,1)\times(0,1)$ , the algorithm first sets

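$$\tau_{k}^{\rm trial}\leftarrow\begin{cases}\infty & \text{if } g_{k}^{T}d_{k}+\tfrac{1}{2}d_{k}^{T}H_{k}d_{k}\leq 0\\[4pt] \dfrac{(1-\sigma)(\|c_{k}\|_{2}-\|c_{k}+\nabla c(x_{k})^{T}d_{k}\|_{2})}{g_{k}^{T}d_{k}+\tfrac{1}{2}d_{k}^{T}H_{k}d_{k}} & \text{otherwise,}\end{cases} \tag{10}$$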

then sets the merit parameter value as

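$$\tau_{k}\leftarrow\begin{cases}\tau_{k-1} & \text{if } \tau_{k-1}\leq\tau_{k}^{\rm trial}\\ \min\{(1-\epsilon_{\tau})\tau_{k-1},\tau_{k}^{\rm trial}\} & \text{otherwise.}\end{cases} \tag{11}$$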

(The value $\tau_{0}\in\mathbb{R}_{>0}$ is also prescribed by the user.) On the other hand, if $d_{k}=0$ , then the algorithm simply sets $\tau^{\rm trial}_{k}\leftarrow\infty$ and $\tau_{k}\leftarrow\tau_{k-1}$ . We show in our analysis (see Lemma 8) that this procedure for setting $\tau_{k}$ ensures that $\Delta l(x_{k},\tau_{k},g_{k},d_{k})$ is sufficiently large relative to the squared norm of the search direction and the improvement offered toward linearized feasibility. For use in the step size procedure, the algorithm next sets a value $\xi_{k}\in\mathbb{R}_{>0}$ (referred to as the ratio parameter) that acts as an estimate for a lower bound of the ratio between the model reduction and a multiple of the squared norm of the search direction. Specifically, if $d_{k}\neq 0$ , it sets

[TABLE]

where $(\xi_{0},\epsilon_{\xi})\in\mathbb{R}_{>0}\times(0,1)$ are user-prescribed parameters; see [1, 12] for further motivation. On the other hand, if $d_{k}=0$ , then it sets $\xi^{\rm trial}_{k}\leftarrow\infty$ and $\xi_{k}\leftarrow\xi_{k-1}$ .
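The merit function $\phi$ , its local model $l$ , and the model reduction can be illustrated with a minimal sketch in plain Python. The helper names and the numerical data below are hypothetical and chosen only for illustration; the reduction is computed as $l(x,\tau,g,0)-l(x,\tau,g,d)$ , which follows from the definition of $l$ above.

```python
import math

def norm2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))

def dot(a, b):
    """Inner product of two lists of floats."""
    return sum(ai * bi for ai, bi in zip(a, b))

def linearized_constraint(c_val, J, d):
    """c(x) + J d, where J = grad c(x)^T is given as a list of rows."""
    return [ci + dot(row, d) for ci, row in zip(c_val, J)]

def merit(f_val, c_val, tau):
    """phi(x, tau) = tau * f(x) + ||c(x)||_2."""
    return tau * f_val + norm2(c_val)

def model(f_val, c_val, J, tau, g, d):
    """l(x, tau, g, d) = tau * (f(x) + g^T d) + ||c(x) + J d||_2."""
    return tau * (f_val + dot(g, d)) + norm2(linearized_constraint(c_val, J, d))

def model_reduction(c_val, J, tau, g, d):
    """l(x, tau, g, 0) - l(x, tau, g, d); the f(x) terms cancel."""
    return (-tau * dot(g, d) + norm2(c_val)
            - norm2(linearized_constraint(c_val, J, d)))

# Hypothetical data: two constraints, two variables.
c_val, J = [3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]]
g, d, tau = [1.0, 0.0], [-3.0, -4.0], 0.5
reduction = model_reduction(c_val, J, tau, g, d)
```

Note that `model_reduction` never needs the objective value, which is consistent with the setting in which $f$ itself is intractable to evaluate.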

The step size selection procedure, which for all $k\in\mathbb{N}$ chooses the step size $\alpha_{k}\in\mathbb{R}_{>0}$ , can now be summarized as follows. First, suppose that $d_{k}\neq 0$ . With user-prescribed $\eta\in(0,1)$ , $\theta\in\mathbb{R}_{>0}$ , and $\{\beta_{k}\}$ with $\beta_{k}\in(0,1]$ for all $k\in\mathbb{N}$ such that

[TABLE]

and with the strongly convex function $\varphi_{k}:\mathbb{R}_{\geq 0}\to\mathbb{R}$ defined by

[TABLE]

the algorithm sets the values

[TABLE]

The algorithm then chooses the step size $\alpha_{k}$ as any value in $[\alpha_{k}^{\min},\alpha_{k}^{\max}]$ . Second, if $d_{k}=0$ , then the algorithm simply sets $\alpha_{k}^{\min}$ , $\alpha_{k}^{\max}$ , $\alpha_{k}^{\varphi}$ , and $\alpha_{k}$ all to 1.

A complete statement of our algorithm is given as Algorithm 1.

4 Analysis

In this section, we provide theoretical results for Algorithm 1. We begin by introducing common assumptions under which one can establish stationarity measures for problem (1) that are defined by solutions of (7) and/or (8). These stationarity measures allow us to connect our convergence guarantees for Algorithm 1 with stationarity conditions for (1). Then, under Assumptions 1 and 3, we prove generally applicable results pertaining to the behavior of algorithmic quantities in any run of the algorithm. These results reveal that the algorithm is well defined in the sense that any run will either terminate and return an infeasible stationary point or generate an infinite sequence of iterates. We then consider convergence properties of the algorithm in the event that the (monotonically nonincreasing) merit parameter sequence eventually produces values that are sufficiently small, yet bounded away from zero, which, as shown in our analysis, means that the sequence ultimately becomes constant at a sufficiently small value. This analysis, which includes our main convergence results for the algorithm, is provided under Assumption 4 stated in Section 4.3. We follow this analysis with a section on theoretical results related to the occurrence of the event in Assumption 4. As in [1] for the equality-constraints-only setting, this discussion illuminates the fact that while the event in Assumption 4 is not always guaranteed to occur due to the looseness of our assumptions about properties of the stochastic gradient estimates, the event represents likely behavior in practice, which shows that our convergence results about the algorithm are meaningful for real-world situations. We conclude this section with a discussion of the behavior of the algorithm in the deterministic setting, i.e., when the true gradient of the objective is employed in all iterations. This discussion is meant to provide confidence to a user that our algorithm is based on one that has state-of-the-art convergence properties under common assumptions in the deterministic setting.

4.1 Subproblems and Stationarity Measures

We begin by showing that subproblem (7) yields a zero solution if and only if the point defining the subproblem is feasible for problem (1) or an infeasible stationary point.

Lemma 1.

Suppose that Assumption 1 holds, $x\in{\cal X}\cap\mathbb{R}^{n}_{\geq 0}$ , and, given $\mu\in\mathbb{R}_{>0}$ , consider the quadratic optimization problem (recall (7))

[TABLE]

Then, the unique optimal solution of problem (16) is $(u,w)=(0,0)$ if and only if $x$ is feasible for problem (1) or an infeasible stationary point (i.e., it satisfies (5)), whereas $(u,w)\neq(0,0)$ if and only if $\|c(x)\|_{2}>\|c(x)+\nabla c(x)^{T}\nabla c(x)w\|_{2}$ .

Proof.

Suppose the conditions of the lemma hold and let $(u,w)$ be the unique optimal solution of (16). Since $x\in\mathbb{R}^{n}_{\geq 0}$ , it follows that $(0,0)$ is feasible for (16). In addition, necessary and sufficient optimality conditions for (16) are that, corresponding to $(u,w)\in\mathbb{R}^{n}\times\mathbb{R}^{m}$ , there exists $(\gamma,\delta)\in\mathbb{R}^{m}\times\mathbb{R}^{n}$ with

[TABLE]

If $(u,w)=(0,0)$ , then it follows from (17) that

[TABLE]

Since $\nabla c(x)^{T}$ has full row rank, (18) implies $\gamma=(\nabla c(x)^{T}\nabla c(x))^{-1}\nabla c(x)^{T}\delta=c(x)$ , $\delta=\nabla c(x)c(x)$ , and $0\leq\nabla c(x)c(x)\perp x\geq 0$ , which from (5) means that $x$ is either feasible or an infeasible stationary point, as desired. On the other hand, if $x$ is either feasible or an infeasible stationary point, meaning $0\leq\nabla c(x)c(x)\perp x\geq 0$ , then $u=0$ , $w=0$ , $\gamma=c(x)$ , and $\delta=\nabla c(x)c(x)$ satisfy (17), and this solution (i.e., $(u,w)=(0,0)$ ) is unique since the objective of (16) is strongly convex.

Now let us show that the unique optimal solution of (16) is $(u,w)\neq(0,0)$ if and only if $\|c(x)\|_{2}>\|c(x)+\nabla c(x)^{T}\nabla c(x)w\|_{2}$ . If $\|c(x)\|_{2}>\|c(x)+\nabla c(x)^{T}\nabla c(x)w\|_{2}$ , then $w\neq 0$ follows trivially, giving the desired conclusion. To prove the reverse implication, let us consider two cases. If $u\neq 0$ , then, since $(0,0)$ is feasible for (16),

[TABLE]

as desired. Second, if $u=0$ and $w\neq 0$ , then $w$ is the minimizer of the strongly convex objective $\tfrac{1}{2}\|c(x)+\nabla c(x)^{T}\nabla c(x)w\|_{2}^{2}$ subject to $x+\nabla c(x)w\geq 0$ . Since the zero vector is feasible for this problem (because $x\geq 0$ ), $w\neq 0$ means that $\tfrac{1}{2}\|c(x)\|_{2}^{2}>\tfrac{1}{2}\|c(x)+\nabla c(x)^{T}\nabla c(x)w\|_{2}^{2}$ , as desired. ∎
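The complementarity condition $0\leq\nabla c(x)c(x)\perp x\geq 0$ that appears in the proof can be checked numerically. The following is a minimal sketch in plain Python; the function name, the tolerance `tol`, and the example constraint $c(x)=x_{1}+x_{2}+1$ are illustrative assumptions and do not come from the paper.

```python
def satisfies_condition_5(x, grad_c, c_val, tol=1e-10):
    """Check 0 <= grad c(x) c(x), x >= 0, and componentwise complementarity.

    grad_c is the n-by-m matrix grad c(x), given as a list of n rows;
    c_val is the m-vector c(x). All comparisons are made up to tol.
    """
    # r = grad c(x) c(x), an n-vector
    r = [sum(gij * cj for gij, cj in zip(row, c_val)) for row in grad_c]
    nonnegative = all(xi >= -tol for xi in x) and all(ri >= -tol for ri in r)
    complementary = all(abs(xi * ri) <= tol for xi, ri in zip(x, r))
    return nonnegative and complementary

# Hypothetical example with n = 2 and m = 1: c(x) = x1 + x2 + 1 has no
# zero in the nonnegative orthant, so every such x is infeasible.
grad_c = [[1.0], [1.0]]  # gradient of c, constant in x

def c(x):
    return [x[0] + x[1] + 1.0]

at_origin = satisfies_condition_5([0.0, 0.0], grad_c, c([0.0, 0.0]))
at_other = satisfies_condition_5([1.0, 0.0], grad_c, c([1.0, 0.0]))
```

In this example the origin minimizes $\tfrac{1}{2}\|c(x)\|_{2}^{2}$ over the nonnegative orthant, so `at_origin` is `True` (an infeasible stationary point), whereas the point $(1,0)$ violates complementarity and `at_other` is `False`.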

We now show that, under common assumptions and given $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , the quantity $\|v_{k}\|_{2}^{2}$ , where $v_{k}\in\mathbb{R}^{n}$ solves subproblem (7), represents a stationarity measure with respect to the problem to minimize $\tfrac{1}{2}\|c(x)\|_{2}^{2}$ subject to $x\in\mathbb{R}^{n}_{\geq 0}$ . (The assumption in the lemma that $\mu_{k}=\mu\in\mathbb{R}_{>0}$ for all $k\in\mathbb{N}$ could be relaxed; see Remark 1 at the end of this subsection. We consider this case for the sake of brevity.)

Lemma 2.

Suppose that Assumption 1 holds and there exists infinite ${\cal S}\subseteq\mathbb{N}$ such that for some sequence $\{x_{k}\}\subset{\cal X}\cap\mathbb{R}^{n}_{\geq 0}$ one finds $\{x_{k}\}_{k\in{\cal S}}\to x_{*}$ for some $x_{*}\in{\cal X}\cap\mathbb{R}^{n}_{\geq 0}$ where, with ${\cal A}(x):=\{i\in[n]:x_{i}=0\}$ , $I_{{\cal A}(x)}$ denoting the matrix composed of rows of $I\in\mathbb{R}^{n\times n}$ corresponding to indices in ${\cal A}(x)$ , and $\nabla c(x)_{{\cal A}(x)}$ denoting the matrix composed of rows of $\nabla c(x)$ corresponding to indices in ${\cal A}(x)$ , one finds that

(i) $[\nabla c(x_{*})c(x_{*})]_{i}>0$ for all $i\in{\cal A}(x_{*})$ and

(ii) the following matrix has full row rank: $\begin{bmatrix}0&\nabla c(x_{*})^{T}\\ \nabla c(x_{*})_{{\cal A}(x_{*})}&I_{{\cal A}(x_{*})}\end{bmatrix}$ .

Then, with $\mu_{k}=\mu\in\mathbb{R}_{>0}$ for all $k\in\mathbb{N}$ , and with $(u_{k},w_{k})$ solving subproblem (7) and $v_{k}:=u_{k}+\nabla c(x_{k})w_{k}$ for all $k\in\mathbb{N}$ , it follows that $x_{*}$ satisfies the stationarity conditions (5) if and only if $\{v_{k}\}_{k\in{\cal S}}\to 0$ .

Proof.

Let ${\cal A}_{*}:={\cal A}(x_{*})$ and $j(x):=\nabla c(x)^{T}$ and consider the linear system

[TABLE]

Since, under the conditions of the lemma, the matrix in this linear system is nonsingular when $x=x_{*}$ (e.g., this follows from [39, Theorem 1.5.1] and (ii)), it follows that there exists an open ball ${\cal B}_{*}$ centered at $x_{*}$ such that, for each $x\in{\cal B}_{*}\cap{\cal X}\cap\mathbb{R}^{n}_{\geq 0}$ , this linear system has a unique solution, call it $(w(x),u(x),\gamma(x),\delta_{{\cal A}_{*}}(x))$ , and (due to continuity of the left-hand-side matrix and right-hand-side vector with respect to $x$ ) this solution varies continuously over ${\cal B}_{*}\cap{\cal X}\cap\mathbb{R}^{n}_{\geq 0}$ . If $x_{*}$ satisfies (5), then it follows that $(0,0,c(x_{*}),[j(x_{*})^{T}c(x_{*})]_{{\cal A}_{*}})$ (with $[j(x_{*})^{T}c(x_{*})]_{{\cal A}_{*}}>0$ ) is the unique solution of the system at $x=x_{*}$ , and for all $x\in{\cal B}_{*}\cap{\cal X}\cap\mathbb{R}^{n}_{\geq 0}$ the solution of the system in conjunction with $\delta_{i}=0$ for all $i\notin{\cal A}_{*}$ satisfies (17), meaning that the components $(u(x),w(x))$ represent the unique optimal solution of problem (16). Hence, with respect to the quantities in the lemma and using Assumption 1, one finds that $\{v_{k}\}_{k\in{\cal S}}\to 0$ , as desired. To prove the reverse implication, suppose that $\{v_{k}\}_{k\in{\cal S}}\to 0$ , from which it follows by the Fundamental Theorem of Linear Algebra and (ii) that $\{(u_{k},w_{k})\}_{k\in{\cal S}}\to 0$ . For all $k\in{\cal S}$ , let $(u_{k},w_{k},\gamma_{k},\delta_{k})$ be a primal-dual optimal solution of (7) (satisfying optimality conditions of the form in (17)). One finds under the conditions of the lemma that, for all sufficiently large $k\in{\cal S}$ , this solution has $[\delta_{k}]_{i}=0$ for all $i\notin{\cal A}(x_{*})$ whereas $(u_{k},w_{k},\gamma_{k},[\delta_{k}]_{{\cal A}_{*}})$ solves the linear system above at $x=x_{k}$ . Since, by the arguments above, this solution varies continuously within ${\cal B}_{*}\cap{\cal X}\cap\mathbb{R}^{n}_{\geq 0}$ , the fact that $\{x_{k}\}_{k\in{\cal S}}\to x_{*}$ implies that $x_{*}$ satisfies (5), as desired. ∎

In fact, under the conditions of the prior lemma, the quantity $\|c_{k}\|_{2}-\|c_{k}+\nabla c(x_{k})^{T}v_{k}\|_{2}$ also represents a stationarity measure for the problem to minimize $\tfrac{1}{2}\|c(x)\|_{2}^{2}$ subject to $x\in\mathbb{R}^{n}_{\geq 0}$ . This is shown in the following lemma.

Lemma 3.

Suppose that Assumption 1 holds, $\mu_{k}=\mu\in\mathbb{R}_{>0}$ for all $k\in\mathbb{N}$ , and there exists $\lambda\in\mathbb{R}_{>0}$ and infinite ${\cal S}_{\lambda}\subseteq\mathbb{N}$ such that for some $\{x_{k}\}\subset{\cal X}\cap\mathbb{R}^{n}_{\geq 0}$ one finds $\nabla c(x_{k})^{T}\nabla c(x_{k})\succeq\lambda I$ for all $k\in{\cal S}_{\lambda}$ . Then, there exists $\kappa_{v,2}\in\mathbb{R}_{>0}$ such that

[TABLE]

where $v_{k}=u_{k}+\nabla c(x_{k})w_{k}$ with $(u_{k},w_{k})$ being the unique optimal solution of (7). Consequently, under the conditions of Lemma 2 with ${\cal S}$ defined in that lemma, if ${\cal S}_{\lambda}$ defined as all sufficiently large indices in ${\cal S}$ satisfies the conditions above, then it follows that $\{v_{k}\}_{k\in{\cal S}}\to 0$ if and only if $\{\|c_{k}\|_{2}-\|c_{k}+\nabla c(x_{k})^{T}v_{k}\|_{2}\}_{k\in{\cal S}}\to 0$ .

Proof.

Consider arbitrary $k\in{\cal S}_{\lambda}$ . Under the stated conditions with $j_{k}:=\nabla c(x_{k})^{T}$ , Lemma 1 implies $\|c_{k}+j_{k}v_{k}\|_{2}\leq\|c_{k}\|_{2}$ . Hence, by Assumption 1,

[TABLE]

If $v_{k}=0$ , then (19) follows trivially. Hence, we may proceed under the assumption that $v_{k}\neq 0$ , which by $v_{k}=u_{k}+j_{k}^{T}w_{k}$ and the Fundamental Theorem of Linear Algebra means that $u_{k}\neq 0$ and/or $w_{k}\neq 0$ . If $w_{k}=0$ , then it follows by construction of (7) that $u_{k}=0$ as well. Hence, we may conclude from $v_{k}\neq 0$ that, in fact, $w_{k}\neq 0$ . Since $(u_{k},w_{k})$ is the unique optimal solution of (7), it follows that $\alpha_{k}^{*}=1$ is the optimal solution of the strongly convex quadratic optimization problem

[TABLE]

which further implies (since an optimality condition of (21) is that the derivative of its objective function with respect to $\alpha$ is less than or equal to zero at $\alpha_{k}^{*}=1$ ) that $-c_{k}^{T}j_{k}j_{k}^{T}w_{k}\geq\|j_{k}j_{k}^{T}w_{k}\|_{2}^{2}+\mu_{k}\|u_{k}\|_{2}^{2}$ . Consequently, one finds

[TABLE]

With (20) and (22), it follows from Assumption 1, the conditions of the lemma, and the bound $\|w_{k}\|_{2}\geq\|j_{k}^{T}w_{k}\|_{2}/\|j_{k}^{T}\|_{2}$ (a consequence of the Cauchy–Schwarz inequality) that

[TABLE]

which gives (19), as desired. ∎

Next, we show that if the point defining subproblem (8) is not an infeasible stationary point for problem (1), then the subproblem with $g_{k}=\nabla f(x_{k})$ yields a zero solution if and only if the point defining the subproblem is stationary for (1).

Lemma 4.

Suppose that Assumption 1 holds and, with respect to $x\in{\cal X}\cap\mathbb{R}^{n}_{\geq 0}$ , one finds $c(x)=0$ . Given $H\in\mathbb{R}^{n\times n}$ with $H\succ 0$ , consider (recall (8))

[TABLE]

Then, one finds that the optimal solution of problem (23) is $d=0$ if and only if $x$ is a KKT point (i.e., first-order stationary point) for problem (1).

Proof.

Suppose the conditions of the lemma hold and let $d$ be the optimal solution of (23). Since $x\in\mathbb{R}^{n}_{\geq 0}$ and $c(x)=0$ , it follows that the zero vector is feasible for (23). In addition, necessary and sufficient optimality conditions for subproblem (23) are that, corresponding to $d\in\mathbb{R}^{n}$ , there exist $y\in\mathbb{R}^{m}$ and $z\in\mathbb{R}^{n}$ such that

[TABLE]

If $d=0$ , then since $c(x)=0$ it follows that $(x,y,z)$ satisfies (4), as desired. On the other hand, if $x$ is a KKT point for (1), then there exist $y\in\mathbb{R}^{m}$ and $z\in\mathbb{R}^{n}$ such that $(x,y,z)$ satisfies (4), which in turn means that $d=0$ along with $(y,z)$ satisfies (24), and this solution is unique since the objective of (23) is strongly convex. ∎

We conclude this subsection by showing that, under common assumptions and given $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , the quantity $\|d_{k}\|_{2}^{2}$ , where $d_{k}\in\mathbb{R}^{n}$ solves subproblem (8) with $g_{k}=\nabla f(x_{k})$ , represents a stationarity measure with respect to (1). (The assumption in the lemma that $H_{k}=H$ for some $H\succ 0$ for all $k\in\mathbb{N}$ could be relaxed; see Remark 1 at the end of this subsection. We consider this case for the sake of brevity.)

Lemma 5.

Suppose that Assumption 1 holds and there exists infinite ${\cal S}\subseteq\mathbb{N}$ such that for some sequence $\{x_{k}\}\subset{\cal X}\cap\mathbb{R}^{n}_{\geq 0}$ one finds $\{x_{k}\}_{k\in{\cal S}}\to x_{*}$ for some $x_{*}\in{\cal X}\cap\mathbb{R}^{n}_{\geq 0}$ with $c(x_{*})=0$ and, with the notation in Lemma 2, one finds that

(i) $-\nabla f(x_{*})=\nabla c(x_{*})y-I_{{\cal A}(x_{*})}^{T}z_{{\cal A}(x_{*})}$ for some $(y,z_{{\cal A}(x_{*})})\in\mathbb{R}^{m}\times\mathbb{R}^{|{\cal A}(x_{*})|}_{>0}$ and

(ii) the following matrix has full row rank: $\begin{bmatrix}\nabla c(x_{*})^{T}\\ I_{{\cal A}(x_{*})}\end{bmatrix}$ .

Then, with $H_{k}=H$ for some $H\succ 0$ for all $k\in\mathbb{N}$ , and with $d_{k}$ solving (8) with $g_{k}=\nabla f(x_{k})$ for all $k\in\mathbb{N}$ , $x_{*}$ satisfies (4) if and only if $\{\|d_{k}\|_{2}^{2}\}_{k\in{\cal S}}\to 0$ .

Proof.

Letting ${\cal A}_{*}:={\cal A}(x_{*})$ and considering the linear system of equations

[TABLE]

the proof follows under the conditions of the lemma using the same line of deduction as the proof of Lemma 2, which we omit for the sake of brevity. ∎

Remark 1.

One might relax the condition in Lemma 2 that $\mu=\mu_{k}$ for all $k\in\mathbb{N}$ and similarly relax the condition in Lemma 5 that $H_{k}=H\succ 0$ for all $k\in\mathbb{N}$ , such as by requiring merely that $\{\mu_{k}\}_{k\in{\cal S}}$ and $\{H_{k}\}_{k\in{\cal S}}$ have bounded subsequences that converge to some $\mu\in\mathbb{R}_{>0}$ and $H\succ 0$ , respectively. In these cases, the “if and only if” statements would be replaced by “if” statements, which in fact is all that is needed for our subsequent analysis and discussions. Nevertheless, for brevity in the proofs, we provide the conditions that offer the stronger conclusions in these lemmas.

4.2 General Algorithm Behavior

We now prove generally applicable results that hold for every run of Algorithm 1. Our initial results in this section presume that iteration $k\in\mathbb{N}$ is reached and certain properties hold with respect to algorithmic quantities (e.g., $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ ), although we ultimately prove in Lemma 13 that, in fact, these facts are guaranteed, i.e., they hold for any run for any generated $k\in\mathbb{N}$ . It is worthwhile to emphasize that the results in this section merely require that $g_{k}\in\mathbb{R}^{n}$ for all $k\in\mathbb{N}$ , which means, for example, that Assumption 2 is not needed in this section. All results that depend on the properties and effects of the stochastic gradient estimates are found in the subsequent subsection, i.e., Section 4.3.

Our first lemma follows directly from Lemma 1, so it is stated without proof.

Lemma 6.

Suppose that Assumption 1 holds. Then, in any run of the algorithm such that iteration $k\in\mathbb{N}$ is reached and $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , it holds that $v_{k}=0$ if and only if $x_{k}$ satisfies (5), i.e., $x_{k}$ is either feasible or an infeasible stationary point, whereas $v_{k}\neq 0$ if and only if $\|c_{k}\|_{2}>\|c_{k}+\nabla c(x_{k})^{T}v_{k}\|_{2}$ .

Our next result shows that, in any iteration in which the current iterate $x_{k}$ is in the nonnegative orthant and $\tau_{k-1}>0$ , the merit parameter is either kept at the same value or decreased, and, if it is decreased, then it is decreased below a constant fraction times its former value. As in other SQP methods with such a feature, this ensures that if the merit parameter sequence does not vanish (i.e., its limiting value is nonzero), then it eventually remains at a constant positive value; see Lemma 13.

Lemma 7.

Suppose that Assumption 1 holds. In any run of the algorithm such that line 5 of iteration $k\in\mathbb{N}$ is reached, $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , and $\tau_{k-1}\in\mathbb{R}_{>0}$ , it holds that $0<\tau_{k}\leq\tau_{k-1}$ , where if $\tau_{k}<\tau_{k-1}$ , then $\tau_{k}\leq(1-\epsilon_{\tau})\tau_{k-1}$ .

Proof.

Consider an arbitrary run in which line 5 of iteration $k\in\mathbb{N}$ is reached, $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , and $\tau_{k-1}\in\mathbb{R}_{>0}$ . Let us show that $0<\tau_{k}\leq\tau_{k-1}$ , in which case the fact that $\tau_{k}<\tau_{k-1}$ implies $\tau_{k}\leq(1-\epsilon_{\tau})\tau_{k-1}$ follows from (11). Toward this end, let us next show that $\tau^{\rm trial}_{k}>0$ . By the constraints of (8), (10), and Lemma 6, one finds that $\tau^{\rm trial}_{k}>0$ whenever $\|c_{k}\|_{2}-\|c_{k}+\nabla c(x_{k})^{T}v_{k}\|_{2}>0$ . Hence, to show that one always finds $\tau^{\rm trial}_{k}>0$ , all that remains is to consider the case when $\|c_{k}\|_{2}-\|c_{k}+\nabla c(x_{k})^{T}v_{k}\|_{2}=0$ . In this case, it follows from Lemma 6 that $v_{k}=0$ , meaning that $d=0$ is feasible for (8). This, in turn, means that $g_{k}^{T}d_{k}+\tfrac{1}{2}d_{k}^{T}H_{k}d_{k}\leq 0$ , so by (10) one finds that $\tau^{\rm trial}_{k}=\infty>0$ . Since it has been shown that $\tau^{\rm trial}_{k}>0$ , the fact that $0<\tau_{k}\leq\tau_{k-1}$ now follows directly from (11), completing the proof. ∎
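The monotonicity property in Lemma 7 can be illustrated with a small sketch in plain Python. The update rule below is an assumption made for illustration: it is one rule that has the two properties stated in the lemma (no increase, and any decrease is to at most $(1-\epsilon_{\tau})\tau_{k-1}$ ), and it is not claimed to reproduce rule (11) exactly. The function name and sample values are hypothetical.

```python
import math

def update_merit_parameter(tau_prev, tau_trial, eps_tau):
    """Illustrative merit parameter update (assumed rule, see lead-in).

    Keep tau_prev when it does not exceed the trial value; otherwise
    decrease it to at most (1 - eps_tau) * tau_prev and at most tau_trial.
    """
    if tau_prev <= tau_trial:
        return tau_prev
    return min((1.0 - eps_tau) * tau_prev, tau_trial)

# Hypothetical values: a trial value of infinity keeps tau unchanged,
# while a small trial value forces a decrease.
kept = update_merit_parameter(1.0, math.inf, 0.1)
shrunk_by_factor = update_merit_parameter(1.0, 0.95, 0.1)
shrunk_to_trial = update_merit_parameter(1.0, 0.2, 0.1)
```

With any rule of this form and $\tau^{\rm trial}_{k}>0$ , the new value stays positive, which is the role played by the positivity argument for $\tau^{\rm trial}_{k}$ in the proof.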

We now show that the model reduction offered by the computed search direction satisfies a lower bound with the properties stated in our algorithm development.

Lemma 8.

Suppose that Assumptions 1 and 3 hold. In any run of the algorithm such that line 5 is reached in iteration $k\in\mathbb{N}$ , $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , and $\tau_{k}\in\mathbb{R}_{>0}$ , one finds with $\zeta$ from Assumption 3 that

[TABLE]

and, if $d_{k}\neq 0$ , then $\Delta l(x_{k},\tau_{k},g_{k},d_{k})>0$ .

Proof.

Consider an arbitrary run in which line 5 of iteration $k\in\mathbb{N}$ is reached, $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , and $\tau_{k}\in\mathbb{R}_{>0}$ . By (9) and Assumption 3, (25) is implied by

[TABLE]

If $g_{k}^{T}d_{k}+\tfrac{1}{2}d_{k}^{T}H_{k}d_{k}\leq 0$ , then (26) holds due to Lemma 6 and the fact that (8) ensures $\nabla c(x_{k})^{T}v_{k}=\nabla c(x_{k})^{T}d_{k}$ . On the other hand, if $g_{k}^{T}d_{k}+\tfrac{1}{2}d_{k}^{T}H_{k}d_{k}>0$ , then one finds by the update of the merit parameter, namely, (10) and (11), that

[TABLE]

from which (26) follows again. Finally, that $d_{k}\neq 0$ implies $\Delta l(x_{k},\tau_{k},g_{k},d_{k})>0$ follows from (25), $\tau_{k}\in\mathbb{R}_{>0}$ , and since $\zeta\in\mathbb{R}_{>0}$ in Assumption 3. ∎

Our next result is that, under the same conditions as our previous lemmas and under the assumption that $\xi_{k-1}\in\mathbb{R}_{>0}$ , the ratio parameter is either kept at the same value or decreased, and, like the merit parameter, if it is decreased, then it is decreased at least below a constant fraction times its previous value.

Lemma 9.

Suppose that Assumptions 1 and 3 hold. In any run of the algorithm such that line 5 is reached in iteration $k\in\mathbb{N}$ , $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , $\tau_{k}\in\mathbb{R}_{>0}$ , and $\xi_{k-1}\in\mathbb{R}_{>0}$ , it holds that $0<\xi_{k}\leq\xi_{k-1}$ , where if $\xi_{k}<\xi_{k-1}$ , then $\xi_{k}\leq(1-\epsilon_{\xi})\xi_{k-1}$ .

Proof.

Consider an arbitrary run in which line 5 of iteration $k\in\mathbb{N}$ is reached, $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , $\tau_{k}\in\mathbb{R}_{>0}$ , and $\xi_{k-1}\in\mathbb{R}_{>0}$ . Let us show that $0<\xi_{k}\leq\xi_{k-1}$ , in which case the fact that $\xi_{k}<\xi_{k-1}$ implies $\xi_{k}\leq(1-\epsilon_{\xi})\xi_{k-1}$ follows from (12). Toward this end, observe that if $d_{k}=0$ , then the algorithm sets $\xi_{k}\leftarrow\xi_{k-1}>0$ , which is consistent with the desired conclusion. On the other hand, if $d_{k}\neq 0$ , then by (12), $\tau_{k}\in\mathbb{R}_{>0}$ , Lemma 6, the fact that (8) ensures $\nabla c(x_{k})^{T}v_{k}=\nabla c(x_{k})^{T}d_{k}$ , and Lemma 8,

[TABLE]

Hence, by (12), the desired conclusion follows. ∎

Next, we prove bounds for the step size computed in the algorithm.

Lemma 10.

Suppose that Assumptions 1 and 3 hold. In any run of the algorithm such that line 5 is reached in iteration $k\in\mathbb{N}$ , $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , $\tau_{k}\in\mathbb{R}_{>0}$ , and $\xi_{k}\in\mathbb{R}_{>0}$ , it holds that $0<\alpha_{k}^{\min}\leq\alpha_{k}^{\max}\leq\min\{1,\alpha_{k}^{\varphi}\}$ , and, so, $x_{k+1}\in\mathbb{R}^{n}_{\geq 0}$ .

Proof.

Consider an arbitrary run of the algorithm in which line 5 of iteration $k\in\mathbb{N}$ is reached, $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , $\tau_{k}\in\mathbb{R}_{>0}$ , and $\xi_{k}\in\mathbb{R}_{>0}$ . Let us show that $0<\alpha_{k}^{\min}\leq\alpha_{k}^{\max}\leq 1$ , in which case the fact that $x_{k+1}\in\mathbb{R}^{n}_{\geq 0}$ follows from $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , the fact that the constraints of (8) ensure that $x_{k}+d_{k}\in\mathbb{R}^{n}_{\geq 0}$ , and since the step size has $\alpha_{k}\in[\alpha_{k}^{\min},\alpha_{k}^{\max}]\subset(0,1]$ . Toward this end, observe that if $d_{k}=0$ , then the algorithm yields $\alpha_{k}=\alpha_{k}^{\min}=\alpha_{k}^{\max}=\alpha_{k}^{\varphi}=1$ , so the conclusion follows trivially. Hence, let us assume $d_{k}\neq 0$ . Observe that from (13), the algorithm uses $\alpha_{k}^{\min}$ with

[TABLE]

Now observing (15), which shows $\alpha_{k}^{\max}\leq\min\{1,\alpha_{k}^{\varphi}\}$ , one finds that all that remains is to prove that $\alpha_{k}^{\min}\leq\alpha_{k}^{\varphi}$ . For this purpose, let us introduce

[TABLE]

where $\alpha^{\rm suff}_{k}\in(0,1]$ follows by $\beta_{k}\in(0,1]$ , Lemma 8, and $d_{k}\neq 0$ . To show that $\alpha_{k}^{\min}\leq\alpha_{k}^{\varphi}$ , our aim is to show that $\alpha_{k}^{\min}\leq\alpha^{\rm suff}_{k}\leq\alpha_{k}^{\varphi}$ . First, from (12), one finds

[TABLE]

Combining (28) and (29), one finds that $\alpha_{k}^{\min}\leq\alpha^{\rm suff}_{k}$ , as desired. Now, toward proving that $\alpha^{\rm suff}_{k}\leq\alpha_{k}^{\varphi}$ , let us first show that $\varphi_{k}(\alpha^{\rm suff}_{k})\leq 0$ . From the triangle inequality, the fact that $\alpha^{\rm suff}_{k}\in(0,1]$ , and (14), it follows that

[TABLE]

Therefore, by (15), it follows that $\alpha^{\rm suff}_{k}\leq\alpha_{k}^{\varphi}$ . ∎

Our next lemma shows an upper bound on the change in the merit function. In the lemma and throughout the rest of the paper, for any $k\in\mathbb{N}$ such that line 5 is reached we let $d^{\rm true}_{k}\in\mathbb{R}^{n}$ denote the solution of (8) when $g_{k}$ is replaced by $\nabla f(x_{k})$ .

Lemma 11.

Suppose that Assumptions 1 and 3 hold. In any run of the algorithm such that line 5 is reached in iteration $k\in\mathbb{N}$ , $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , $\tau_{k}\in\mathbb{R}_{>0}$ , and $\alpha_{k}\in(0,\alpha_{k}^{\varphi}]$ , it holds that

[TABLE]

Proof.

Consider an arbitrary run of the algorithm in which line 5 of iteration $k\in\mathbb{N}$ is reached, $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , $\tau_{k}\in\mathbb{R}_{>0}$ , and $\alpha_{k}\in(0,\alpha_{k}^{\varphi}]$ . By Assumption 1 (which led to (3)), (8) (which implies $c_{k}+\nabla c(x_{k})^{T}d_{k}=c_{k}+\nabla c(x_{k})^{T}d^{\rm true}_{k}$ ), (9), (14), and the fact that $0<\alpha_{k}\leq\alpha_{k}^{\varphi}$ (which means $\varphi_{k}(\alpha_{k})\leq 0$ ), it follows that

[TABLE]

which shows the desired conclusion. ∎

We now show that each search direction—and, similarly, the search direction that would be computed if the true gradient of the objective function were used in place of the stochastic gradient estimate—can be viewed as a projection of the unconstrained minimizer of the objective of (8) onto the feasible region of (8).

Lemma 12.

Suppose that Assumptions 1 and 3 hold. In any run of the algorithm such that line 5 is reached in iteration $k\in\mathbb{N}$ , $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ , and with

[TABLE]

it holds that $d_{k}=\operatorname{Proj}_{k}(-H_{k}^{-1}g_{k})$ and $d^{\rm true}_{k}=\operatorname{Proj}_{k}(-H_{k}^{-1}\nabla f(x_{k}))$ .

Proof.

Consider an arbitrary run of the algorithm in which line 5 of iteration $k\in\mathbb{N}$ is reached and $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ . The desired conclusion follows from the facts that ${\cal D}_{k}$ is convex and, under Assumption 3, $H_{k}$ is SPD; in particular, one finds that

[TABLE]

and similarly with respect to $d^{\rm true}_{k}$ with $g_{k}$ replaced by $\nabla f(x_{k})$ . ∎
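As a worked step behind this projection property, here is a sketch under the assumption that $\operatorname{Proj}_{k}$ denotes projection onto ${\cal D}_{k}$ , the feasible region of (8), in the norm $\|d\|_{H_{k}}:=\sqrt{d^{T}H_{k}d}$ induced by the SPD matrix $H_{k}$ . Completing the square gives, for any $d\in\mathbb{R}^{n}$ ,

$$g_{k}^{T}d+\tfrac{1}{2}d^{T}H_{k}d=\tfrac{1}{2}(d+H_{k}^{-1}g_{k})^{T}H_{k}(d+H_{k}^{-1}g_{k})-\tfrac{1}{2}g_{k}^{T}H_{k}^{-1}g_{k}=\tfrac{1}{2}\|d-(-H_{k}^{-1}g_{k})\|_{H_{k}}^{2}-\tfrac{1}{2}g_{k}^{T}H_{k}^{-1}g_{k}.$$

The last term does not depend on $d$ , so minimizing the objective of (8) over ${\cal D}_{k}$ is the same as finding the point of ${\cal D}_{k}$ that is closest to $-H_{k}^{-1}g_{k}$ in this norm. The same identity with $\nabla f(x_{k})$ in place of $g_{k}$ covers $d^{\rm true}_{k}$ .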

We are now prepared to prove the following lemma, which shows that the algorithm is well defined and either terminates finitely with an infeasible stationary point or generates an infinite sequence of iterates with certain critical properties of the simultaneously generated algorithmic sequences. The lemma also reveals that the monotonically nonincreasing merit parameter sequence either vanishes or ultimately remains constant, and it reveals that the monotonically nonincreasing ratio parameter sequence ultimately remains constant at a value that is greater than or equal to a positive real number that is defined uniformly across all runs of the algorithm.

Lemma 13.

Suppose that Assumptions 1 and 3 hold. In any run, either the algorithm terminates finitely with an infeasible stationary point or it performs an infinite number of iterations such that, for all $k\in\mathbb{N}$ , it holds that

(a) $x_{k}\in\mathbb{R}^{n}_{\geq 0}$ ,

(b) $v_{k}=0$ if and only if $x_{k}$ satisfies (5),

(c) $v_{k}\neq 0$ if and only if $\|c_{k}\|_{2}>\|c_{k}+\nabla c(x_{k})^{T}v_{k}\|_{2}$ ,

(d) $0<\tau_{k}\leq\tau_{k-1}<\infty$ ,

(e) $\tau_{k}<\tau_{k-1}$ if and only if $\tau_{k}\leq(1-\epsilon_{\tau})\tau_{k-1}$ ,

(f) (25) holds,

(g) $d_{k}\neq 0$ if and only if $\Delta l(x_{k},\tau_{k},g_{k},d_{k})>0$ ,

(h) $0<\xi_{k}\leq\xi_{k-1}<\infty$ ,

(i) $\xi_{k}<\xi_{k-1}$ if and only if $\xi_{k}\leq(1-\epsilon_{\xi})\xi_{k-1}$ , and

(j) $0<\alpha_{k}^{\min}\leq\alpha_{k}^{\max}\leq\min\{1,\alpha_{k}^{\varphi}\}$ .

In addition, in any run that does not terminate finitely, it holds that

(k) either $\{\tau_{k}\}\searrow 0$ or there exists $k_{\tau}\in\mathbb{N}$ and $\tau_{\min}\in\mathbb{R}_{>0}$ such that $\tau_{k}=\tau_{\min}$ for all $k\in\mathbb{N}$ with $k\geq k_{\tau}$ , and

(l) there exist $k_{\xi}\in\mathbb{N}$ and $\xi_{\min}\in\mathbb{R}_{>0}$ with $\xi_{\min}\geq\tfrac{1}{2}\zeta(1-\epsilon_{\xi})$ such that $\xi_{k}=\xi_{\min}$ for all $k\in\mathbb{N}$ with $k\geq k_{\xi}$ .

Proof.

Given the initialization of the algorithm, statements $(a)$ – $(j)$ follow by induction from Lemmas 6–10. Statement $(k)$ follows from statements $(d)$ and $(e)$ . Finally, to prove statement $(l)$ , consider arbitrary $k\in\mathbb{N}$ in a run that does not terminate finitely and note that if $d_{k}=0$ , then $\xi^{\rm trial}_{k}\leftarrow\infty$ , and if $d_{k}\neq 0$ , then $\xi^{\rm trial}_{k}$ satisfies (27), meaning that $\xi^{\rm trial}_{k}\geq\tfrac{1}{2}\zeta$ . Consequently, by (12), $\xi_{k}<\xi_{k-1}$ only if $\xi_{k-1}>\tfrac{1}{2}\zeta$ . This, along with statements $(h)$ and $(i)$ , leads to the conclusion. ∎
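The reason a monotonically nonincreasing sequence of this kind must eventually become constant can be made concrete: each decrease shrinks the value by at least the factor $(1-\epsilon_{\xi})$ , so only finitely many decreases fit above a positive lower bound. The sketch below, in plain Python, counts the largest possible number of decreases; the function name and the sample values of $\xi_{0}$ , $\zeta$ , and $\epsilon_{\xi}$ are hypothetical.

```python
def max_decreases(initial, lower_bound, eps):
    """Largest j such that initial * (1 - eps)**j >= lower_bound.

    This bounds how often a sequence can decrease if it starts at
    `initial`, never drops below `lower_bound`, and every decrease
    multiplies the value by at most (1 - eps).
    """
    if not (0.0 < lower_bound <= initial and 0.0 < eps < 1.0):
        raise ValueError("need 0 < lower_bound <= initial and 0 < eps < 1")
    count, value = 0, initial
    while value * (1.0 - eps) >= lower_bound:
        value *= 1.0 - eps
        count += 1
    return count

# Hypothetical parameters: xi_0 = 1, zeta = 0.1, eps_xi = 0.5, so the
# lower bound from statement (l) is 0.5 * zeta * (1 - eps_xi) = 0.025.
xi_0, zeta, eps_xi = 1.0, 0.1, 0.5
bound = max_decreases(xi_0, 0.5 * zeta * (1.0 - eps_xi), eps_xi)
```

The same counting argument applies to the merit parameter in statement (k) whenever $\{\tau_{k}\}$ is bounded away from zero; without such a lower bound the number of decreases can be infinite, which is the case $\{\tau_{k}\}\searrow 0$ .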

4.3 Convergence Guarantees

We now turn to prove convergence results under Assumption 4 below. Recalling the role of $\tfrac{1}{2}\zeta(1-\epsilon_{\xi})\in\mathbb{R}_{>0}$ in Lemma 13 $(l)$ , the assumption focuses on the following event for some $(k_{\min},\tau_{\min},f_{\sup})\in\mathbb{N}\times\mathbb{R}_{>0}\times\mathbb{R}$ , where for all generated $k\in\mathbb{N}$ we denote $\tau^{\rm true,trial}_{k}$ as the value of $\tau^{\rm trial}_{k}$ that would be computed in iteration $k$ if (8) were solved with $\nabla f(x_{k})$ in place of $g_{k}$ :

[TABLE]

The following assumption is made in this subsection. We present a discussion and supporting theoretical results about this assumption in Section 4.4.

Assumption 4.

For some $(k_{\min},\tau_{\min},f_{\sup})\in\mathbb{N}\times\mathbb{R}_{>0}\times\mathbb{R}$ , the event ${\cal E}:={\cal E}(k_{\min},\tau_{\min},f_{\sup})$ occurs and, conditioned on the occurrence of ${\cal E}$ , Assumption 1 holds (with the same constants as previously presented in (2) and (3)).

It is not a shortcoming of our analysis that Assumption 4, through the definition of ${\cal E}$ , assumes that $(i)$ an infinite number of iterations are performed, $(ii)$ the objective value is bounded above in iteration $k_{\min}$ , and $(iii)$ $\{\xi_{k}\}$ ultimately becomes a constant sequence with value at least $\tfrac{1}{2}\zeta(1-\epsilon_{\xi})\in\mathbb{R}_{>0}$ . After all: $(i)$ Lemma 13 shows that the only alternative to an infinite number of iterations being performed is that the algorithm terminates finitely with an infeasible stationary point, in which case there is nothing else to prove; $(ii)$ $f_{\sup}\in\mathbb{R}$ can be arbitrarily large and knowledge of it is not required by the algorithm, so assuming that it exists is a very loose requirement; and $(iii)$ Lemma 13 $(l)$ shows that, in any run that does not terminate finitely, $\{\xi_{k}\}$ is monotonically nonincreasing and bounded below by $\tfrac{1}{2}\zeta(1-\epsilon_{\xi})\in\mathbb{R}_{>0}$ , which is a constant, i.e., it is not run-dependent. Overall, the only important restriction of our analysis in this section is the fact that ${\cal E}$ includes the requirement that $\{\tau_{k}\}$ ultimately becomes constant at a value at least $\tau_{\min}$ that is sufficiently small relative to $\{\tau^{\rm true,trial}_{k}\}$ . This restriction is the subject of Section 4.4.

For the remainder of this subsection, we consider the stochastic process corresponding to the statement of Algorithm 1. Specifically, the sequence

[TABLE]

generated in any run can be viewed as a realization of the stochastic process

[TABLE]

Let ${\cal G}_{1}$ denote the $\sigma$ -algebra defined by the initial conditions of the algorithm and, for all $k\in\mathbb{N}$ with $k\geq 2$ , let ${\cal G}_{k}$ denote the $\sigma$ -algebra generated by the initial conditions and the random variables $\{G_{1},\dots,G_{k-1}\}$ . Then, with respect to the event ${\cal E}$ in Assumption 4, denote the trace $\sigma$ -algebra of ${\cal E}$ on ${\cal G}_{k}$ as ${\cal F}_{k}:={\cal G}_{k}\cap{\cal E}$ for all $k\in\mathbb{N}$ . It follows that $\{{\cal F}_{k}\}$ is a filtration, and we proceed in our analysis under Assumptions 2, 3, and 4 (which subsumes Assumption 1) with the definitions

[TABLE]

(where $P_{\omega}$ denotes probability taken with respect to the distribution of $\omega$ ). We also define, with respect to ${\cal E}$ , the random variables $K^{\prime}\leq k_{\min}$ , ${\cal T}^{\prime}\geq\tau_{\min}$ , and $\Xi^{\prime}\geq\tfrac{1}{2}\zeta(1-\epsilon_{\xi})$ , which for a given run of the algorithm have the realized values $k^{\prime}$ , $\tau^{\prime}$ , and $\xi^{\prime}$ , respectively, defined in (30). Conditioned on ${\cal E}$ , one has in any run that

[TABLE]

and one has that ${\cal T}^{\prime}$ and $\Xi^{\prime}$ are ${\cal F}_{k}$ -measurable for $k=k_{\min}\geq K^{\prime}$ .

Our next lemma shows upper bounds on the norm of the difference between the computed search direction and the search direction that would be computed with the true gradient of the objective. (The conclusion of this lemma and the following one would hold even without assuming that the event ${\cal E}$ occurs, but in each result we condition on ${\cal F}_{k}:={\cal G}_{k}\cap{\cal E}$ for use in our ultimate results under ${\cal E}$ .)

Lemma 14.

Suppose that Assumptions 2, 3, and 4 hold. For all $k\in\mathbb{N}$ ,

[TABLE]

Proof.

Consider arbitrary $k\in\mathbb{N}$ under the stated conditions. Lemma 12 and the obtuse angle lemma for projections [6, Proposition 1.1.9] imply

[TABLE]

Summing these inequalities yields

[TABLE]

Hence, by the Cauchy–Schwarz inequality, it follows that

[TABLE]

which shows under Assumption 3 that $\|D_{k}-D^{\rm true}_{k}\|_{2}\leq\zeta^{-1}\|G_{k}-\nabla f(X_{k})\|_{2}$ , as desired. Then, from this inequality, Assumption 2, and Jensen’s inequality, one has

[TABLE]

from which the remainder of the conclusion follows. ∎
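The first bound in Lemma 14 can be checked numerically in a simplified setting. The following is a minimal sketch, assuming only bound constraints ($x+d\geq 0$, no equality constraints) and a diagonal $H_{k}$ whose entries are at least $\zeta$, so that the subproblem solution has a closed form. The function names `direction`, `norm2`, and `check_bound` are hypothetical, and this is not the authors' implementation.

```python
import math
import random


def direction(x, g, h):
    """Minimize g^T d + 0.5 * d^T diag(h) d subject to x + d >= 0.

    With a diagonal Hessian the problem separates by coordinate, so each
    component is the unconstrained minimizer -g_i / h_i clipped at -x_i.
    """
    return [max(-xi, -gi / hi) for xi, gi, hi in zip(x, g, h)]


def norm2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))


def check_bound(n=5, trials=1000, noise=1.0, seed=0):
    """Return the largest observed value of zeta * ||d - d_true|| / ||g - grad||."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x = [rng.uniform(0.0, 1.0) for _ in range(n)]     # feasible iterate, x >= 0
        h = [rng.uniform(0.5, 2.0) for _ in range(n)]     # diagonal of H_k (assumed)
        zeta = min(h)                                     # smallest eigenvalue of H_k
        grad = [rng.gauss(0.0, 1.0) for _ in range(n)]    # stand-in for the true gradient
        g = [gi + rng.gauss(0.0, noise) for gi in grad]   # stochastic gradient estimate
        d, d_true = direction(x, g, h), direction(x, grad, h)
        err_d = norm2([a - b for a, b in zip(d, d_true)])
        err_g = norm2([a - b for a, b in zip(g, grad)])
        if err_g > 0.0:
            worst = max(worst, zeta * err_d / err_g)
    return worst
```

In this separable setting the ratio returned by `check_bound` should never exceed one, because clipping at $-x_{i}$ is a nonexpansive operation in each coordinate; the general case with linearized equality constraints relies on the projection argument in the proof.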

We now show an upper bound on the expected difference between inner products involving the true and stochastic gradients and the true and stochastic directions.

Lemma 15.

Suppose that Assumptions 2, 3, and 4 hold. For all $k\geq k_{\min}$ ,

[TABLE]

Proof.

Consider arbitrary $k\geq k_{\min}$ under the stated conditions. From the triangle and Cauchy–Schwarz inequalities and Lemma 14, it holds that

[TABLE]

which gives the first result. Then, for $k\geq k_{\min}$ , (9) and the equation above give

[TABLE]

which completes the proof. ∎

Our next lemma shows a lower bound on the true model reduction. In the lemma and our subsequent results, we define $J_{k}:=\nabla c(X_{k})^{T}$ for the sake of brevity.

Lemma 16.

Suppose that Assumptions 2, 3, and 4 hold. For all $k\geq k_{\min}$ ,

[TABLE]

Proof.

Consider arbitrary $k\geq k_{\min}$ under the stated conditions. By (9), the fact that ${\cal T}_{k}={\cal T}^{\prime}$ , and Assumption 3, the first desired conclusion is implied by

[TABLE]

If $\nabla f(X_{k})^{T}D^{\rm true}_{k}+\tfrac{1}{2}(D^{\rm true}_{k})^{T}H_{k}D^{\rm true}_{k}\leq 0$ , then the above holds due to Lemma 13 and the fact that $J_{k}D^{\rm true}_{k}=J_{k}V_{k}$ ; else, $\nabla f(X_{k})^{T}D^{\rm true}_{k}+\tfrac{1}{2}(D^{\rm true}_{k})^{T}H_{k}D^{\rm true}_{k}>0$ , in which case one finds from the conditions of the lemma, (10), and (11) that

[TABLE]

from which the displayed inequality above follows again. Finally, the remaining desired conclusion follows from the first conclusion, Lemma 13, and $J_{k}D^{\rm true}_{k}=J_{k}V_{k}$ . ∎

Next, we prove a critical upper bound on the expected value of the second term on the right-hand side of the upper bound proved in Lemma 11.

Lemma 17.

Suppose that Assumptions 2, 3, and 4 hold. For all $k\geq k_{\min}$ ,

[TABLE]

Proof.

For arbitrary $k\geq k_{\min}$ under the conditions, (13) and (15) yield

[TABLE]

Letting ${\cal P}_{k}$ denote the event that $\nabla f(X_{k})^{T}(D_{k}-D^{\rm true}_{k})\geq 0$ and letting ${\cal P}_{k}^{c}$ denote the event that $\nabla f(X_{k})^{T}(D_{k}-D^{\rm true}_{k})<0$, the law of total expectation and the fact that ${\cal T}^{\prime}$ and $\Xi^{\prime}$ are ${\cal F}_{k}$-measurable for $k\geq k_{\min}$ show that

[TABLE]

The Cauchy–Schwarz inequality and law of total expectation show that

[TABLE]

so from above, the Cauchy–Schwarz inequality, Assumption 4, and Lemma 14,

[TABLE]

which gives the desired conclusion. ∎

We now present, as a lemma, results pertaining to the asymptotic behavior of the model reductions generated by the algorithm. In the subsequent theorem after the lemma, these results will be translated in terms of quantities that, as seen in Section 4.1, can be connected to stationarity measures related to problem (1). We remark that the conditions of the lemma can be satisfied in a run-dependent manner if, every time the merit or ratio parameter is decreased, say in iteration $\hat{k}\in\mathbb{N}$ , the sequence $\{\beta_{k}\}$ is “restarted” such that with $\alpha^{\prime}=2(1-\eta)\xi_{\hat{k}}\tau_{\hat{k}}/(\tau_{\hat{k}}L+\Gamma)$ and some (run-independent) $\psi\in(0,1]$ one chooses $\beta_{k}=\beta=\psi\frac{\alpha^{\prime}}{2(1-\eta)(\alpha^{\prime}+\theta)}$ for part (a) of the lemma and $\beta_{k}=\frac{1}{k-\hat{k}+1}\psi\frac{\alpha^{\prime}}{2(1-\eta)(\alpha^{\prime}+\theta)}$ for part (b); such a scheme was described in [1] as well. Notice that in this situation, $\beta$ and $\{\beta_{k}\}_{k\geq\hat{k}}$ in parts (a) and (b), respectively, are random variables, but, importantly, they are ${\cal F}_{k}$ -measurable for $k\geq k_{\min}$ . Alternatively, one could choose $\{\beta_{k}\}$ using the same formulas, but with $\xi_{\min}$ and $\tau_{\min}$ in place of $\xi_{k}$ and $\tau_{k}$ , respectively, in the formula for $\alpha^{\prime}$ , in which case the choices are run-independent. The downside of relying on this latter situation is that it requires knowledge of $\xi_{\min}$ and $\tau_{\min}$ , which would not typically be known a priori. Hence, we analyze the former scheme, but use run-dependent bounds that, under ${\cal E}$ , are defined with respect to $\xi_{\min}$ and $\tau_{\min}$ (even though these values are unknown).
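The restart scheme for $\{\beta_{k}\}$ described above can be sketched in a few lines. This is only an illustration of the stated formulas; the function and argument names are hypothetical, with `lipschitz_f` and `lipschitz_c` standing for $L$ and $\Gamma$, and `k_hat` for the iteration $\hat{k}$ at which the merit or ratio parameter was last decreased.

```python
def alpha_prime(xi, tau, eta, lipschitz_f, lipschitz_c):
    """alpha' = 2 (1 - eta) xi tau / (tau L + Gamma)."""
    return 2.0 * (1.0 - eta) * xi * tau / (tau * lipschitz_f + lipschitz_c)


def beta_base(alpha, eta, theta, psi=1.0):
    """Constant choice for part (a): psi alpha' / (2 (1 - eta) (alpha' + theta))."""
    return psi * alpha / (2.0 * (1.0 - eta) * (alpha + theta))


def beta_diminishing(k, k_hat, alpha, eta, theta, psi=1.0):
    """Choice for part (b): the base value scaled by 1 / (k - k_hat + 1)."""
    return beta_base(alpha, eta, theta, psi) / (k - k_hat + 1)


# Example with eta = 0.5 and theta = 1e4 (the values used in Section 5),
# xi = 1, tau = 0.1, and assumed Lipschitz constants L = Gamma = 1.
a = alpha_prime(xi=1.0, tau=0.1, eta=0.5, lipschitz_f=1.0, lipschitz_c=1.0)
b = beta_base(a, eta=0.5, theta=1.0e4)
```

With these inputs $\alpha^{\prime}$ is about $0.09$ and the base value of $\beta$ is roughly $9\times 10^{-6}$, which shows how a large $\theta$ pushes the admissible $\beta$ toward zero.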

We also remark that for case (a) in the following lemma, the sequence $\{\rho_{k}\}$, which bounds the expected squared error in the stochastic gradient estimates, can be a constant sequence. However, for case (b), the relationship between $\{\rho_{k}\}$ and $\{\beta_{k}\}$ means that the expected squared error in the gradient estimates must vanish as $k\to\infty$. This requirement, which is stronger than the requirement for the equality-constraints-only case in [1], is needed to overcome the fact that in the presence of bound constraints the search directions can be biased estimates of their true counterparts.
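A one-variable example makes the bias concrete. The sketch below assumes a single bound constraint $x+d\geq 0$, $H_{k}=1$, and a two-point noise distribution; the names `direction_1d` and `expected_direction` are hypothetical and chosen only for this illustration.

```python
def direction_1d(x, g, h=1.0):
    """Minimize g*d + 0.5*h*d^2 subject to x + d >= 0 (one variable)."""
    return max(-x, -g / h)


def expected_direction(x, grad, noise_values):
    """Average the direction over equally likely noise values added to grad."""
    dirs = [direction_1d(x, grad + e) for e in noise_values]
    return sum(dirs) / len(dirs)


x, grad = 0.0, 0.0        # iterate sits on the bound, true gradient is zero
noise = [-1.0, 1.0]       # unbiased noise: the two values average to zero

# True direction is zero: there is nothing to gain from moving.
d_true = direction_1d(x, grad)

# The draw g = -1 yields d = 1, while g = +1 is blocked by the bound and
# yields d = 0, so the average direction is 0.5 rather than zero.
d_mean = expected_direction(x, grad, noise)
```

Although the gradient estimate is unbiased, the bound clips only one side of the noise, so the mean of the computed direction differs from the true direction. With equality constraints only, the direction depends linearly on the gradient estimate, so this effect does not arise.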

Lemma 18.

Under Assumptions 2, 3, and 4, suppose that $\{\rho_{k}\}$ is chosen such that there exists $\iota\in\mathbb{R}_{>0}$ with $\rho_{k}\leq\iota\beta_{k}^{2}$ for all $k\in\mathbb{N}$ with $k\geq k_{\min}$ , and define

[TABLE]

Then, with ${\cal A}^{\prime}$ defined in (32) and $\mathbb{E}[\cdot|{\cal E}]$ denoting total expectation over all realizations of the algorithm conditioned on event ${\cal E}$ , the following statements hold true.

(a)

if $\beta_{k}=\beta=\psi\frac{{\cal A}^{\prime}}{2(1-\eta)({\cal A}^{\prime}+\theta)}$ for some $\psi\in(0,1]$ for all $k\geq k_{\min}$ , then

[TABLE]

(b)

if $\sum_{k=k_{\min}}^{\infty}\beta_{k}=\infty$ , $\sum_{k=k_{\min}}^{\infty}\beta_{k}^{2}<\infty$ , and $\beta_{k}\leq\psi\frac{{\cal A}^{\prime}}{2(1-\eta)({\cal A}^{\prime}+\theta)}$ for some $\psi\in(0,1]$ for all $k\geq k_{\min}$ , it holds that

[TABLE]

Proof.

For arbitrary $k\geq k_{\min}$ under the conditions, it follows from Lemma 11, Lemma 16 (which shows $\Delta l(X_{k},{\cal T}_{k},\nabla f(X_{k}),D^{\rm true}_{k})\geq 0$ ), (32), the fact that ${\cal A}_{k}\geq{\cal A}_{k}^{\min}={\cal A}^{\prime}\beta_{k}$ , Lemma 17, the fact that ${\cal A}_{k}\leq{\cal A}_{k}^{\max}\leq{\cal A}_{k}^{\min}+\theta\beta_{k}=({\cal A}^{\prime}+\theta)\beta_{k}$ , Lemma 14, Lemma 15, and $\beta_{k}\in(0,1]$ that

[TABLE]

where $R^{\prime}=({\cal A}^{\prime}+\theta){\cal T}^{\prime}\zeta^{-1}(\kappa_{\nabla f}\sqrt{\iota}+(1-\eta)(\iota+\kappa_{\nabla f}\sqrt{\iota}))$ . Now, from Assumption 4 (which subsumes Assumption 1), there exists $\phi_{\min}\in\mathbb{R}$ such that $\phi(X_{k},{\cal T}^{\prime})\geq\phi_{\min}$ for all $k\geq k_{\min}$ . One also finds that $\alpha_{\min}^{\prime}\leq{\cal A}^{\prime}\leq\alpha_{\max}^{\prime}$ due to the monotonicity of $\frac{2(1-\eta)\Xi^{\prime}\tau}{\tau L+\Gamma}$ with respect to $\tau$ . Therefore, under part $(a)$ of the lemma, in which case one finds for $k\geq k_{\min}$ that $\psi\frac{\alpha_{\min}^{\prime}}{2(1-\eta)(\alpha_{\min}^{\prime}+\theta)}\leq\beta\leq\psi\frac{\alpha_{\max}^{\prime}}{2(1-\eta)(\alpha_{\max}^{\prime}+\theta)}$ , it follows from above that

[TABLE]

so by taking total expectation conditioned on the event ${\cal E}$ one finds

[TABLE]

Rearranging terms, observing that $\mathbb{E}[\phi(X_{k_{\min}},{\cal T}^{\prime})|{\cal E}]$ is bounded above under Assumption 4, and considering the limit superior as $k\to\infty$ , the conclusion of part $(a)$ follows. On the other hand, under the conditions of part $(b)$ , it follows in a similar manner that, for any $k\in\mathbb{N}$ , one finds

[TABLE]

Taking limits as $k\to\infty$ , the conclusion of part $(b)$ follows. ∎

We now present our main convergence theorem for Algorithm 1, which is essentially a translation of Lemma 18 from results about model reductions to results about quantities connected to measures of stationarity for problem (1).

Theorem 1.

Suppose the conditions of Lemma 18 hold. Then,

(a)

under the conditions of Lemma 18(a), there exists $C\in\mathbb{R}_{>0}$ such that

[TABLE]

(b)

under the conditions of Lemma 18(b), with $B_{k}:=\sum_{j=k_{\min}}^{k_{\min}+k-1}\beta_{j}$ ,

[TABLE]

which further implies $\liminf_{k\to\infty}\ \mathbb{E}[\|D^{\rm true}_{k}\|_{2}^{2}+(\|c(X_{k})\|_{2}-\|c(X_{k})+J_{k}D^{\rm true}_{k}\|_{2})|{\cal E}]=0$ .

Proof.

The desired conclusions follow from Lemmas 16 and 18. ∎

One might be able to strengthen the conclusion in Theorem 1(b), say to an almost-sure convergence guarantee; see, e.g., [8]. However, we are satisfied with Theorem 1(b), which is sufficient for revealing the favorable properties of Algorithm 1 under Assumptions 2, 3, and 4. Theorem 1(a) shows under Assumptions 2, 3, and 4 that if the latter condition in (6) holds with $\rho_{k}=\rho$ for some $\rho\in\mathbb{R}_{>0}$ for all $k\in\mathbb{N}$ and $\{\beta_{k}\}=\{\beta\}$ is chosen as a (sufficiently small) constant sequence, then the limit superior of the expectation of the average of quantities connected to stationarity measures for problem (1) is bounded above by a constant proportional to $\beta$. Intuitively, this shows that the iterates generated by the algorithm ultimately remain in a region in which these stationarity measures are small. On the other hand, Theorem 1(b) shows under Assumption 4 that if $\{\rho_{k}\}$ and $\{\beta_{k}\}$ vanish with $\rho_{k}={\cal O}(\beta_{k}^{2})$, then a subsequence of iterates exists over which the expected values of these stationarity measures vanish. As seen in Lemma 3, if there exists a subsequence of iterates, say indexed by ${\cal S}\subseteq\mathbb{N}$, that converges to a point satisfying certain regularity conditions, then $\{\|c_{k}\|_{2}-\|c_{k}+\nabla c(x_{k})^{T}v_{k}\|_{2}\}_{k\in{\cal S}}\to 0$ means that the limit point is stationary with respect to the problem to minimize $\tfrac{1}{2}\|c(x)\|_{2}^{2}$ subject to $x\in\mathbb{R}^{n}_{\geq 0}$. Similarly, as seen in Lemma 5, if there exists such a subsequence and the limit point is feasible with respect to problem (1), then $\{d^{\rm true}_{k}\}_{k\in{\cal S}}\to 0$ means that the limit point is stationary with respect to (1). These situations are not guaranteed to occur, but this discussion shows that Theorem 1 is meaningful.
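The step from a vanishing weighted average to a vanishing limit inferior, as in Theorem 1(b), is a standard argument. As a sketch, let $a_{j}\geq 0$ denote the expected stationarity quantity at iteration $j$ (a shorthand introduced here, not the paper's notation), treat $\{\beta_{j}\}$ as deterministic for simplicity, and assume the displayed result in part (b) says that the $\beta$-weighted average of $\{a_{j}\}$ tends to zero. Suppose, to reach a contradiction, that $\liminf_{j\to\infty}a_{j}>\epsilon>0$, so that there exists $j_{0}\geq k_{\min}$ with $a_{j}\geq\epsilon$ for all $j\geq j_{0}$. Then, for all sufficiently large $k$,

```latex
\frac{1}{B_{k}}\sum_{j=k_{\min}}^{k_{\min}+k-1}\beta_{j}a_{j}
\;\geq\; \frac{\epsilon}{B_{k}}\sum_{j=j_{0}}^{k_{\min}+k-1}\beta_{j}
\;=\; \epsilon\left(1-\frac{1}{B_{k}}\sum_{j=k_{\min}}^{j_{0}-1}\beta_{j}\right)
\;\xrightarrow[k\to\infty]{}\; \epsilon .
```

The limit uses $B_{k}\to\infty$, which follows from $\sum_{k=k_{\min}}^{\infty}\beta_{k}=\infty$. The left-hand side therefore stays bounded away from zero, contradicting the assumed form of part (b), so the limit inferior must be zero.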

4.4 Non-vanishing Merit Parameter

Our main convergence result in the previous section, namely, Theorem 1, requires Assumption 4, which in turn requires that the merit parameter sequence ultimately becomes a sufficiently small, positive constant sequence. (Recall the discussion after Assumption 4.) To show that this corresponds to a realistic event for practical purposes, we next show conditions under which one finds that the merit parameter would not vanish.

We begin by showing a generally applicable result about the solution of (7). It is related to that in Lemma 3, but is stronger due to an additional assumption.

Lemma 19.

Suppose the conditions of Lemma 3 hold and there exists $\kappa_{w}\in[0,1)$ such that for all generated $k\in\mathbb{N}$ in any run of the algorithm one has $\|c_{k}+\nabla c(x_{k})^{T}v_{k}\|_{2}\leq\kappa_{w}\|c_{k}\|_{2}$ . Then, there exists $\kappa_{v}\in\mathbb{R}_{>0}$ such that, in any run of the algorithm such that iteration $k\in\mathbb{N}$ is reached, one finds

[TABLE]

Proof.

Consider an arbitrary run of the algorithm in which the conditions of the lemma hold and iteration $k\in\mathbb{N}$ is reached. If $c_{k}=0$ , then it follows by construction of (7) that $v_{k}=0$ , in which case (33) follows trivially. Hence, we may proceed under the assumption that $c_{k}\neq 0$ , which by the conditions of the lemma, Assumption 1 (see (2)), and the triangle inequality gives

[TABLE]

Consequently, from (21), (22), and a similar derivation as in Lemma 3, one finds

[TABLE]

from which the desired conclusion in (33) follows. ∎

We now show that, under common conditions and when the norm of the stochastic gradient estimate is bounded uniformly, the denominator of the formula for $\tau^{\rm trial}_{k}$ in (10) is bounded proportionally to $\|v_{k}\|_{2}$ .

Lemma 20.

Suppose that Assumptions 1 and 3 hold, and that there exists $(\lambda,\mu,\kappa_{g})\in\mathbb{R}_{>0}\times\mathbb{R}_{>0}\times\mathbb{R}_{>0}$ such that for all generated $k\in\mathbb{N}$ in any run of the algorithm one has $\nabla c(x_{k})^{T}\nabla c(x_{k})\succeq\lambda I$ , $\mu_{k}\geq\mu$ , and $\|g_{k}\|_{2}\leq\kappa_{g}$ . Then, there exists $\kappa_{g,H}\in\mathbb{R}_{>0}$ such that, in any run such that iteration $k\in\mathbb{N}$ is reached, one finds

[TABLE]

Proof.

Consider an arbitrary run in which the conditions of the lemma hold and iteration $k\in\mathbb{N}$ is reached. By Lemma 13, $(u,w)=(0,0)$ is feasible for (7), so

[TABLE]

Since $\tfrac{1}{2}\|c_{k}+\nabla c(x_{k})^{T}\nabla c(x_{k})w_{k}\|_{2}^{2}\leq\tfrac{1}{2}\|c_{k}\|_{2}^{2}$ , it follows that

[TABLE]

which along with Assumption 1 (see (2)) shows that

[TABLE]

On the other hand, since $\tfrac{1}{2}\mu_{k}\|u_{k}\|_{2}^{2}\leq\tfrac{1}{2}\|c_{k}\|_{2}^{2}$ , it follows under Assumption 1 that $\|u_{k}\|_{2}\leq\tfrac{1}{\sqrt{\mu_{k}}}\|c_{k}\|_{2}\leq\tfrac{1}{\sqrt{\mu}}\kappa_{c}$ . Therefore, overall, it follows that

[TABLE]

Now, since $v_{k}=\nabla c(x_{k})w_{k}+u_{k}$ is a feasible solution of (8) while $d_{k}$ is the optimal solution of (8), it follows under the conditions of the lemma that

[TABLE]

which leads to the desired conclusion. ∎

We now prove conditions under which the merit parameter does not vanish.

Theorem 2.

Suppose that Assumptions 1 and 3 hold, and that there exists $(\lambda,\mu,\kappa_{g},\kappa_{w})\in\mathbb{R}_{>0}\times\mathbb{R}_{>0}\times\mathbb{R}_{>0}\times[0,1)$ such that for all generated $k\in\mathbb{N}$ in any run of the algorithm one has $\nabla c(x_{k})^{T}\nabla c(x_{k})\succeq\lambda I$ , $\mu_{k}\geq\mu$ , $\|g_{k}\|_{2}\leq\kappa_{g}$ , and $\|c_{k}+\nabla c(x_{k})^{T}v_{k}\|_{2}\leq\kappa_{w}\|c_{k}\|_{2}$ . Then, in any run that does not terminate finitely, the latter event in Lemma 13 $(k)$ occurs $($ i.e., $\{\tau_{k}\}$ does not vanish $)$ with $\tau_{\min}\geq\tfrac{(1-\sigma)\kappa_{v}}{\kappa_{g,H}}(1-\epsilon_{\tau})$ .

Proof.

Consider arbitrary $k\in\mathbb{N}$ in a run that does not terminate finitely and note that if $d_{k}=0$ or $g_{k}^{T}d_{k}+\tfrac{1}{2}d_{k}^{T}H_{k}d_{k}\leq 0$, then $\tau^{\rm trial}_{k}\leftarrow\infty$, and otherwise $\tau^{\rm trial}_{k}$ is set by (10). Hence, under the conditions of the theorem and by Lemmas 19–20,

[TABLE]

Consequently, by the merit parameter update in (11), $\tau_{k}<\tau_{k-1}$ only if $\tau_{k-1}>\tau_{*}$ . This, along with Lemma 13 $(d)$ – $(e)$ , leads to the conclusion. ∎
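The mechanism behind Theorem 2 can be illustrated with a small simulation. The update (11) is not reproduced in this section, so the sketch below assumes a rule of the kind used in [1]: the merit parameter is kept when it is at most the trial value and is otherwise set to $(1-\epsilon_{\tau})$ times the trial value. The names `update_tau`, `simulate`, and `tau_star` are hypothetical, with `tau_star` playing the role of the lower bound $\tau_{*}$ on the trial values.

```python
import math
import random


def update_tau(tau_prev, tau_trial, eps_tau):
    """Assumed merit parameter update (a rule of the kind used in [1]).

    Keep tau if it is already at most the trial value; otherwise shrink it
    to (1 - eps_tau) * tau_trial, which is strictly below the trial value.
    """
    if tau_prev <= tau_trial:
        return tau_prev
    return (1.0 - eps_tau) * tau_trial


def simulate(tau0, tau_star, eps_tau, iters=1000, seed=0):
    """Run the update with random trial values that never fall below tau_star."""
    rng = random.Random(seed)
    taus = [tau0]
    for _ in range(iters):
        if rng.random() < 0.3:
            # Models d_k = 0 or a nonpositive quadratic model value.
            trial = math.inf
        else:
            trial = tau_star * (1.0 + rng.expovariate(1.0))
        taus.append(update_tau(taus[-1], trial, eps_tau))
    return taus
```

Under this assumed rule, every decrease lands at a value of at least $(1-\epsilon_{\tau})\tau_{*}$, so the simulated sequence is nonincreasing and bounded below by the smaller of $\tau_{0}$ and $(1-\epsilon_{\tau})\tau_{*}$, which mirrors the form of the bound on $\tau_{\min}$ in Theorem 2.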

Since $\nabla f$ is bounded in norm over the set ${\cal X}$ in Assumption 1, Theorem 2 shows that, amongst the other stated conditions, if $\|g_{k}-\nabla f(x_{k})\|_{2}$ is bounded uniformly over all $k\in\mathbb{N}$ in any run, then the merit parameter sequence always remains bounded below by a positive number. Under such conditions, the only potentially poor behavior of the merit parameter sequence is that, in a given run, it ultimately remains constant at a value that is too large. We claim that, under certain assumptions about the distribution of the stochastic gradient estimates, this behavior can be shown to occur with probability zero. (We do not prove such a result here, but refer the interested reader to Proposition 3.16 in [1] to see such a result for the equality-constraints-only setting, in which case the behavior of the merit parameter is similar.) On the other hand, if $\|g_{k}-\nabla f(x_{k})\|_{2}$ is not bounded uniformly in this manner, then it is possible for the merit parameter sequence to vanish unnecessarily. This issue is one that should be noted by a user of the algorithm. In particular, if in a run of the algorithm one chooses $\mu_{k}\geq\mu$ for some $\mu\in\mathbb{R}_{>0}$ for all $k\in\mathbb{N}$ and one finds for some $(\lambda,\kappa_{w})\in\mathbb{R}_{>0}\times\mathbb{R}_{>0}$ that generated $k\in\mathbb{N}$ yield $\nabla c(x_{k})^{T}\nabla c(x_{k})\succeq\lambda I$ and $\|c_{k}+\nabla c(x_{k})^{T}v_{k}\|_{2}\leq\kappa_{w}\|c_{k}\|_{2}$, yet $\tau_{k}$ has become exceedingly small, then Theorem 2 shows that this must be due to the stochastic gradient estimates tending to become significantly large in norm, in which case the performance of the algorithm may improve with more accurate stochastic gradient estimates.

4.5 Deterministic Algorithm

We conclude this section with a statement of a convergence result that we claim to hold for Algorithm 1 if it were to be run with $g_{k}=\nabla f(x_{k})$ for all $k\in\mathbb{N}$ . Due to space considerations, we do not provide a proof of the result, although we offer the proposition for reference for the reader and claim that it holds from results proved in this paper for the stochastic setting as well as other similar results for SQP methods for deterministic continuous nonlinear optimization.

Proposition 1.

Suppose Assumptions 1 and 3 hold and Algorithm 1 is run with $g_{k}=\nabla f(x_{k})$ for all $k\in\mathbb{N}$ . If for all large $k\in\mathbb{N}$ there exists $\kappa_{w}\in[0,1)$ such that $\|c_{k}+\nabla c(x_{k})^{T}v_{k}\|_{2}\leq\kappa_{w}\|c_{k}\|_{2}$ , then $\{x_{k}\}\subset\mathbb{R}^{n}_{\geq 0}$ , $\{\tau_{k}\}$ is bounded away from zero, and, with $y_{k}\in\mathbb{R}^{m}$ and $z_{k}\in\mathbb{R}^{n}_{\geq 0}$ defined as the optimal multipliers corresponding to the solution of subproblem (8) for all $k\in\mathbb{N}$ , it follows that

[TABLE]

Otherwise, $\{x_{k}\}\subset\mathbb{R}^{n}_{\geq 0}$ , $\{\min\{\nabla c(x_{k})c_{k},0\}\}\to 0$ , and $\{|x_{k}^{T}\nabla c(x_{k})c_{k}|\}\to 0$ , and if the sequence $\{\tau_{k}\}$ is bounded away from zero, then

[TABLE]

5 Numerical Results

In this section, we provide results demonstrating the performance of a MATLAB implementation of Algorithm 1 when solving a subset of problems from CUTEst [22], where Gurobi is used to solve the arising subproblems [23]. The purpose of these experiments is to compare this performance against that of the Julia implementation provided by the authors of [31, Algorithm 1]. From all inequality-constrained problems in CUTEst, we selected those such that (i) $m\leq n\leq 1000$ , (ii) $f(x_{k})\geq-10^{20}$ for all $k\in\mathbb{N}$ in all runs of our algorithm, and (iii) Gurobi did not report any errors. This resulted in a set of 323 test problems.

For each test problem, both codes used the same initial iterate and generated stochastic gradient estimates in the same manner. Specifically, for all $k\in\mathbb{N}$ in each run, the codes set $g_{k}={\cal N}(\nabla f(x_{k}),\epsilon_{g}(I+ee^{T}))$ , where $e$ is the all-ones vector and $\epsilon_{g}\in\{10^{-8},10^{-4},10^{-2},10^{-1}\}$ was fixed for each run (see below). If a problem had only inequality constraints, i.e., $m=0$ , then our code explicitly computed $\alpha_{k}^{\varphi}$ (as defined in (15)) and set $\alpha_{k}\leftarrow\alpha_{k}^{\max}$ for all $k\in\mathbb{N}$ . Otherwise, the code set $\alpha_{k}\leftarrow\min\{1,(1.1)^{t_{k}}\alpha_{k}^{\min},\alpha_{k}^{\min}+\theta\beta_{k}\}$ , where $t_{k}\leftarrow\max\{t\in\mathbb{N}:\varphi_{k}((1.1)^{t}\alpha_{k}^{\min})\leq 0\}$ . This guarantees that $\alpha_{k}\in[\alpha_{k}^{\min},\alpha_{k}^{\max}]$ for all $k\in\mathbb{N}$ . The other user-defined parameters of Algorithm 1 were selected as $\sigma=\tau_{0}=0.1$ , $\eta=0.5$ , $\xi_{0}=1$ , $\epsilon_{\tau}=\epsilon_{\xi}=10^{-2}$ , $\theta=10^{4}$ , $\mu_{k}=\max\{10^{-8},10^{-4}\|c_{k}\|_{2}^{2}\}$ , $\beta_{k}=1$ , and $H_{k}=I$ for all $k\in\mathbb{N}$ . The Lipschitz constants $L$ and $\Gamma$ were estimated every 100 iterations by differences of stochastic gradients at ten samples around the current iterate. Meanwhile, we ran the Julia code for [31, Algorithm 1] with the AdapGD option and its default parameter settings as described in [31, Section 4]. Each code terminated as soon as $10^{4}$ stochastic gradient samples were evaluated or a 12-hour CPU time limit was reached.
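The noise model above, a Gaussian with mean $\nabla f(x_{k})$ and covariance $\epsilon_{g}(I+ee^{T})$, can be sampled without forming the covariance matrix. The following is a minimal sketch; the function name `sample_gradient` is hypothetical and this is not the code used in the experiments.

```python
import math
import random


def sample_gradient(grad, eps_g, rng):
    """Draw g from N(grad, eps_g * (I + e e^T)), with e the all-ones vector.

    If z ~ N(0, I) and s ~ N(0, 1) are independent, then z + s * e has
    covariance I + e e^T, so grad + sqrt(eps_g) * (z + s * e) has the
    required mean and covariance.
    """
    scale = math.sqrt(eps_g)
    shared = rng.gauss(0.0, 1.0)   # s: a single draw shared by all coordinates
    return [gi + scale * (rng.gauss(0.0, 1.0) + shared) for gi in grad]


# Example: one noisy estimate of a hypothetical gradient at noise level 1e-2.
rng = random.Random(0)
g = sample_gradient([1.0, -2.0, 0.5], 1e-2, rng)
```

Each coordinate of the estimate has variance $2\epsilon_{g}$ and every pair of coordinates has covariance $\epsilon_{g}$, so the errors are positively correlated across coordinates.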

Let $\texttt{FeasErr}(x)$ be the $\infty$ -norm constraint violation at $x$ and let $\texttt{KKTErr}(x,y,z)$ be the $\infty$ -norm violation of the KKT conditions (recall (4)) at a primal-dual iterate $(x,y,z)$ . Each run of Algorithm 1 generates $\{x_{k}\}\subset\mathbb{R}^{n}$ . For each $k\in\mathbb{N}$ , let $y_{k}^{\rm true}\in\mathbb{R}^{m}$ and $z_{k}^{\rm true}\in\mathbb{R}^{n}$ denote the optimal Lagrange multipliers corresponding to the equality and inequality constraints when (8) is solved with $g_{k}=\nabla f(x_{k})$ . For each run of Algorithm 1, we determined the best iterate as $x_{k_{\texttt{best}}}$ where

[TABLE]

We determined the best iterate in a run of [31, Algorithm 1] using the same formula with the sequence of iterates and Lagrange multiplier estimates that are computed as part of the algorithm. Our results for four noise levels, provided in Figure 1 below, are presented in terms of $\texttt{FeasErr}(x_{k_{\texttt{best}}})$ as the feasibility error and $\texttt{KKTErr}(x_{k_{\texttt{best}}},y_{{k_{\texttt{best}}}}^{\rm true},z_{{k_{\texttt{best}}}}^{\rm true})$ as the KKT error for each run of both algorithms.

Since the Julia code for [31, Algorithm 1] is only set up to solve CUTEst problems without simple bound constraints, the results in Figure 1 are presented in two parts. For the 57 problems for which both algorithms were set up to run, the first two box plots show the best feasibility and KKT errors achieved by both codes, where each problem is run five times (since the behaviors of the algorithms are stochastic). In the third box plot, we report the best feasibility and KKT errors obtained by our MATLAB code on the remaining $266$ $(=323-57)$ problems, again with five runs for each problem. Overall, one finds that the performance of our algorithm is comparatively good in this experimental set-up. The best feasibility and KKT errors are relatively low for our algorithm, although the errors increase with the noise level, as may be expected. Experiments with diminishing step sizes also showed favorable performance for our algorithm; these results are omitted due to page limit restrictions.

6 Conclusion

We have proposed, analyzed, and tested an algorithm for solving continuous optimization problems. The algorithm requires that constraint function and derivative values can be computed in each iteration, but does not require exact objective function and derivative values; rather, the algorithm merely requires that a stochastic objective gradient estimate is computed to satisfy relatively loose assumptions in each iteration. The theoretical convergence guarantees of the algorithm require knowledge of Lipschitz constants for the objective gradient and constraint Jacobian, although in practice these constants can be estimated. Our numerical experiments show that our proposed algorithm can outperform an alternative algorithm that relies on the ability to compute more accurate gradient estimates. We have provided comments throughout the paper on how the assumptions that are required for our theoretical convergence guarantees might be loosened further.

Acknowledgements

The authors are grateful to Sen Na for providing consultation about the Julia implementation provided by the authors of [31, Algorithm 1]. This material is based upon work supported by the U.S. NSF under award CCF-2139735 and by the Office of Naval Research under award N00014-21-1-2532.

Bibliography

[1] A. S. Berahas, F. E. Curtis, D. Robinson, and B. Zhou. Sequential quadratic optimization for nonlinear equality constrained stochastic optimization. SIAM Journal on Optimization, 31(2):1352–1379, 2021.
[2] Albert S Berahas, Raghu Bollapragada, and Baoyu Zhou. An adaptive sampling sequential quadratic programming method for equality constrained stochastic optimization. arXiv preprint arXiv:2206.00712, 2022.
[3] Albert S Berahas, Frank E Curtis, Michael J O’Neill, and Daniel P Robinson. A stochastic sequential quadratic optimization algorithm for nonlinear equality constrained optimization with rank-deficient Jacobians. arXiv preprint arXiv:2106.13015, 2021.
[4] Albert S Berahas, Jiahao Shi, Zihong Yi, and Baoyu Zhou. Accelerating stochastic sequential quadratic programming for equality constrained optimization using predictive variance reduction. arXiv preprint arXiv:2204.04161, 2022.
[5] Albert S Berahas, Miaolan Xie, and Baoyu Zhou. A sequential quadratic programming method with high probability complexity bounds for nonlinear equality constrained stochastic optimization. arXiv preprint arXiv:2301.00477, 2023.
[6] Dimitri Bertsekas. Convex Optimization Theory, volume 1. Athena Scientific, 2009.
[7] Dimitri P. Bertsekas. Network optimization: continuous and discrete models, volume 8. Athena Scientific, 1998.
[8] Dimitri P. Bertsekas and John N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.
