Inexact restoration with subsampled trust-region methods for finite-sum minimization

Stefania Bellavia; Natasa Krejic; Benedetta Morini

arXiv:1902.01710·math.OC·May 12, 2020·Comput. Optim. Appl.

TL;DR

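This paper introduces a new trust-region method for finite-sum minimization that approximates the objective function, gradient, and Hessian by random subsampling, while the sample size is chosen by a deterministic rule based on inexact restoration; it has a lower overall computational cost than the standard trust-region scheme with subsampled Hessians.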

Contribution

It proposes a novel inexact restoration-based trust-region approach with deterministic sample size control for better computational efficiency.

Findings

01

More efficient than standard trust-region methods with subsampled Hessians.

02

Analyses local and global properties for finding approximate first- and second-order optimal points.

03

Needs at most $O(\varepsilon^{-2})$ evaluations of the possibly subsampled function and its derivatives to compute a first-order approximate critical point.

Abstract

Convex and nonconvex finite-sum minimization arises in many scientific computing and machine learning applications. Recently, first-order and second-order methods where objective functions, gradients and Hessians are approximated by randomly sampling components of the sum have received great attention. We propose a new trust-region method which employs suitable approximations of the objective function, gradient and Hessian built via random subsampling techniques. The choice of the sample size is deterministic and ruled by the inexact restoration approach. We discuss local and global properties for finding approximate first- and second-order optimal points and function evaluation complexity results. Numerical experience shows that the new procedure is more efficient, in terms of overall computational cost, than the standard trust-region scheme with subsampled Hessians.

Tables

Table 1: Data sets used

Data set	Training set: $N$	Training set: $n$	Testing set: $N_{T}$
Mushrooms [24]	5000	112	3124
Cina0 [14]	10000	132	6033
Gisette [24]	5000	5000	1000
A9a [24]	22793	123	9768
Covertype [24]	464810	54	116202
Ijcnn1 [15]	49990	22	91701
Mnist [23]	60000	784	10000
Htru2 [24]	10000	8	7898

Table 2: Function evaluations (nfe) performed by iretr_d, iretr_gg, statr_sh and statr_fh, and saving obtained by iretr_d over iretr_gg, statr_sh and statr_fh.

Data set	nfe iretr_d	nfe (save) iretr_gg	nfe (save) statr_sh	nfe (save) statr_fh
Mushrooms	27	30 (10%)	51 (47%)	108 (75%)
Cina0	88	99 (11%)	96 (08%)	416 (78%)
Gisette	346	362 (04%)	432 (20%)	594 (42%)
A9a	22	25 (12%)	45 (51%)	445 (95%)
Covertype	17	23 (26%)	48 (65%)	698 (98%)
Ijcnn1	20	25 (20%)	36 (44%)	128 (84%)
Mnist	46	50 (08%)	58 (20%)	955 (95%)
Htru2	38	37 (-3%)	43 (12%)	87 (56%)

Equations

$$\min_{x\in\mathbb{R}^{n}} f_{N}(x)=\frac{1}{N}\sum_{i=1}^{N}\phi_{i}(x),$$

$$\min_{x\in\mathbb{R}^{n}} f_{M}(x)=\frac{1}{M}\sum_{i\in I_{M}}\phi_{i}(x),\quad\mbox{s.t. } M=N,$$

$$I_{M}\subseteq\{1,\ldots,N\},\quad |I_{M}|=M,\quad M\geq 1,$$

$$\underline{h}\leq h(M)\ \mbox{ if }\ 0<M<N,\quad\mbox{and}\quad h(M)\leq\bar{h}\ \mbox{ if }\ 0<M\leq N,$$

$$\Psi(x,M,\theta)=\theta f_{M}(x)+(1-\theta)h(M),$$

$$m_{k}(p)=f_{N_{k+1}}(x_{k})+\nabla f_{N_{k+1}}(x_{k})^{T}p+\frac{1}{2}p^{T}B_{k+1}p,$$

$$\min_{\|p\|\leq\Delta_{k}} m_{k}(p).$$

$$p_{k}^{C}=\operatorname*{argmin}_{\substack{p=-t\nabla f_{N_{k+1}}(x_{k}),\ t>0\\ \|p\|\leq\Delta_{k}}} m_{k}(p).$$

$$\|\nabla f_{N_{k+1}}(x_{k})\|\leq\varepsilon_{g}\quad\mbox{and}\quad N_{k}=N,$$

$$h(\widetilde{N}_{k+1})\leq r\,h(N_{k}).$$

$$h(N_{k+1})-h(\widetilde{N}_{k+1})$$

$$m_{k}(0)-m_{k}(p_{k})$$

$$f_{N}(x_{k})-m_{k}(p_{k})$$

$$\theta_{k+1}=\begin{cases}\theta_{k} & \mbox{if }\ {\rm Pred}_{k}(\theta_{k})\geq\eta\,(h(N_{k})-h(\widetilde{N}_{k+1})),\\[4pt] \dfrac{(1-\eta)(h(N_{k})-h(\widetilde{N}_{k+1}))}{m_{k}(p_{k})-f_{N_{k}}(x_{k})+h(N_{k})-h(\widetilde{N}_{k+1})} & \mbox{otherwise.}\end{cases}$$

$${\rm Ared}_{k}(\theta_{k+1})\geq\eta\,{\rm Pred}_{k}(\theta_{k+1}),$$

$${\rm Pred}_{k}(\theta)$$

$${\rm Ared}_{k}(\theta)$$

$${\rm Pred}_{k}(\theta_{k+1})\geq\eta\,(h(N_{k})-h(\widetilde{N}_{k+1}))\geq 0,$$

$$\theta_{k}\left(f_{N_{k}}(x_{k})-m_{k}(p_{k})-(h(N_{k})-h(\widetilde{N}_{k+1}))\right)<(\eta-1)(h(N_{k})-h(\widetilde{N}_{k+1})),$$

$$f_{N_{k}}(x_{k})-m_{k}(p_{k})-(h(N_{k})-h(\widetilde{N}_{k+1}))<0.$$

$$\theta\left(f_{N_{k}}(x_{k})-m_{k}(p_{k})-(h(N_{k})-h(\widetilde{N}_{k+1}))\right)\geq(\eta-1)(h(N_{k})-h(\widetilde{N}_{k+1})),$$

$$\theta\leq\theta_{k+1}\stackrel{\rm def}{=}\frac{(1-\eta)(h(N_{k})-h(\widetilde{N}_{k+1}))}{m_{k}(p_{k})-f_{N_{k}}(x_{k})+h(N_{k})-h(\widetilde{N}_{k+1})}.$$

$$\kappa_{\phi}=\max_{\substack{1\leq i\leq N\\ x\in\Omega}}|\phi_{i}(x)|.$$

$$f_{N}(x_{k})-f_{M}(x_{k})$$

$$|f_{N}(x_{k})-f_{M}(x_{k})|$$

$$h(N_{k})-h(\widetilde{N}_{k+1})\geq(1-r)h(N_{k})\geq(1-r)\underline{h}.$$

$$\begin{aligned}
m_{k}(p_{k})-f_{N_{k}}(x_{k})+h(N_{k})-h(\widetilde{N}_{k+1}) &\leq m_{k}(p_{k})-f_{N_{k}}(x_{k})+h(N_{k})\\
&\leq m_{k}(0)-f_{N_{k}}(x_{k})+\bar{h}=f_{N_{k+1}}(x_{k})-f_{N_{k}}(x_{k})+\bar{h}\\
&\leq |f_{N_{k+1}}(x_{k})-f_{N}(x_{k})|+|f_{N}(x_{k})-f_{N_{k}}(x_{k})|+\bar{h}\\
&\leq \sigma\,(h(N_{k})+h(N_{k+1}))+\bar{h}\leq(2\sigma+1)\bar{h},
\end{aligned}$$

$$\theta_{k+1}\geq\frac{(1-\eta)(1-r)\underline{h}}{(2\sigma+1)\bar{h}}\stackrel{\rm def}{=}\underline{\theta},$$

$$\|B_{k+1}\|\leq\kappa_{B}.$$


Full text

Inexact restoration with subsampled trust-region methods for finite-sum minimization

Stefania Bellavia, Nataša Krejić, Benedetta Morini

Funding: The work of Bellavia and Morini was supported by Gruppo Nazionale per il Calcolo Scientifico (GNCS-INdAM) of Italy. The work of the second author was supported by Serbian Ministry of Education, Science and Technological Development, grant no. 174030. Part of the research was conducted during a visit of the second author at Dipartimento di Ingegneria Industriale supported by Piano di Internazionalizzazione, Università degli Studi di Firenze.

Affiliations: S. Bellavia and B. Morini, Dipartimento di Ingegneria Industriale, Università degli Studi di Firenze, Viale G.B. Morgagni 40, 50134 Firenze, Italia; members of the INdAM Research Group GNCS. N. Krejić, Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, Trg Dositeja Obradovića 4, 21000 Novi Sad, Serbia.

**Keywords:** inexact restoration, trust-region methods, subsampling, local and global convergence, worst-case evaluation complexity.

1 Introduction

The problem we consider in this paper is the following

$$\min_{x\in\mathbb{R}^{n}} f_{N}(x)=\frac{1}{N}\sum_{i=1}^{N}\phi_{i}(x), \tag{1}$$

where $N$ is very large and finite and $\phi_{i}:\mathbb{R}^{n}\rightarrow\mathbb{R}$. A number of important problems can be stated in this form, for example machine learning problems such as classification and data fitting, or the sample average approximation of an objective function given as a mathematical expectation.

The practical relevance of (1) has resulted in a number of methods tailored to this particular form of the objective function. For very large $N$ the cost of evaluating $f_{N}$ can be high, and the same is true for the gradient and even more so for the Hessian. Therefore a number of methods that use approximate objective functions and/or first- and second-order derivatives, formed by partial sums, have been proposed and analysed in the literature, see e.g., [3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 19, 20, 25, 36, 37, 38].

Concerning the approximation of the objective function, one possible approach is to use relatively rough approximations at early stages of the optimization procedure and gradually increase the accuracy to arrive at full precision at the late stage of the iterative procedure; the gradient is approximated accordingly. This way one hopes to save computational effort and yet eventually solve the original problem. The term scheduling is often used to describe the approximation of the objective function by means of a partial sum. A number of algorithms have been proposed for the scheduling problem, ranging from simple heuristics that increase the number of terms in the partial sum by a certain percentage in each iteration [5, 10, 20, 36] to more elaborate schemes that connect the progress achieved during the optimization procedure to the number of terms in the partial sum [1, 2, 3, 4, 5, 7, 8, 9, 13, 17, 27, 28, 29, 33, 35].

Besides the problem of scheduling, one has to decide whether to employ a first- or a second-order optimization method. A detailed survey is presented in [11]. A number of first-order methods have been proposed and analysed in the literature. Given that the main cost comes from large $N$, one might be tempted to conclude that computing Hessians, or some other second-order information, is prohibitively costly and thus opt for a first-order method, especially if problem (1) is to be solved with limited precision. However, several recent papers report that carefully adjusted and implemented second-order methods are worth considering if the true Hessian is approximated by a partial sum of Hessians $\nabla^{2}\phi_{i}(x)$ consisting of significantly fewer than $N$ terms. This way one can generate useful information at a significantly smaller cost than that of the true Hessian and gain an advantage over first-order methods in terms of resilience to problem ill-conditioning and low sensitivity to parameter tuning [6, 5, 10, 13, 12, 19, 34, 37, 38, 36].

The method we present here combines the Inexact Restoration (IR) framework with the trust-region optimization method [16] to simultaneously design the scheduling and the optimization procedure for solving (1) and represents a new approach for the problem under consideration.

The Inexact Restoration method, introduced in [31], is a constrained optimization tool particularly suitable for problems where one does not want to enforce feasibility in all iterations. The key idea of the IR approach is to treat feasibility and optimality in a modular way and to improve each one in separate procedures; the combination of feasibility and optimality is then monitored through a suitable merit function. Each iteration ensures a sufficient decrease of this merit function and therefore, under certain assumptions, convergence to a feasible optimal point. In [30, 31] the combination of the IR strategy with trust-region methods is proposed and analysed for general constrained problems.

The application of IR strategy to the unconstrained optimization problem (1) requires its reformulation as a constrained problem. Letting $I_{M}$ be an arbitrary nonempty subset of $\{1,\ldots,N\}$ of cardinality $|I_{M}|$ equal to $M$ , we reformulate problem (1) as

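$$\min_{x\in\mathbb{R}^{n}} f_{M}(x)=\frac{1}{M}\sum_{i\in I_{M}}\phi_{i}(x),\quad\mbox{s.t. } M=N. \tag{2}$$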
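As an illustration of the subsampled objective $f_{M}$ in (2), the following minimal sketch draws a random index set $I_{M}$ without replacement and averages the selected terms. The function name `subsampled_objective`, the toy terms $\phi_{i}(x)=(x-i)^{2}$ and the fixed random seed are assumptions made for this example only; they do not come from the paper.

```python
import random

def subsampled_objective(phis, x, M, rng):
    """Return (f_M(x), I_M) for a random index set I_M of cardinality M.

    phis is a list of N callables phi_i; indices are 0-based here.
    """
    N = len(phis)
    if not 1 <= M <= N:
        raise ValueError("M must satisfy 1 <= M <= N")
    # Sample M distinct indices from {0, ..., N-1} (no replacement).
    I_M = rng.sample(range(N), M)
    f_M = sum(phis[i](x) for i in I_M) / M
    return f_M, I_M

# Toy problem (illustrative assumption): phi_i(x) = (x - i)^2, N = 10.
phis = [lambda x, i=i: (x - i) ** 2 for i in range(10)]
rng = random.Random(0)

# M = N recovers the full objective f_N; M < N gives a cheaper approximation.
f_full, idx_full = subsampled_objective(phis, 2.0, len(phis), rng)
f_half, idx_half = subsampled_objective(phis, 2.0, 5, rng)
```

With $M=N$ the index set is the whole of $\{1,\ldots,N\}$, so the constraint in (2) is satisfied and the value coincides with $f_{N}(x)$.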

Evaluating infeasibility in (2) is cheap while computing the objective function is expensive whenever $M$ is large. Thus, using the reasoning from [30, 31] we define a new algorithm that exploits the structure of the problem considered and takes advantage of the modular structure of IR and the trust-region optimization method at the same time. Specifically, the trust-region mechanism is applied to model $f_{M}$ at each iteration and the IR framework is applied to test for the acceptance of the iterates and to determine the scheduling sequence, i.e. the value of $M$ through the iterations. The acceptance test for the new iterate allows us to deal with inaccuracy in the function and its derivatives. In particular, the number of terms in the partial sum is fixed at the beginning of each iteration in the restoration phase and possibly changed in the optimality phase where the trial iterate is computed.

Clearly, the higher the feasibility, the more accurate $f_{M}$ is with respect to $f_{N}$. The new procedure has two important properties: partial sums, possibly consisting of small sets of $\phi_{i}$'s, can be used in the early stage of the iterative procedure to decrease the computational cost; the original objective function in (1) is recovered for all iteration indices large enough, thus allowing for the solution of the given problem. Clearly, when full precision of the objective function and the gradient is reached, one can rely on the theory and machinery of standard trust-region methods [16].

The scheme presented here applies to both first- and second-order trust-region models. If a linear model is used, the resulting procedure is a subsampled gradient method with variable stepsize. When second-order models are used, the Hessian can be approximated using a subset of the sample used to approximate function and gradient. The error in such Hessian approximation plays an important role in the asymptotic convergence rate. In the case of strongly convex problems, the analysis for local linear convergence rate is presented, both in deterministic and probabilistic settings, and an adaptive choice of the sample for Hessian approximation is proposed.

We also provide a function evaluation complexity result which resembles the classical result for the trust-region methods for (1) and the results obtained in [8]. It is shown that at most $O(\varepsilon^{-2})$ evaluations of the possibly subsampled function $f_{M}$ , $M\leq N$ , and its derivatives are needed to compute a first-order approximate critical point. Then the worst-case complexity of the standard trust-region is recovered with expected significant computational savings due to scheduling.

Our approach differs considerably from the IR procedure and trust-region method in [30, 31] since the objective function in our formulation changes with $M$ through the iterations. It also differs from the IR approaches in [29, 7, 8], which employ an approximate objective function and its derivatives and have been successfully applied to constrained and unconstrained problems, including problem (1); in [29, 7] the IR is combined with a line search strategy, while in [8] the considered problem is constrained and regularization techniques are used in the optimization phase. The approach presented here relies on [8] in terms of the general idea, but the problem is more specific, being a finite sum rather than a general objective function computed approximately, and being unconstrained. These specifications allow us to design an efficient sample update rule which is connected with the trust-region size.

The value of $M$ is fixed via a deterministic rule while the trust-region schemes in [25, 38, 9], approximating either functions, gradients and Hessians [25, 9] or Hessians only [38], are designed using sample sets whose cardinality is determined by high probability and nonasymptotic convergence analysis.

The nature of IR allows changes in the feasibility through iterations and the change is not necessarily monotone, i.e., the cardinality of the subset that defines the approximate objective can both increase and decrease, depending on the feedback from the trust-region progress made in each iteration. The case where $M$ is increased by a prefixed percentage at each iteration is a particular case of our strategy. In this latter case our method differs from a straightforward subsampled trust-region procedure with increasing sample size in both the merit function and the acceptance criterion. Remarkably, their use allows us to prove optimal complexity results that otherwise require adaptive accuracy requirements [9].

This paper is organized as follows. In Section 2 we present our method and prove that it is well defined. Furthermore, we prove that full accuracy is eventually reached and that a set of standard assumptions yields first-order stationary points. Some issues concerning the realization of the procedure are considered in Section 3; the scheduling rule is modified to avoid unproductive decrease in precision and a discussion on first- and second-order trust-region models is provided. Section 4 deals with strongly convex problems; we prove q-linear convergence as well as q-linear convergence in expectation under probabilistic bounds for Hessian subsampling. Section 5 provides worst-case function evaluation complexity. The numerical performance of the proposed method is tested on a set of classification problems and the results are reported in Section 6.

2 The Algorithm

Let $I_{M}$ be an arbitrary nonempty subset of $\{1,\ldots,N\}$ of cardinality $|I_{M}|$ equal to $M,$

$$I_{M}\subseteq\{1,\ldots,N\},\quad |I_{M}|=M,\quad M\geq 1,$$

and reformulate (1) as the constrained problem (2). We measure the level of infeasibility with respect to the constraint $M=N$ by the function $h$ with the following properties.

Assumption 2.1

Let $h:\mathbb{N}\rightarrow\mathbb{R}$ be a monotone, strictly decreasing function such that $h(1)>0$, $h(N)=0$.

This assumption implies

$$\underline{h}\leq h(M)\ \mbox{ if }\ 0<M<N,\quad\mbox{and}\quad h(M)\leq\bar{h}\ \mbox{ if }\ 0<M\leq N, \tag{3}$$

for $M\in\mathbb{N}$, with $\underline{h}=h(N-1)$ and $\bar{h}=h(1)$. One possible choice for $h$ is $h(M)=(N-M)/N,\ 0<M\leq N$.

Suppose that $\phi_{i}$, $1\leq i\leq N$, are continuously differentiable and let $\|\cdot\|$ denote the 2-norm.
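The choice $h(M)=(N-M)/N$ can be checked against Assumption 2.1 with a few lines of code. The sketch below uses exact rational arithmetic; the value $N=1000$ and the names `h`, `h_lower` and `h_upper` are illustrative assumptions.

```python
from fractions import Fraction

def h(M, N):
    """Infeasibility measure h(M) = (N - M) / N for 0 < M <= N (exact)."""
    if not 0 < M <= N:
        raise ValueError("M must satisfy 0 < M <= N")
    return Fraction(N - M, N)

N = 1000  # illustrative sample count

# Bounds used in (3): h_lower = h(N - 1) and h_upper = h(1).
h_lower = h(N - 1, N)
h_upper = h(1, N)

# Assumption 2.1: strictly decreasing, h(1) > 0, h(N) = 0.
strictly_decreasing = all(h(M, N) > h(M + 1, N) for M in range(1, N))
```

For this choice $\underline{h}=1/N$ and $\bar{h}=(N-1)/N$, so the lower bound in (3) shrinks as $N$ grows.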

The method introduced in this section combines the Inexact Restoration, an approach for optimization of functions evaluated inexactly, with the trust-region methods. We will refer to it as iretr. It employs the merit function

$$\Psi(x,M,\theta)=\theta f_{M}(x)+(1-\theta)h(M), \tag{4}$$

with $\theta\in(0,1)$ and aims to minimize both $f_{M}$ and the infeasibility $h$. Since the reductions in the values of $f_{M}$ and $h$ may not be achieved simultaneously, a weight $\theta$ is used and a trust-region method is employed to generate a sequence $\{(x_{k},N_{k},\theta_{k})\}$ such that $\Psi(x_{k},N_{k},\theta_{k})<\Psi(x_{k-1},N_{k-1},\theta_{k})$. The main theoretical properties of the new method, shown in the next section, are: the sequence $\{\theta_{k}\}$ is nonincreasing and uniformly bounded away from zero, $N_{k}=N$ for all $k$ sufficiently large and $\|\nabla f_{N}(x_{k})\|\rightarrow 0$ as $k\rightarrow\infty$.
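The merit function (4) is a convex combination of the subsampled objective value and the infeasibility. A minimal sketch, assuming that the values $f_{M}(x)$ and $h(M)$ have already been computed (the function name `merit` and the numbers are hypothetical):

```python
def merit(f_M_value, h_value, theta):
    """Merit function Psi = theta * f_M(x) + (1 - theta) * h(M), theta in (0, 1)."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie in the open interval (0, 1)")
    return theta * f_M_value + (1.0 - theta) * h_value

# Illustrative values: f_M(x) = 2.0, h(M) = 0.5, theta = 0.25.
psi = merit(2.0, 0.5, 0.25)

# At the full sample h(N) = 0, so Psi reduces to theta * f_N(x).
psi_full = merit(2.0, 0.0, 0.25)
```

A small $\theta$ puts more weight on restoring feasibility (enlarging the sample), while $\theta$ close to 1 emphasizes the decrease of the subsampled objective.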

Concerning the trust-region problem, suppose that $x_{k}$ is given. Then, a trial sample size $N_{k+1}$ is selected, $I_{N_{k+1}}\subseteq\{1,\ldots,N\}$ is chosen and the model $m_{k}(p)$ for $f_{N_{k+1}}$ around $x_{k}$ of the form

$$m_{k}(p)=f_{N_{k+1}}(x_{k})+\nabla f_{N_{k+1}}(x_{k})^{T}p+\frac{1}{2}p^{T}B_{k+1}p,$$

is built. Here $\nabla f_{N_{k+1}}$ denotes the gradient of $f_{N_{k+1}}$ and $B_{k+1}\in\mathbb{R}^{n\times n}$ is a symmetric approximation to the Hessian $\nabla^{2}f_{N_{k+1}}(x_{k})$ in case $\phi_{i}$, $1\leq i\leq N$, are twice continuously differentiable. Trivially $m_{k}(0)=f_{N_{k+1}}(x_{k})$ and the smaller $h(N_{k+1})$, the higher the accuracy of the approximation to $f_{N}$ and $\nabla f_{N}$. Then, letting $\Delta_{k}>0$ denote the trust-region radius and ${\cal B}_{k}=\{x_{k}+p\in\mathbb{R}^{n}:\|p\|\leq\Delta_{k}\}$ be the trust-region, the trust-region problem is

$$\min_{\|p\|\leq\Delta_{k}} m_{k}(p). \tag{6}$$

As in the standard trust-region schemes, problem (6) is solved approximately and the computed step $p_{k}$ is required to provide a sufficient reduction in the model in terms of the Cauchy step $p_{k}^{C}$ , i.e., the minimizer of the model $m_{k}$ along the steepest descent $-\nabla f_{N_{k+1}}(x_{k})$ within ${\cal B}_{k}$

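$$p_{k}^{C}=\operatorname*{argmin}_{\substack{p=-t\nabla f_{N_{k+1}}(x_{k}),\ t>0\\ \|p\|\leq\Delta_{k}}} m_{k}(p).$$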
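For a quadratic model the Cauchy step has a well-known closed form: along $p=-tg$ with $g=\nabla f_{N_{k+1}}(x_{k})$, the model is a one-dimensional quadratic in $t$, minimized either at $t=\|g\|^{2}/(g^{T}B_{k+1}g)$ or at the trust-region boundary $t=\Delta_{k}/\|g\|$. The sketch below implements this standard formula with plain Python lists; the function name `cauchy_step` and the test data are assumptions for illustration.

```python
import math

def cauchy_step(g, B, delta):
    """Minimize g^T p + 0.5 p^T B p over p = -t g, t > 0, ||p|| <= delta."""
    n = len(g)
    g_norm = math.sqrt(sum(gi * gi for gi in g))
    if g_norm == 0.0:
        return [0.0] * n  # zero gradient: no descent along -g
    # Curvature of the model along the steepest descent direction.
    Bg = [sum(B[i][j] * g[j] for j in range(n)) for i in range(n)]
    gBg = sum(gi * bgi for gi, bgi in zip(g, Bg))
    t_max = delta / g_norm  # step length that reaches the boundary
    if gBg <= 0.0:
        t = t_max  # nonpositive curvature: go to the boundary
    else:
        t = min(g_norm ** 2 / gBg, t_max)
    return [-t * gi for gi in g]

identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
p_interior = cauchy_step([3.0, 4.0], identity, 10.0)  # unconstrained minimizer
p_boundary = cauchy_step([3.0, 4.0], identity, 2.5)   # clipped by the radius
p_linear = cauchy_step([3.0, 4.0], zero, 10.0)        # first-order model, B = 0
```

The case $B_{k+1}=0$ corresponds to a linear model, for which the Cauchy step always reaches the trust-region boundary.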

Then, if a sufficient reduction in the function $\Psi$ is achieved, the step $p_{k}$ is accepted and the new iterate $x_{k+1}$ is set equal to $x_{k}+p_{k}$ . Otherwise, the step is rejected and the trust-region radius is reduced. The specific form of the predicted and actual reduction used in the acceptance criterion will be given below, after detailing the Algorithm’s steps.

Now we present the new Algorithm iretr, which aims at finding an $\varepsilon_{g}$-accurate first-order optimality point defined as follows

[TABLE]

and comment on it, see Algorithm 1.

Given $x_{k}$, $N_{k}$ and $\theta_{k}$ we describe the $k$th iteration. In Step 1 the feasibility is improved. If $N_{k}<N$, we predict the cardinality $\widetilde{N}_{k+1}$ such that the value $h(\widetilde{N}_{k+1})$ is smaller than $h(N_{k})$ and at most equal to a prefixed fraction of $h(N_{k})$. In case $h(M)=(N-M)/N,\ 0<M\leq N$, taking into account that $N_{k}$ and $\widetilde{N}_{k+1}$ are integers, it can be shown that condition (11) holds if and only if $0<N_{k}<\widetilde{N}_{k+1}$ provided that $h(2)/h(1)<r<1$.

In Step 2, an attempt is made to reduce the computational effort, i.e. to enlarge infeasibility; $N_{k+1}$ is chosen such that $N_{k+1}\leq\widetilde{N}_{k+1}$ and the bounded deterioration (12) on the value of $h(N_{k+1})$ with respect to $h(\widetilde{N}_{k+1})$ is imposed. In principle such control allows us to reduce $N_{k+1}$ below both $N_{k}$ and $\widetilde{N}_{k+1}$. On the other hand, the upper bound in (12) depends on the trust-region radius and $N_{k+1}$ will be equal to $\widetilde{N}_{k+1}$ whenever $\Delta_{k}$ is small enough. If $N_{k}=N$, the stopping criterion $\|\nabla f_{N_{k+1}}(x_{k})\|\leq\varepsilon_{g}$ is checked. This is supported by the fact that, when $N_{k}=N$, we may expect $N_{k+1}$ to be close to $N$ and $\nabla f_{N_{k+1}}(x_{k})$ to be close to $\nabla f_{N}(x_{k})$ in a probabilistic sense; we will further discuss this issue in Section 3. If (10) is not met, using $I_{N_{k+1}}\subseteq\{1,\ldots,N\}$, the trust-region model $m_{k}(p)$ is built and (6) is approximately solved. The computed step $p_{k}$ is required to provide the sufficient reduction (13) in the model in terms of the Cauchy step $p_{k}^{C}$.
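For the choice $h(M)=(N-M)/N$, the restoration condition $h(\widetilde{N}_{k+1})\leq r\,h(N_{k})$ is equivalent to $\widetilde{N}_{k+1}\geq N-r(N-N_{k})$. The sketch below returns the smallest integer satisfying it. Picking the smallest admissible value is an assumption of this example, since Step 1 only requires the inequality to hold; the function name `smallest_restored_size` is likewise hypothetical.

```python
import math
from fractions import Fraction

def smallest_restored_size(N_k, N, r):
    """Smallest integer N_tilde with (N - N_tilde)/N <= r * (N - N_k)/N."""
    r = Fraction(r)  # exact arithmetic avoids rounding issues in ceil
    if not 0 < r < 1:
        raise ValueError("r must lie in (0, 1)")
    if not 0 < N_k <= N:
        raise ValueError("N_k must satisfy 0 < N_k <= N")
    # (N - N_tilde) <= r (N - N_k)  <=>  N_tilde >= N - r (N - N_k)
    return math.ceil(N - r * (N - N_k))

# Illustrative values: N = 1000 terms, current sample N_k = 100, r = 9/10.
N_tilde = smallest_restored_size(100, 1000, Fraction(9, 10))

# Once the full sample is reached it is kept by the restoration step.
N_tilde_full = smallest_restored_size(1000, 1000, Fraction(9, 10))
```

Because $r<1$, the returned value is strictly larger than $N_{k}$ whenever $N_{k}<N$, in agreement with the equivalence stated above.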

The acceptance rule for $p_{k}$ in Step 5 depends on the predicted and actual reduction defined as follows:

[TABLE]

where the last equality follows from (4). We observe that ${\rm{Pred}}_{k}$ uses the last accepted values $f_{N_{k}}(x_{k})$ and $N_{k}$ and is a linear combination of two predicted values: the predicted model decrease $f_{N_{k}}(x_{k})-m_{k}(p_{k})$ and the predicted infeasibility decrease $h(N_{k})-h(\widetilde{N}_{k+1})$ . As for ${\rm{Ared}}_{k}$ , given $\theta$ , it measures the actual reduction of $\Psi$ .
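Read against the description above and the merit function (4), the two reductions can be written as follows. This is a reconstruction from the surrounding prose and from the inequalities used in the proof of Lemma 2.2, not a verbatim copy of the displayed definitions:

$$
{\rm Pred}_{k}(\theta)=\theta\,\bigl(f_{N_{k}}(x_{k})-m_{k}(p_{k})\bigr)+(1-\theta)\,\bigl(h(N_{k})-h(\widetilde{N}_{k+1})\bigr),
$$

$$
{\rm Ared}_{k}(\theta)=\Psi(x_{k},N_{k},\theta)-\Psi(x_{k}+p_{k},N_{k+1},\theta)=\theta\,\bigl(f_{N_{k}}(x_{k})-f_{N_{k+1}}(x_{k}+p_{k})\bigr)+(1-\theta)\,\bigl(h(N_{k})-h(N_{k+1})\bigr).
$$

Note that ${\rm Pred}_{k}$ involves the restored size $\widetilde{N}_{k+1}$, whereas ${\rm Ared}_{k}$ involves the sample size $N_{k+1}$ actually used at the trial point.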

The new penalty parameter $\theta_{k+1}$ computed in Step 4 is the largest value that ensures

$${\rm Pred}_{k}(\theta_{k+1})\geq\eta\,(h(N_{k})-h(\widetilde{N}_{k+1}))\geq 0, \tag{19}$$

as $h(N_{k})-h(\widetilde{N}_{k+1})\geq 0$ by (11). In case $N_{k}<\widetilde{N}_{k+1}$ such condition implies ${\rm{Pred}}_{k}(\theta_{k+1})$ strictly positive. In case $N_{k}=\widetilde{N}_{k+1}=N$ , ${\rm{Pred}}_{k}(\theta)$ reduces to $\theta(f_{N}(x_{k})-m_{k}(p_{k}))$ and from (13) it follows ${\rm{Pred}}_{k}(\theta)\geq\tau\theta(m_{k}(0)-m_{k}(p_{k}^{C}))>0$ whenever $N_{k+1}=N$ . On the other hand, in case $N_{k}=\widetilde{N}_{k+1}=N$ and $N_{k+1}<N$ , Step 3 is necessary to enforce positivity of ${\rm{Pred}}_{k}(\theta_{k+1})$ as $m_{k}(0)=f_{N_{k+1}}(x_{k})\neq f_{N}(x_{k})$ . In fact, ${\rm{Pred}}_{k}(\theta)>0$ follows from taking a step such that $f_{N}(x_{k})-m_{k}(p_{k})\geq\tau(m_{k}(0)-m_{k}(p_{k}^{C}))$ . We further notice that attempting $N_{k+1}<N$ when $N_{k}=N$ is meaningful if the model is a good approximation of $f_{N}$ around $x_{k}$ and thus one can expect some progress, or at least a limited deterioration in the value of the full objective function $f_{N}$ . Enforcing $f_{N}(x_{k})-m_{k}(p_{k})\geq\tau(m_{k}(0)-m_{k}(p_{k}^{C}))$ is a minimal requirement on the agreement between $f_{N}$ at $x_{k}$ and the model at the trial step.
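The update of the penalty parameter in Step 4 can be sketched as follows. The code assumes that ${\rm Pred}_{k}(\theta)$ is the linear combination of the model decrease $f_{N_{k}}(x_{k})-m_{k}(p_{k})$ and the infeasibility decrease $h(N_{k})-h(\widetilde{N}_{k+1})$ described in the text, with weights $\theta$ and $1-\theta$; the function name `update_theta` and the numerical values are hypothetical.

```python
def update_theta(theta_k, f_Nk, m_pk, h_Nk, h_Ntilde, eta):
    """Return theta_{k+1}: keep theta_k if Pred_k(theta_k) >= eta * dh, else shrink."""
    dh = h_Nk - h_Ntilde  # predicted infeasibility decrease, >= 0 by (11)
    pred = theta_k * (f_Nk - m_pk) + (1.0 - theta_k) * dh
    if pred >= eta * dh:
        return theta_k
    if dh == 0.0:
        # Assumption: Step 3 of the algorithm rules this case out.
        raise ValueError("Pred must be nonnegative when dh == 0")
    # Largest theta for which Pred_k(theta) >= eta * dh still holds.
    return (1.0 - eta) * dh / (m_pk - f_Nk + dh)

# Case 1: the model predicts a decrease, so theta is kept.
theta_kept = update_theta(0.9, f_Nk=2.0, m_pk=1.0, h_Nk=0.5, h_Ntilde=0.4, eta=0.1)

# Case 2: the model value exceeds f_{N_k}(x_k), so theta is reduced.
theta_new = update_theta(0.9, f_Nk=1.0, m_pk=1.5, h_Nk=0.5, h_Ntilde=0.4, eta=0.1)
```

In the second case the new value is $0.9\cdot 0.1/(0.5+0.1)=0.15$, and ${\rm Pred}_{k}(0.15)=0.15\cdot(-0.5)+0.85\cdot 0.1=0.01=\eta\,(h(N_{k})-h(\widetilde{N}_{k+1}))$, so the bound holds with equality.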

Finally, in Step 5 the step $p_{k}$ is accepted if the ratio of the actual reduction ${\rm{Ared}}_{k}(\theta_{k+1})$ to the predicted reduction ${\rm{Pred}}_{k}(\theta_{k+1})$ is at least a prefixed scalar $\eta$; otherwise the trust-region radius is reduced and the procedure is repeated starting from Step 2.

Notice that the trust-region size can be reduced several times during one iteration, i.e., only successful iterations lead to an increment of the iteration counter $k$. To emphasize this fact, within each iteration, we introduce an additional counter ${\cal T}_{k}$ for the number of decreases of the trust-region size. The feasibility measure $N_{k+1}$ might be modified several times within one iteration as well, but changes due to (12) and (14) do not necessarily correspond to the number of reductions of the trust-region size. The penalty parameter $\theta_{k}$ has an analogous behaviour. For this reason and to avoid cluttered notation, we do not introduce additional counters for $N_{k+1}$ and $\theta_{k+1}$ within the same iteration.
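The acceptance test of Step 5 compares the actual reduction of the merit function with the predicted one. In the sketch below, the actual reduction is computed as the difference of two merit function values of the form (4); the shrink factor `gamma = 0.5`, the function name `step5` and all numerical values are assumptions made for illustration, and the radius is simply left unchanged on acceptance.

```python
def step5(theta, f_old, f_trial, h_old, h_new, pred, eta, delta, gamma=0.5):
    """Return (accepted, radius) for the test Ared >= eta * Pred.

    f_old = f_{N_k}(x_k), f_trial = f_{N_{k+1}}(x_k + p_k),
    h_old = h(N_k), h_new = h(N_{k+1}).
    """
    # Actual reduction: Psi(x_k, N_k, theta) - Psi(x_k + p_k, N_{k+1}, theta).
    ared = theta * (f_old - f_trial) + (1.0 - theta) * (h_old - h_new)
    if ared >= eta * pred:
        return True, delta       # accept the step
    return False, gamma * delta  # reject and shrink the trust region

# Accepted: both the objective and the infeasibility decrease.
ok, radius_ok = step5(0.5, 1.0, 0.8, 0.5, 0.4, pred=0.1, eta=0.1, delta=1.0)

# Rejected: the trial objective value is much larger than predicted.
bad, radius_bad = step5(0.5, 1.0, 1.4, 0.5, 0.4, pred=0.1, eta=0.1, delta=1.0)
```

A rejected step only changes the radius; the iterate $x_{k}$ and the counter $k$ stay the same until a trial step passes the test.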

We start the analysis of the new method proving that the $k$ th iteration of Algorithm iretr is well defined since appropriate values of $N_{k+1}$ and $\theta_{k+1}$ will be reached in a finite number of attempts. Here and in Section 5, $B_{k+1}$ can be the null matrix and our analysis covers the use of both first-order and second-order models.

Lemma 2.1

Steps 2 and 3 of Algorithm iretr are well-defined.

Proof. For any positive $\Delta_{k}^{({\cal T}_{k})}$ inequality (12) trivially holds in the limit case $N_{k+1}=\widetilde{N}_{k+1}$. Analogously, Step 3 cannot be repeated infinitely many times since, for ${\cal T}_{k}$ large enough, $\Delta_{k}^{({\cal T}_{k})}$ will be small enough to yield $N_{k+1}=\widetilde{N}_{k+1}=N$. $\Box$

We now make the following assumption.

Assumption 2.2

$\{x_{k}\}\subset\Omega$, where $\Omega$ is a compact set in $\mathbb{R}^{n}$.

Lemma 2.2

Let Assumptions 2.1 and 2.2 hold. Suppose that $\phi_{i}$ , $1\leq i\leq N$ , are continuous in $\Omega.$ Then the sequence $\{\theta_{k}\}$ built in Algorithm iretr is positive, nonincreasing and bounded away from zero, $\theta_{k+1}\geq\underline{\theta}>0$ with $\underline{\theta}$ independent of $k$ and (19) holds.

Proof. We have $\theta_{0}>0$ and proceed by induction assuming that $\theta_{k}$ is positive. First consider the case where $N_{k}=\widetilde{N}_{k+1}$ (equivalently $N_{k}=\widetilde{N}_{k+1}=N$ ). Then $h(N_{k})-h(\widetilde{N}_{k+1})=0$ and, due to Step 3, ${\rm{Pred}}_{k}(\theta)=\theta(f_{N_{k}}(x_{k})-m_{k}(p_{k}))>0$ for any positive $\theta$ . Thus $\theta_{k+1}=\theta_{k}$ and (19) holds.

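Now suppose $N_{k}<\widetilde{N}_{k+1}$. If the inequality ${\rm{Pred}}_{k}(\theta_{k})\geq\eta(h(N_{k})-h(\widetilde{N}_{k+1}))$ holds, then $\theta_{k+1}=\theta_{k}$ satisfies (19). Otherwise, we have

$$\theta_{k}\left(f_{N_{k}}(x_{k})-m_{k}(p_{k})-(h(N_{k})-h(\widetilde{N}_{k+1}))\right)<(\eta-1)(h(N_{k})-h(\widetilde{N}_{k+1})),$$

and since the right-hand side is negative by construction, it follows

$$f_{N_{k}}(x_{k})-m_{k}(p_{k})-(h(N_{k})-h(\widetilde{N}_{k+1}))<0.$$

Consequently, ${\rm{Pred}}_{k}(\theta)\geq\eta(h(N_{k})-h(\widetilde{N}_{k+1}))$ is satisfied if

$$\theta\left(f_{N_{k}}(x_{k})-m_{k}(p_{k})-(h(N_{k})-h(\widetilde{N}_{k+1}))\right)\geq(\eta-1)(h(N_{k})-h(\widetilde{N}_{k+1})),$$

i.e., if

$$\theta\leq\theta_{k+1}\stackrel{\rm def}{=}\frac{(1-\eta)(h(N_{k})-h(\widetilde{N}_{k+1}))}{m_{k}(p_{k})-f_{N_{k}}(x_{k})+h(N_{k})-h(\widetilde{N}_{k+1})}.$$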

Hence $\theta_{k+1}$ is the largest value satisfying (19) and $\theta_{k+1}<\theta_{k}.$

Let us now prove that $\theta_{k+1}\geq\underline{\theta}.$ Using Assumption 2.2 and the continuity of $\phi_{i}$, $1\leq i\leq N$, let

$$\kappa_{\phi}=\max_{\substack{1\leq i\leq N\\ x\in\Omega}}|\phi_{i}(x)|.$$

Then, using (3), for $M$ such that $0<M\leq N$ there holds

[TABLE]

and therefore for any integer $M$ , $0<M\leq N$

[TABLE]

Also note that by (11) and (3)

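$$h(N_{k})-h(\widetilde{N}_{k+1})\geq(1-r)h(N_{k})\geq(1-r)\underline{h}.$$

Moreover,

$$\begin{aligned}
m_{k}(p_{k})-f_{N_{k}}(x_{k})+h(N_{k})-h(\widetilde{N}_{k+1}) &\leq m_{k}(p_{k})-f_{N_{k}}(x_{k})+h(N_{k})\\
&\leq m_{k}(0)-f_{N_{k}}(x_{k})+\bar{h}=f_{N_{k+1}}(x_{k})-f_{N_{k}}(x_{k})+\bar{h}\\
&\leq |f_{N_{k+1}}(x_{k})-f_{N}(x_{k})|+|f_{N}(x_{k})-f_{N_{k}}(x_{k})|+\bar{h}\\
&\leq \sigma\,(h(N_{k})+h(N_{k+1}))+\bar{h}\leq(2\sigma+1)\bar{h},
\end{aligned}$$

and $\theta_{k+1}$ in (15) satisfies

$$\theta_{k+1}\geq\frac{(1-\eta)(1-r)\underline{h}}{(2\sigma+1)\bar{h}}\stackrel{\rm def}{=}\underline{\theta},$$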

and the proof is completed. $\Box$
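As a worked instance of the bound $\underline{\theta}$, take the choice $h(M)=(N-M)/N$ suggested after Assumption 2.1, for which $\underline{h}=h(N-1)=1/N$ and $\bar{h}=h(1)=(N-1)/N$. Writing $\sigma$ for the constant in the bound on $|f_{N}(x_{k})-f_{M}(x_{k})|$ used in the proof, the lower bound becomes

$$
\underline{\theta}=\frac{(1-\eta)(1-r)\,\underline{h}}{(2\sigma+1)\,\bar{h}}=\frac{(1-\eta)(1-r)\,\frac{1}{N}}{(2\sigma+1)\,\frac{N-1}{N}}=\frac{(1-\eta)(1-r)}{(2\sigma+1)(N-1)}.
$$

For this particular $h$ the guaranteed lower bound on the penalty parameter is therefore positive and independent of $k$, but it decays like $1/(N-1)$ as the number of terms grows.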

To establish the well-definedness of Steps 4 and 5, we make the following assumptions.

Assumption 2.3

The gradients $\nabla\phi_{i}$ , $1\leq i\leq N$ , are Lipschitz continuous on the segments $[x_{k},x_{k}+p_{k}]$ , for all $k\geq 0$ and for all $p_{k}$ generated in the repetition of Steps 2–5.

Assumption 2.4

There exists positive $\kappa_{B}$ such that for all $k$

$$\|B_{k+1}\|\leq\kappa_{B}.$$

By Assumption 2.3 there is a $t\in(0,1)$ such that

[TABLE]

[18, Lemma 4.1.2]. Consequently, using Assumptions 2.2–2.4 we have

[TABLE]

with $\kappa_{T}=(L+\kappa_{B}/2)$ and $L$ depending on the Lipschitz constants of $\nabla\phi_{i}$ , $1\leq i\leq N$ .

In the next result we use the key inequality

[TABLE]

with $\beta=1+\kappa_{B}$ , see [16, Theorem 6.3.1].

Lemma 2.3

Let Assumptions 2.1–2.4 hold. Assume $\theta_{k}\in(0,1)$ and $\theta_{k+1}$ as in (15). Then, Steps 4 and 5 of Algorithm iretr are well defined.

Proof. Let us prove that ${\rm{Ared}}_{k}(\theta_{k+1})-\eta{\rm{Pred}}_{k}(\theta_{k+1})$ is strictly positive if $\Delta_{k}^{({\cal T}_{k})}$ is small enough, i.e., after a finite number ${\cal T}_{k}$ of reductions of the trust-region radius. Let $\theta_{k+1}$ be computed at Step 4 for some $\Delta_{k}^{({\cal T}_{k})}.$ By (17) and (18), we have

[TABLE]

We now distinguish three cases.

$i)$ If $h(N_{k})-h(\widetilde{N}_{k+1})>0$ then using (19) we get

[TABLE]

The first term in the above right hand-side is strictly positive and uniformly bounded from below due to (22). On the other hand, by (23) and (12)

[TABLE]

Therefore, for $\Delta_{k}^{({\cal T}_{k})}$ small enough we have ${\rm{Ared}}_{k}(\theta_{k+1})-\eta{\rm{Pred}}_{k}(\theta_{k+1})>0$ and the iteration finishes.

$ii)$ If $h(N_{k})-h(\widetilde{N}_{k+1})=0$ (equivalently $N_{k}=\widetilde{N}_{k+1}=N$ ) and $N_{k+1}=N$ then using (17) and (18) we have

[TABLE]

Thus, by (13), (23) and (24), if $\Delta_{k}^{({\cal T}_{k})}$ is small enough we get

[TABLE]

and the last bound is positive for some finite ${\cal T}_{k}$ .

$iii)$ Finally, suppose $h(N_{k})-h(\widetilde{N}_{k+1})=0$ (equivalently $N_{k}=\widetilde{N}_{k+1}=N$ ) and $N_{k+1}<N$ . Then, using (17) and (18), we have

[TABLE]

Thus, by Step 3 of Algorithm 2.1, (23) and (24), if $\Delta_{k}^{({\cal T}_{k})}$ is small enough we get

[TABLE]

and the last bound is positive for some finite ${\cal T}_{k}$ . $\Box$

The analysis presented in the rest of this section concerns the case where Algorithm iretr is invoked with $\varepsilon_{g}=0$ and does not terminate in a finite number of steps. Each iteration $k-1$ of the Algorithm ends up with the accepted iterate $x_{k}=x_{k-1}+p_{k-1}$ and the final sample size $N_{k}.$ In the following statements we are going to prove that $h(N_{k})\to 0$ and therefore the full sample is eventually reached and maintained.

Theorem 2.4

Let Assumptions 2.1–2.4 hold. Then $h(N_{k})\to 0$ .

Proof. Inequalities (11) and (19) imply

[TABLE]

We prove by contradiction that $\lim_{k\rightarrow\infty}{\rm{Pred}}_{k}(\theta_{k+1})=0$ .

Taking into account that at termination of iteration $k$ we have $x_{k+1}=x_{k}+p_{k}$ and ${\rm{Ared}}_{k}(\theta_{k+1})\geq\eta{\rm{Pred}}_{k}(\theta_{k+1})$ , using (16) and (18) we have

[TABLE]

Using (20) we can rewrite the above inequality as

[TABLE]

Then using recurrence, and $-(1-\theta_{k+1})h(N_{k+1})\leq 0$ we get

[TABLE]

Repeating this argument, using $(\theta_{j}-\theta_{j+1})\geq 0$ from Lemma 2.2 and (3) we obtain

[TABLE]

By (21) and (3) we have

[TABLE]

and therefore

[TABLE]

where

[TABLE]

is independent of $k$ .

Noting that ${\rm Pred}_{j}(\theta_{j+1})\geq 0$ , we conclude that if ${\rm Pred}_{j}(\theta_{j+1})$ does not tend to zero, then $\sum_{j=0}^{\infty}{\rm Pred}_{j}(\theta_{j+1})$ diverges, which implies that $f_{N}$ is unbounded below in $\Omega$ . This contradicts the compactness of $\Omega$ , on which the continuous function $f_{N}$ is bounded below. $\Box$

Corollary 2.5

Let Assumptions 2.1–2.4 hold. Then $N_{k}=N$ for all $k$ sufficiently large.

Proof. By Theorem 2.4 and Assumption 2.1, it follows $h(N_{k})<h(N-1)$ for all $k$ sufficiently large. This implies $N_{k}=N$ . $\Box$

Corollary 2.6

Let Assumptions 2.1–2.4 hold. Then, for $k$ sufficiently large, the iterations are generated by a (standard) trust-region scheme on $f_{N}$ and

i) $\mathop{\rm liminf}_{k\rightarrow\infty}\|\nabla f_{N}(x_{k})\|=0$ .

ii) $\lim_{k\rightarrow\infty}\|\nabla f_{N}(x_{k})\|=0$ , provided that $f_{N}$ is Lipschitz continuous in $\Omega$ .

Proof. By Corollary 2.5 we know that at termination of iteration $k-1$ we have $N_{k}=N$ for all $k$ sufficiently large. Thus, eventually, $x_{k+1}=x_{k}+p_{k}$ with $p_{k}$ satisfying (16), which now takes the form of the standard acceptance rule of the trial point in trust-region methods, i.e.,

[TABLE]

As a consequence, Theorem 4.6 in [32] yields item $i)$ . Item $ii)$ is guaranteed by [32, Theorem 4.7]. $\Box$

3 On the realization of the algorithm

The realization of Algorithm iretr raises many issues and in this section we discuss two important aspects: the form of the model used and its properties, and a computationally convenient adaptation of the rule for choosing $N_{k+1}$ when $k$ is large. We will further address implementation issues in Section 6.

Various models of the form (5) can be built. One possibility is the linear model

[TABLE]

which gives rise to a gradient method and step $p_{k}$

[TABLE]

Namely, Algorithm iretr becomes a subsampled gradient method with variable stepsize determined according to the trust-region strategy.
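
As a minimal sketch of this special case (not the authors' implementation), recall that minimizing a linear model $g^{T}p$ over the ball $\|p\|\leq\Delta$ gives the scaled steepest-descent step $-\Delta g/\|g\|$ . The function name `linear_model_step` and the toy gradient below are assumptions introduced only for illustration.

```python
import numpy as np

def linear_model_step(grad, delta):
    """Minimizer of the linear model grad^T p over the ball ||p|| <= delta."""
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return np.zeros_like(grad)  # the sampled model is already stationary
    return -(delta / norm) * grad

g = np.array([3.0, -4.0])           # hypothetical subsampled gradient
p = linear_model_step(g, delta=0.5)
print(p, np.linalg.norm(p))         # the step always has length delta
```

In this reading the trust-region radius plays the role of the stepsize, so shrinking or enlarging $\Delta_{k}$ directly shortens or lengthens the gradient step.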

Another possibility is to use quadratic models of the form

[TABLE]

and fully exploit the advantages of the trust-region framework. If all functions $\phi_{i}$ are twice continuously differentiable one can build the quadratic model

[TABLE]

with $1\leq D_{k+1}\leq N_{k+1}$ and $I_{D_{k+1}}\subseteq I_{N_{k+1}}$ . In fact, the Hessian matrix $\nabla^{2}f_{N_{k+1}}(x)$ is approximated via subsampling by

[TABLE]

The cardinality of $I_{D_{k+1}}$ now controls the precision of the Hessian approximation and allows for a trade-off between precision and computational cost. This particular form of Hessian approximation will be analysed in detail for strongly convex functions in the next section.
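
The following sketch illustrates how a subsampled Hessian can be applied to a vector without forming the matrix. It assumes logistic components $\phi_{i}(x)=\log(1+e^{-b_{i}a_{i}^{T}x})$ with $b_{i}\in\{-1,+1\}$ , for which the Hessian of $\phi_{i}$ is $\sigma(a_{i}^{T}x)(1-\sigma(a_{i}^{T}x))a_{i}a_{i}^{T}$ ; the function name, the random data and the sample sizes are hypothetical.

```python
import numpy as np

def subsampled_hessian_vec(A, x, v, idx):
    """Apply (1/D) * sum_{i in idx} Hess phi_i(x) to v for logistic phi_i."""
    A_D = A[idx]                        # rows a_i with i in I_D
    z = A_D @ x
    s = 1.0 / (1.0 + np.exp(-z))        # sigmoid(a_i^T x)
    w = s * (1.0 - s)                   # curvature weight of each phi_i
    return A_D.T @ (w * (A_D @ v)) / len(idx)

rng = np.random.default_rng(0)
N, n = 200, 5                           # hypothetical problem sizes
A = rng.standard_normal((N, n))
x = rng.standard_normal(n)
v = rng.standard_normal(n)

I_N = rng.choice(N, size=100, replace=False)    # function/gradient sample
I_D = rng.choice(I_N, size=20, replace=False)   # Hessian sample, subset of I_N
Hv = subsampled_hessian_vec(A, x, v, I_D)
print(Hv)
```

Each product costs two passes over the $D_{k+1}$ sampled rows, which is why the size of $I_{D_{k+1}}$ governs the cost of every conjugate gradient iteration.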

The use of quadratic models is crucial for the computation of an $(\varepsilon_{g},\varepsilon_{H})$ -approximate second-order critical point of nonconvex problems (1), i.e., a point $x$ such that

[TABLE]

Supposing that full precision is reached, $N_{k}=N$ , the trust-region problem (6) has to be solved approximately finding $p_{k}$ such that

[TABLE]

where $p_{k}^{C}$ is the Cauchy point (9) and $p_{k}^{E}$ is a negative curvature direction such that $(p_{k}^{E})^{T}\nabla^{2}f_{D_{k+1}}(x_{k})p_{k}^{E}\leq\upsilon\lambda_{\min}(\nabla^{2}f_{D_{k+1}}(x_{k}))\|p_{k}^{E}\|^{2}$ for some $\upsilon\in(0,1]$ , [16, §6.6].
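
The negative curvature condition on $p_{k}^{E}$ can be checked numerically as in the sketch below. The helper name `satisfies_negative_curvature`, the small indefinite matrix and the value $\upsilon=0.5$ are assumptions made for illustration; the condition is meaningful only when the smallest eigenvalue is negative.

```python
import numpy as np

def satisfies_negative_curvature(H, p, upsilon):
    """Check p^T H p <= upsilon * lambda_min(H) * ||p||^2."""
    lam_min = np.linalg.eigvalsh(H)[0]   # eigenvalues are returned in ascending order
    return p @ H @ p <= upsilon * lam_min * (p @ p)

H = np.array([[2.0, 0.0], [0.0, -1.0]])  # hypothetical indefinite (subsampled) Hessian
eigvals, eigvecs = np.linalg.eigh(H)
p_E = eigvecs[:, 0]                      # eigenvector of the smallest eigenvalue

print(satisfies_negative_curvature(H, p_E, upsilon=0.5))
print(satisfies_negative_curvature(H, np.array([1.0, 0.0]), upsilon=0.5))
```

An eigenvector associated with $\lambda_{\min}$ attains the condition with $\upsilon=1$ ; smaller values of $\upsilon$ admit cheaper, less accurate directions.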

We refer to [38, Theorem 1] for results on the computation of approximate second-order optimal solutions using trust-region methods with full function and gradient and subsampled Hessian.

Let us now address the choice of the stopping criterion in Algorithm iretr. Notice that the Algorithm may stop even if full precision at iteration $k$ is not achieved (i.e. $N_{k+1}<N$ ), provided that $N_{k}=N$ . This choice is supported by observing that suitable sample sizes provide an accurate approximation $\nabla f_{N_{k+1}}(x_{k})$ to $\nabla f_{N}(x_{k})$ . In fact, by [4, Theorem 6.2] $\nabla f_{N_{k+1}}(x_{k})$ is sufficiently accurate with fixed probability at least $1-p_{g}$ , i.e.,

[TABLE]

if the cardinality $N_{k+1}$ satisfies

[TABLE]

with $E(\|\nabla\phi_{i}(x_{k})-\nabla f_{N}(x_{k})\|^{2})\leq V_{g}$ and $\max_{i\in\{1,...,N\}}|\nabla\phi_{i}(x)|\leq\zeta(x)$ , and $I_{N_{k+1}}$ is sampled uniformly in $\{1,2,\ldots,N\}$ .

We conclude this section observing that, in the current form of the algorithm, at each iteration an attempt is made to use $N_{k+1}<N$ (see Step 2). By Corollary 2.5 we know that, for $k$ sufficiently large, such a value will be rejected and this fact implies useless repetitions of Steps 2–5. To overcome this drawback, we replace (12) with

[TABLE]

Then, the following result holds.

Corollary 3.1

Suppose (37) and (38) hold. For $k$ sufficiently large, the use of sets $I_{N_{k+1}}$ of cardinality smaller than $N$ is not attempted.

Proof. By Corollary 2.5 and Corollary 2.6, we know that $N_{k}=N$ for all $k$ sufficiently large and $\|\nabla f_{N}(x_{k})\|$ tends to zero. Thus, letting $k_{*}$ be the iteration index such that $\|\nabla f_{N}(x_{k})\|<h(N-1)$ , $\forall k\geq k_{*}$ , it follows $N_{k+1}=N$ , $\forall k\geq k_{*}$ . $\Box$

4 Strongly convex problems

In this section we assume that $f_{N}$ is strongly convex with strongly convex functions $\phi_{i}$ , $1\leq i\leq N$ , and analyze the local behaviour of the iretr method when full precision for the function and the gradient has been reached and a quadratic model of the following form is used:

[TABLE]

with $1\leq D_{k+1}\leq N$ , $I_{D_{k+1}}\subseteq I_{N}$ . Thus, we are focusing on the local behaviour of the trust-region method employing second order models with exact function and gradient and subsampled Hessian. Such a method has been investigated in [38] with respect to iteration complexity but not with respect to local convergence.

The additional assumptions used in this section are stated below.

Assumption 4.1

The functions $\phi_{i},\;i=1,\ldots,N$ , are twice continuously differentiable and strongly convex in $\mathbb{R}^{n}$ ,

[TABLE]

where, given two matrices $A$ and $B$ , $A\preceq B$ means that $B-A$ is positive semidefinite.

Trivially, $f_{N}$ is strongly convex and admits a unique minimizer $x^{*}$ . Moreover, if $B_{k+1}$ is as in (33), both $\lambda_{\min}(B_{k+1})\geq\lambda_{1}$ and $\lambda_{\max}(B_{k+1})\leq\lambda_{n}$ hold, and Corollary 2.6 implies $\lim_{k\rightarrow\infty}x_{k}=x^{*}$ .

The following theorem analyzes the behaviour of $\{x_{k}\}$ denoting

[TABLE]

the error between $\nabla^{2}f_{N}(x_{k})$ and $\nabla^{2}f_{D_{k+1}}(x_{k})$ . We also invoke the assumption below.

Assumption 4.2

The Hessian $\nabla^{2}f_{N}$ is Lipschitz continuous on ${\cal B}_{\delta}(x^{*}):=\{x\in\mathbb{R}^{n}:\|x-x^{*}\|\leq\delta\}$ with Lipschitz constant $2L_{H}$ .

Theorem 4.1

Suppose that Assumptions 2.1, 2.2, 4.1, 4.2 hold. Let $\{x_{k}\}$ be generated by Algorithm iretr, $\varepsilon_{g}$ as in (10), $\beta$ as in (24), $\eta$ as in the Algorithm iretr and $B_{k+1}$ given by (33).

i) Let $\epsilon\in(0,1)$ and $D_{k+1}$ such that

[TABLE]

Then, if $k$ is sufficiently large, $p_{k}$ is accepted in the first pass in Step 5 and ${\cal T}_{k}=0.$

ii) There exist sufficiently small $\delta>0$ and sufficiently large $D$ such that, for all $x_{k}\in{\cal B}_{\delta}(x^{*})$ and $D_{k+1}=D,$ the error $\|x_{k}-x^{*}\|$ reduces linearly, i.e., $\|x_{k+1}-x^{*}\|<\tilde{\tau}\|x_{k}-x^{*}\|$ for some $\tilde{\tau}\in(0,1)$ .

Proof. $i)$ Let us consider $k$ sufficiently large such that $N_{k+1}=N$ at termination of iteration $k$ . Lemma 6.5.1 in [16] gives

[TABLE]

Let us consider the step $p_{k}$ returned by iteration $k$ . Combining (42) with (24) and (13) we obtain

[TABLE]

with $\omega=\tau\min\{\frac{\lambda_{1}}{2\beta},1\}\frac{\lambda_{1}}{2}$ .

At Step 5 of the Algorithm, (16) has the form $f_{N}(x_{k})-f_{N}(x_{k}+p_{k})\geq\eta(m_{k}(0)-m_{k}(p_{k}))$ . By Assumption 4.2 and (40), it follows

[TABLE]

where $t$ is some scalar in $(0,\,1)$ [16, Theorem 3.1.2]. Now, given $\epsilon\in(0,1)$ and $D_{k+1}$ as in (41), (42) and Corollary 2.6 imply $\|p_{k}\|\leq\epsilon$ for $k$ large enough, say $k\geq\bar{k}$ , and (41) implies the acceptance of the step. Then, $\Delta_{k}$ is not reduced and $\Delta_{k}\geq\Delta_{\bar{k}}$ for any $k\geq\bar{k}$ .

$ii)$ Using (42), Corollary 2.6 and item $i)$ we can conclude that the trust-region bound becomes inactive for $k$ sufficiently large, i.e., the step

[TABLE]

is accepted eventually. Consequently, using multivariate calculus results [18, Lemma 4.1.12] and Assumption 4.1

[TABLE]

Thus, the claim follows if $\delta$ and $D_{k+1}=D$ are such that $\tilde{\tau}:=\frac{L_{H}\delta+e(D)}{\lambda_{1}}<1$ and $D$ satisfies (41).

$\Box$
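
As a purely illustrative instance of the contraction factor $\tilde{\tau}$ in item $ii)$ (the numerical values are assumptions, not taken from the paper), suppose $\lambda_{1}=1$ , $L_{H}=2$ , $\delta=0.1$ and a Hessian error $e(D)=0.3$ :

```latex
\tilde{\tau}=\frac{L_{H}\delta+e(D)}{\lambda_{1}}=\frac{2\cdot 0.1+0.3}{1}=0.5<1 .
```

More generally, $\tilde{\tau}<1$ requires $e(D)<\lambda_{1}-L_{H}\delta$ , so shrinking the neighbourhood radius $\delta$ relaxes the accuracy demanded of the subsampled Hessian, but the error must in any case stay below $\lambda_{1}$ .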

Item $ii)$ above may require a rather large value $D_{k+1}=D$ , which is adverse for practical computation. A more stringent condition on $D_{k+1}$ of the form $e(D_{k+1})=O(\|\nabla f_{N}(x_{k})\|)$ yields quadratic convergence, but again such $D_{k+1}$ might be very close to $N$ . We next investigate the more realistic situation where the Hessian accuracy requirement in (41) is guaranteed only with high probability and provide a linear convergence result in expectation.

Let us now suppose that, given an accuracy requirement $\chi_{H}>0$ , the probability of $\|\nabla^{2}f_{N}(x_{k})-\nabla^{2}f_{D_{k+1}}(x_{k})\|$ being smaller than $\chi_{H}$ is larger than $1-p_{H}$ :

[TABLE]

for $p_{H}\in(0,1)$ . If the subsample $I_{D_{k+1}}$ is chosen randomly and uniformly, then the lower bound on the sample size ensuring (45) takes the form

[TABLE]

The above bound is derived in [5, Lemma 3.1] and a similar bound is given in [3, Lemma 4].

We now provide a linear convergence result in expectation; the step $p_{k}$ taken is the global minimizer of (6), i.e.,

[TABLE]

for some $\nu_{k}\geq 0$ , see [16, Theorem 7.2.1].

Theorem 4.2

Suppose that Assumptions 2.1, 2.2, 4.1, 4.2 hold. Let $\{x_{k}\}$ be generated by Algorithm iretr invoked with $\varepsilon_{g}=0$ in (10), $B_{k+1}$ as in (33) and $p_{k}$ being the global minimizer of (6). If (45) holds and there exists a $\nu^{*}\in(0,1)$ such that for all $k$

[TABLE]

then there exist $\delta$ , $\chi_{H}$ , $p_{H}$ sufficiently small such that

[TABLE]

for all $k$ large enough and some $\bar{\tau}\in(0,1).$

Proof. Take $\delta\in(0,1)$ , $\chi_{H}>0$ , $p_{H}\in(0,1)$ such that

[TABLE]

for some $\bar{\tau}\in(0,1).$ Let $k$ be large enough such that $x_{k}\in{\cal B}_{\delta}(x^{*})$ .

Denote by $A_{k}$ the event

[TABLE]

Then $P(A_{k})\geq 1-p_{H}$ and $P(\bar{A}_{k})<p_{H},$ where $\bar{A}_{k}$ denotes the event that $A_{k}$ does not occur. If $A_{k}$ happens, then using multivariate calculus results [18, Lemma 4.1.12], Assumption 4.1, (47) and (49)

[TABLE]

Otherwise, if $\bar{A}_{k}$ is realized then by (42) we have

[TABLE]

Therefore,

[TABLE]

where we have used (50) and $P(A_{k})\leq 1$ . $\Box$

5 Worst-case iteration and evaluation complexity to first-order critical points

In this section we provide an upper bound on the number of iterations and function-evaluations needed to find an $\varepsilon_{g}$ -accurate first-order optimality point (10). The number of function-evaluations is intended as the number of evaluations of functions of the form $f_{M}$ , for some $M\leq N$ . We recall that a standard trust-region approach shows ${\cal{O}}(\varepsilon_{g}^{-2})$ worst-case iteration and full function complexity for first-order optimality [22].

Recalling that $h(N_{k})-h(\widetilde{N}_{k+1})=0$ is equivalent to $N_{k}=\widetilde{N}_{k+1}=N$ , consider the following partition of iteration indices $k$ :

• ${\cal I}_{1}=\{k\geq 0\mbox{ s.t. }h(N_{k})-h(\widetilde{N}_{k+1})>0\}$ ,

• ${\cal I}_{2}=\{k\geq 0\mbox{ s.t. }h(N_{k})=h(\widetilde{N}_{k+1})=0,N_{k+1}=N\mbox{ and }\|\nabla f_{N}(x_{k})\|>\varepsilon_{g}\}$ ,

• ${\cal I}_{3}=\{k\geq 0\mbox{ s.t. }h(N_{k})=h(\widetilde{N}_{k+1})=0,\,N_{k+1}<N\mbox{ and }\|\nabla f_{N_{k+1}}(x_{k})\|>\varepsilon_{g}\}$ .

The value of $N_{k+1}$ may change within iteration $k$ before acceptance of the iterate; above $N_{k+1}$ is the value at the end of iteration $k$ , i.e., the value used for building the accepted iterate $x_{k+1}$ .

Our analysis is carried out fixing $\gamma=1$ in Algorithm iretr and the first result provides a lower bound on the trust-region radius at termination of iteration $k$ .

Lemma 5.1

Let Assumptions 2.1–2.4 hold. Suppose furthermore $\gamma=1$ in Algorithm iretr. Then,

i) for any $k\in{\cal I}_{1}$

[TABLE]

ii) for any $k\in{\cal I}_{2}\cup{\cal I}_{3}$ ,

[TABLE]

for some positive $\Gamma$ and $\mu$ as in the Algorithm.

Proof. The initial $\Delta_{k}$ may be reduced in Steps 3 and 5 of the Algorithm. Step 3 is performed only if $k\in{\cal I}_{3}$ .

Let us consider case $i)$ . Since $\gamma=1$ equation (26) becomes

[TABLE]

From (25), inequality (16) is satisfied whenever

[TABLE]

Thus, using (11), if

[TABLE]

then (16) holds and the claim $i)$ follows from the rule for decreasing $\Delta_{k}$ in Step 5 of Algorithm iretr.

Let us consider case $ii)$ . Concerning Step 3, it is performed as long as $N_{k+1}<N$ . Then, (12) ensures that at termination of the loop in Steps 2–3

[TABLE]

Concerning Step 5, first suppose $k\in{\cal I}_{2}$ and $\Delta_{k}\leq\varepsilon_{g}/\beta$ with $\beta$ as in (24). Using (27) we can conclude that if

[TABLE]

then (16) is satisfied.

Suppose now $k\in{\cal I}_{3}$ and $\Delta_{k}\leq\varepsilon_{g}/\beta$ . Using $\gamma=1$ , equation (29) becomes

[TABLE]

and if

[TABLE]

then (16) is satisfied.

The upper bound on $\Delta_{k}$ for $k\in{\cal I}_{3}$ is sharper than the one obtained for $k\in{\cal I}_{2}$ . Then, due to the rule used to decrease $\Delta_{k}$ in Step 5, we can conclude that, at iteration $k\in{\cal I}_{2}\cup{\cal I}_{3}$ , condition (16) is satisfied if

[TABLE]

and the claim follows. $\Box$

Theorem 5.2

Let Assumptions 2.1–2.4 hold. Suppose furthermore $\gamma=1$ in Algorithm iretr and let $f_{low}$ be a lower bound of $f_{N}$ in $\Omega$ . Then,

i) the cardinality $|{\cal{I}}_{1}|$ satisfies

[TABLE]

with $\nu_{1}=\frac{\xi-\underline{\theta}f_{low}}{\eta^{2}(1-r)}$ , $\xi$ as in (32), $\underline{\theta}$ as in Lemma 2.2, $\eta$ and $r$ as in the Algorithm iretr.

ii) the cardinality $|{\cal{I}}_{2}|+|{\cal{I}}_{3}|$ satisfies

[TABLE]

with positive $\nu_{2}=\frac{2}{\eta\Gamma}\left(f_{N_{0}}(x_{0})-f_{low}+(\sigma\eta+1-\underline{\theta})\frac{\xi-\underline{\theta}f_{low}}{\eta^{2}(1-r)}\right)$ , $\nu_{3}=\nu_{2}\Gamma\sqrt{\mu}$ .

Proof. Let us denote by $\bar{k}$ the index of the last iteration of Algorithm iretr and note that $N_{\bar{k}}=N$ by definition of the algorithm. From (31) it follows

[TABLE]

and consequently (30) yields

[TABLE]

Then the number of indices $k$ such that $h(N_{k})>\underline{h}$ is bounded above by

[TABLE]

and $i)$ follows.

Let us consider the case $k\in{\cal I}_{2}\cup{\cal I}_{3}$ . Note that by (18), (16), (17), (21) and (13), we have

[TABLE]

Then, by using (53) and (24) it follows

[TABLE]

Moreover, note that due to the definition of $Ared_{k}(\theta_{k+1})$ and inequalities (19) and (16), the following inequality holds at termination of each iteration $k\geq 0$ :

[TABLE]

Then, since $\frac{Ared_{k}(\theta_{k+1})}{\theta_{k+1}}$ is positive,

[TABLE]

and this implies

[TABLE]

This implies

[TABLE]

Then, (58), (55), (56) and $h(N_{\bar{k}})=0$ yield

[TABLE]

and claim $ii)$ follows. $\Box$

Considering that $\varepsilon_{g}$ is an optimality measure and $\underline{h}$ is expected to be small, it is reasonable to suppose that

[TABLE]

Under this condition, Theorem 5.2 gives the iteration complexity

[TABLE]

As a consequence, for suitable values of $\underline{h}$ , the worst-case iteration complexity ${\cal{O}}(\varepsilon_{g}^{-2})$ of the standard trust-region method is retained, despite inaccuracy in functions and gradients. This result is stated below, where we count the number of iterations needed to satisfy $\|\nabla f_{N}(x_{k})\|\leq\varepsilon_{g}$ or $\|\nabla f_{N_{k+1}}(x_{k})\|\leq\varepsilon_{g}$ and $N_{k}=N$ , i.e., iterations in ${\cal I}_{1}\cup{\cal I}_{2}\cup{\cal I}_{3}$ and iteration $\bar{k}$ .

Corollary 5.3

Let Assumptions 2.1–2.4 hold. Assume furthermore $\gamma=1$ in Algorithm iretr. Then, there exists a constant $\nu_{4}>0$ such that Algorithm iretr needs at most

[TABLE]

iterations, provided that $\underline{h}^{-1}={\cal{O}}(\varepsilon_{g}^{-2})$ and (59) holds.

In case $h(M)=(N-M)/N$ , it holds $\underline{h}=1/N$ and $\underline{h}^{-1}={\cal{O}}(\varepsilon_{g}^{-2})$ implies $N={\cal{O}}(\varepsilon_{g}^{-2})$ . In case $N$ is larger, the number of iterations taken before full-accuracy is reached may deteriorate the complexity of the standard trust-region approach.
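
The remark on $h(M)=(N-M)/N$ can be checked directly. The sketch below evaluates this infeasibility measure in exact arithmetic and confirms that its smallest positive value is $1/N$ ; the value $N=5000$ is the Mushrooms training set size from Table 1, and the helper name `h` with two arguments is an illustrative choice.

```python
from fractions import Fraction

def h(M, N):
    """Infeasibility measure h(M) = (N - M) / N; it vanishes only at M = N."""
    return Fraction(N - M, N)

N = 5000                                      # Mushrooms training set size (Table 1)
h_low = min(h(M, N) for M in range(1, N))     # smallest positive value of h
print(h_low, h(N, N))
```

Since the smallest positive value is attained at $M=N-1$ , any sample size with $h(N_{k})<h(N-1)$ must be the full sample, which is the mechanism used in Corollary 2.5.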

In order to derive the worst-case function evaluation complexity we need to bound the total number of trust-region reductions as each trust-region reduction calls for one (possibly subsampled) function evaluation at trial point $x_{k}+p_{k}$ .

Theorem 5.4

Let Assumptions 2.1–2.4 hold. Assume furthermore $\gamma=1$ in Algorithm iretr and let ${\cal T}_{j}$ be the number of trust-region reductions at a generic iteration $j$ of the algorithm. Then, for any $k\geq 1$ ,

[TABLE]

where

[TABLE]

Proof. Let us proceed by induction. By the updating rules of the trust-region radius in Step 5 of Algorithm iretr, at termination of the iteration $j=0$ we have

[TABLE]

Then, assume that at iteration $k\geq 1$

[TABLE]

with $w_{k}=\sum_{j=0}^{k-1}{{\cal T}_{j}}$ . At the end of iteration $k$ , after ${\cal T}_{k}$ reductions of the trust-region radius we have

[TABLE]

and consequently,

[TABLE]

i.e., (60) holds for any $k\geq 1$ . Taking into account that Lemma 5.1 ensures that iteration $k$ terminates with $\Delta_{k}\geq\underline{\Delta}$ , in the adverse case where the initial $\Delta_{k}$ is given by $\zeta_{2}^{k}\zeta_{1}^{w_{k}}\Delta_{0}$ (see (60)), at termination of iteration $k$ we are ensured that

[TABLE]

This yields the thesis, taking into account that $\zeta_{1}<1$ . $\Box$

Using the previous results we can now state our function evaluation complexity result.

Corollary 5.5

Let Assumptions 2.1–2.4 hold. Assume furthermore $\gamma=1$ in Algorithm iretr. Then, if $\underline{h}^{-1}={\cal{O}}(\varepsilon_{g}^{-2})$ and $\Delta_{0}$ satisfies (59) and is independent of $\varepsilon_{g}$ , there exists a constant $\nu_{5}$ such that Algorithm iretr needs at most

[TABLE]

function evaluations, where $\nu_{4}$ is given in Corollary 5.3.

Proof. Assumption $\underline{h}^{-1}={\cal{O}}(\varepsilon_{g}^{-2})$ , (59) and $\Delta_{0}$ independent of $\varepsilon_{g}$ ensure $\underline{\Delta}=\nu_{5}\varepsilon_{g}$ , for some positive $\nu_{5}$ . Then Corollary 5.3 and Theorem 5.4 yield the thesis. $\Box$
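
The counting argument behind Theorem 5.4 can be replayed numerically. The sketch below is not the bound (61) itself: it assumes only what the proof states, namely that each reduction multiplies the radius by $\zeta_{1}<1$ , each accepted step enlarges it by at most $\zeta_{2}\geq 1$ , and every iteration ends with a radius no smaller than $\underline{\Delta}$ . The values $\zeta_{1}=0.5$ , $\zeta_{2}=2$ , $\Delta_{0}=1$ , $\underline{\Delta}=10^{-3}$ and the random acceptance pattern are hypothetical.

```python
import math
import random

def reduction_bound(k, delta0, delta_low, zeta1, zeta2):
    """Largest W compatible with zeta2**k * zeta1**W * delta0 >= delta_low."""
    return (k * math.log(zeta2) + math.log(delta0 / delta_low)) / math.log(1.0 / zeta1)

def simulate(num_iters, delta0, delta_low, zeta1, zeta2, rng):
    """Replay radius updates and return the total number of reductions."""
    delta, total = delta0, 0
    for _ in range(num_iters):
        # reduce a random number of times, never dropping below delta_low
        while delta * zeta1 >= delta_low and rng.random() < 0.5:
            delta *= zeta1
            total += 1
        if rng.random() < 0.5:
            delta *= zeta2              # enlarge the radius after the accepted step
    return total

rng = random.Random(0)
K = 50                                  # hypothetical number of iterations
total = simulate(K, delta0=1.0, delta_low=1e-3, zeta1=0.5, zeta2=2.0, rng=rng)
bound = reduction_bound(K - 1, 1.0, 1e-3, 0.5, 2.0)
print(total, bound)
```

The total number of reductions grows at most linearly with the number of iterations plus a logarithmic term in $\Delta_{0}/\underline{\Delta}$ , which is why the function evaluation complexity inherits the order of the iteration complexity.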

6 Numerical experiments

In this section we report on our numerical experience with Algorithm iretr employing the second order model (5) and $D_{k+1}$ equal to a fixed fraction of $N_{k+1}$ . Our aim is to show that our adaptive and deterministic strategy for choosing the sample size $N_{k}$ and the use of subsampled functions, gradients and Hessians is effective and provides a gain in the overall computational cost with respect to a standard trust-region approach. To this end, we compare our method with “standard” trust-region implementations, i.e. implementations where functions and gradients are computed at full accuracy too. Specifically, we compare with the implementation, named statr_sh, employing full functions and gradients and subsampled Hessian $B_{k}$ as in (33) with $D_{k+1}=\left\lceil 0.1N\right\rceil$ , and with the implementation, named statr_fh, where functions, first and second order derivatives are computed at full accuracy.

All the results have been obtained running a Matlab R2019b code on an Intel Core i5-6600K CPU 3.50 GHz x 4, 16.0GB RAM.

6.1 Test problems

We tested our method on both convex and nonconvex problems arising in binary classification. Let $\{(a_{i},b_{i})\}_{i=1}^{N}$ denote the pairs forming the data set, with $a_{i}\in\mathbb{R}^{n}$ being the vector containing the entries of the $i$ -th example and $b_{i}$ being its label. The data sets we employed are listed in Table 1. For each data set, the table reports the number $N$ of training examples, the dimension $n$ of each instance and the number $N_{T}$ of elements in the testing set.

We performed a logistic regression to solve classification problems associated to the data sets Mushrooms, Cina0 and Gisette. In this case $b_{i}\in\{-1,+1\}$ and the strongly convex objective function is given by the logistic loss with $\ell_{2}$ -regularization

[TABLE]

Classification problems associated with the remaining data sets were solved using the sigmoid function and least-squares loss. Here $b_{i}\in\{0,+1\}$ and the non-convex objective function has the form

[TABLE]
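
The two objectives can be sketched as follows. This is a minimal illustration under stated assumptions: the regularization is written as `lam * ||x||^2` with a hypothetical parameter `lam`, the function names and the random data are introduced here only for the example, and `idx` selects the subsample over which the loss is averaged.

```python
import numpy as np

def logistic_loss(x, A, b, idx, lam):
    """Subsampled logistic loss with l2 regularization; labels b_i in {-1, +1}."""
    z = -b[idx] * (A[idx] @ x)
    return np.mean(np.logaddexp(0.0, z)) + lam * (x @ x)   # stable log(1 + exp(z))

def sigmoid_ls_loss(x, A, b, idx):
    """Subsampled least-squares loss on sigmoid outputs; labels b_i in {0, 1}."""
    s = 1.0 / (1.0 + np.exp(-(A[idx] @ x)))
    return np.mean((b[idx] - s) ** 2)

rng = np.random.default_rng(0)
N, n = 100, 4                                   # hypothetical data set sizes
A = rng.standard_normal((N, n))
b_pm = rng.choice([-1.0, 1.0], size=N)          # labels for the logistic loss
b_01 = (b_pm + 1.0) / 2.0                       # labels for the sigmoid loss
idx = rng.choice(N, size=30, replace=False)     # subsample of size 30
x0 = np.zeros(n)

print(logistic_loss(x0, A, b_pm, idx, lam=1e-2))   # equals log(2) at x = 0
print(sigmoid_ls_loss(x0, A, b_01, idx))           # equals 0.25 at x = 0
```

At $x=0$ both losses are independent of the data and of the subsample, which gives a convenient sanity check on any implementation.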

6.2 Implementation issues

The trust-region parameters of the procedures under comparison are fixed as

[TABLE]

The trust-region problem is solved approximately using the CG-Steihaug method [16]. The Conjugate Gradient (CG) method is applied without preconditioning and the procedure is stopped when the relative residual becomes smaller than $10^{-3}$ or a maximum of $100$ iterations is performed. In Step 5, in case of successful iterations, we update the trust-region radius as follows. If ${\rm{Ared}}_{k}(\theta_{k+1})/{\rm{Pred}}_{k}(\theta_{k+1})\geq 1.1$ we set $\Delta_{k+1}^{(0)}=\zeta_{2}\Delta_{k}^{({\cal T}_{k})}$ , otherwise we set $\Delta_{k+1}^{(0)}=\Delta_{k}^{({\cal T}_{k})}$ .
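
For readers unfamiliar with the CG-Steihaug method, the sketch below shows the textbook truncated conjugate gradient iteration with the stopping parameters quoted above (relative residual $10^{-3}$ , at most $100$ iterations). It is an illustration rather than the authors' Matlab code; the function names and the small diagonal test matrix are assumptions.

```python
import numpy as np

def _to_boundary(p, d, delta):
    """Positive root tau of ||p + tau * d|| = delta, assuming ||p|| < delta."""
    a = d @ d
    b = 2.0 * (p @ d)
    c = p @ p - delta ** 2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def cg_steihaug(hess_vec, g, delta, tol=1e-3, max_iter=100):
    """Approximately minimize g^T p + 0.5 p^T B p subject to ||p|| <= delta."""
    p = np.zeros_like(g)
    g_norm = np.linalg.norm(g)
    if g_norm == 0.0:
        return p
    r = g.copy()                       # residual B p + g at p = 0
    d = -r
    for _ in range(max_iter):
        Bd = hess_vec(d)
        curv = d @ Bd
        if curv <= 0.0:                # nonpositive curvature: go to the boundary
            return p + _to_boundary(p, d, delta) * d
        alpha = (r @ r) / curv
        if np.linalg.norm(p + alpha * d) >= delta:   # step leaves the region
            return p + _to_boundary(p, d, delta) * d
        p = p + alpha * d
        r_new = r + alpha * Bd
        if np.linalg.norm(r_new) <= tol * g_norm:    # relative residual test
            return p
        d = -r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p

B = np.diag([1.0, 2.0])                # hypothetical positive definite Hessian
g = np.array([1.0, 1.0])
p_interior = cg_steihaug(lambda v: B @ v, g, delta=10.0)   # Newton step fits inside
p_boundary = cg_steihaug(lambda v: B @ v, g, delta=0.5)    # step is cut at the boundary
print(p_interior, p_boundary)
```

Only Hessian-vector products are required, so the subsampled Hessian never has to be formed explicitly.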

Focusing on Algorithm iretr, we tested two rules for choosing the sample size. In the first implementation, later referred to as iretr_d, the sample size varies dynamically. The infeasibility measure $h$ and the initialization parameters for inexact restoration are:

[TABLE]

The parameters $\gamma=1,\,\mu=100/N$ are used in (12). The updating rules for choosing $\widetilde{N}_{k+1}$ , $N_{k+1}$ in Steps 1 and 2 are the following:

[TABLE]

We note that the choice of $\widetilde{N}_{k+1}$ falls into (11) with $r=(N-0.2)/N$ .

In the second implementation, we set again

[TABLE]

Then, the sample size $N_{k+1}$ is increased according to the geometric growth:

[TABLE]

We will refer to this implementation as iretr_gg. We note that this choice of $N_{k+1}$ amounts to choosing $\mu=0$ in (12).
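
A geometric schedule of the form $N_{k+1}=\left\lceil(1.2)^{k}N_{0}\right\rceil$ , capped at $N$ , can be generated as in the sketch below. The starting size $N_{0}=50$ is a hypothetical value chosen for illustration, and $N=5000$ is the Mushrooms training set size from Table 1.

```python
import math

def geometric_sample_sizes(N0, N, growth=1.2):
    """Sample sizes ceil(growth**k * N0), capped at N, until N is reached."""
    sizes = []
    k = 0
    while True:
        size = min(N, math.ceil(growth ** k * N0))
        sizes.append(size)
        if size == N:
            return sizes
        k += 1

sizes = geometric_sample_sizes(N0=50, N=5000)   # N0 = 50 is an assumed starting size
print(len(sizes), sizes[:5], sizes[-1])
```

With a growth factor of $1.2$ the full sample is reached after a number of iterations that is logarithmic in $N/N_{0}$ , regardless of the progress made by the iterates.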

In both implementations iretr_d and iretr_gg, the first time that $N_{k}=N_{k+1}=N$ occurs the trust-region radius is reset to $\Delta_{k}^{({\cal T}_{k})}=\max\{1,\,\Delta_{k}^{({\cal T}_{k})}\}$ . Moreover, the Hessian matrix $B_{k}$ is formed via (33) with

[TABLE]

Thus, the Hessian sample size changes dynamically until the full sample for function and gradient is reached. The sets $I_{N_{k+1}}$ and $I_{D_{k+1}}$ are generated using the Matlab function randsample with no replacement. When the sample size $N_{k+1}$ is increased, the new sample set can be computed from scratch or can be obtained by randomly adding new samples to the previous sample set. Although the latter choice produces computational savings, in view of a truly random process we generate each $I_{N_{k+1}}$ from scratch.
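
The sampling step can be mimicked in Python as follows. This sketch uses the standard-library `random.sample` as a stand-in for sampling without replacement; the Hessian fraction `0.1` and the sizes are assumptions, and indices run from $0$ to $N-1$ rather than from $1$ to $N$ .

```python
import math
import random

def draw_samples(N, N_next, hess_fraction, rng):
    """Draw I_N (size N_next) from {0, ..., N-1} and a Hessian subset I_D of I_N."""
    I_N = rng.sample(range(N), N_next)                 # fresh draw, no replacement
    D_next = max(1, math.ceil(hess_fraction * N_next))
    I_D = rng.sample(I_N, D_next)                      # subset of I_N by construction
    return I_N, I_D

rng = random.Random(0)
I_N, I_D = draw_samples(N=5000, N_next=200, hess_fraction=0.1, rng=rng)
print(len(I_N), len(I_D))
```

Drawing $I_{D_{k+1}}$ from $I_{N_{k+1}}$ lets the products $a_{i}^{T}x$ already computed for the function and gradient be reused for the Hessian.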

Concerning the stopping criteria, for all the algorithms under comparison, we imposed a maximum of $1000$ iterations and we declared a successful termination when one of the two following conditions is met

[TABLE]

with $\varphi=10^{-4}$ . We underline that for iretr_d and iretr_gg the above checks are on possibly subsampled functions and gradients and we allow for termination before full precision is reached.

The initial guess is $x_{0}=(0,\ldots,0)^{T}$ for all runs.

6.3 Numerical results

The first set of results presented shows the performance of Algorithms iretr_d, iretr_gg, statr_sh and statr_fh. In our test problems, the main cost in the computation of $\phi_{i}$ for any $1\leq i\leq N$ is the scalar product $a_{i}^{T}x$ . Once this product is evaluated, it can be reused for computing $\nabla\phi_{i}$ and $\nabla^{2}\phi_{i}$ . In particular, computing $\nabla^{2}\phi_{i}$ times a vector $v$ at each CG iteration requires a scalar product $a_{i}^{T}v$ i.e., it is as expensive as evaluating $\phi_{i}$ . Therefore, if one full function evaluation is denoted as nfe, computing $f_{M}$ costs $\displaystyle\frac{M}{N}$ nfe while each CG iteration costs $\displaystyle\frac{D_{k+1}}{N}$ nfe. Since the selection of sets $I_{N_{k+1}}$ and $I_{D_{k+1}}$ in Algorithms iretr_d, iretr_gg and statr_sh is random, the cost associated to such algorithms is measured on average over 50 runs.
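
The cost model described above can be written down in a few lines. In this sketch one evaluation of $f_{M}$ costs $M/N$ nfe and one CG iteration with a Hessian sample of size $D$ costs $D/N$ nfe; the function name and the sample sizes in the example are hypothetical.

```python
def cost_in_nfe(function_sample_sizes, cg_iterations, hessian_sample_sizes, N):
    """Total cost in units of one full function evaluation (nfe)."""
    function_cost = sum(M / N for M in function_sample_sizes)
    cg_cost = sum(it * D / N for it, D in zip(cg_iterations, hessian_sample_sizes))
    return function_cost + cg_cost

# Three function evaluations and two trust-region solves on a set with N = 5000.
total = cost_in_nfe(
    function_sample_sizes=[500, 1000, 5000],
    cg_iterations=[10, 20],
    hessian_sample_sizes=[50, 500],
    N=5000,
)
print(total)
```

Here the function evaluations contribute $0.1+0.2+1=1.3$ nfe and the CG iterations $10\cdot 0.01+20\cdot 0.1=2.1$ nfe, which shows how quickly inner iterations dominate once the Hessian sample grows.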

In Table 2 for each method and for each data set we report the number nfe of full function evaluations performed and the percentage of saving obtained by Algorithm iretr_d with respect to iretr_gg, statr_sh and to statr_fh. First, we can observe that Algorithm iretr_d is in general less costly than the variant iretr_gg; this indicates that the dynamic choice of the sample size, aiming to make slow progress to full precision, is effective and does not deteriorate the performance of iretr when the geometrical growth of the sample size is the most effective (see the results for Htru2). Second, we observe a remarkable saving of both iretr_d and iretr_gg with respect to the full standard trust-region for all the data sets used; compared to statr_sh the saving is lower, as expected, but still considerable overall.

To give more insight into the behaviour of iretr_d, in Figures 1 and 2 we plot the sample size $N_{k}$ versus the iterations for the Mushrooms and A9a problems. The dashed line plots $N_{k+1}=\left\lceil(1.2)^{k}N_{0}\right\rceil$ versus iterations, that is, the sample size corresponding to the geometric growth used in iretr_gg. The increase of $N_{k}$ along iterations in iretr_d is considerably slower than that provided by the geometric growth; in two runs, the cardinality $N_{k}$ in iretr_d reaches the value $N$ , as expected from the theory, but in the first phase of the iterative process it is a small fraction of $N$ and decreases at some iterations. In the other two runs, iretr_d does not reach full precision: iterations terminate with a cardinality $N_{k+1}=2780$ , corresponding to 56% of the training set, and $N_{k+1}=16495$ , corresponding to 72% of the training set, respectively. In fact, although the adaptive strategy of iretr yields $N_{k}=N$ for $k$ sufficiently large, our stopping rule (62) is applied to possibly subsampled functions and gradients. This feature is in accordance with the motivations for using subsampling: data in a training set show redundancy and in general using subsets of the sample data is enough to provide a small testing error.

In this regard, consider Figure 3, related to the data set Mushrooms, $N=5000$ . At each iteration and for three runs corresponding to different sample sizes at termination, we plot the training loss $f_{N_{k}}(x_{k})$ versus the value of $N_{k}$ ; at termination, $N_{k}=1941$ (dashed line), $N_{k}=4241$ (dash-dotted line), $N_{k}=N$ (solid line). We also display the testing loss $f_{N_{T}}$ at termination. Although in two runs the final sample size is approximately 39% and 85% of the data in the training set, interestingly the testing loss is between $1\cdot 10^{-1}$ and $3\cdot 10^{-1}$ in all runs. Thus, monitoring the values of subsampled functions and gradients in (62) is effective.

The previous discussion is supported by further observations. In Figure 4, we plot the value of the training loss versus the number of function evaluations required to solve the Mushrooms and Htru2 problems with iretr_d, statr_sh and statr_fh. In these runs, iretr_d terminates with $N_{k}=N$ in the Mushrooms problem, while it terminates with $N_{k}=7426$ (74% of the samples) in the Htru2 problem. At termination, the values of both the training loss and the testing loss provided by the three methods are similar, and this feature further supports both termination before full precision is reached and the inexact restoration approach for handling subsampled functions and derivatives.

Finally, Figure 5 refers to the dataset Cina0 and displays the values of the training and testing logistic loss along the iterations of iretr_d using the tolerance $\varphi=10^{-8}$ in (62). In the progress of the iterations the loss values settle and performing the last thirteen iterations is pointless.

Acknowledgement Dedicated with friendship to José Mario Martínez for his outstanding scientific contributions.

Bibliography

[1] Bastin F., Cirillo C., Toint P.L., An adaptive Monte Carlo algorithm for computing mixed logit estimators, Computational Management Science 3(1), 55-79, 2006.
[2] Bastin F., Cirillo C., Toint P.L., Convergence theory for nonconvex stochastic programming with an application to mixed logit, Mathematical Programming, 108, 207-234, 2006.
[3] Bellavia, S., Gurioli, G., Morini, B., Adaptive cubic regularization methods with dynamic inexact Hessian information and applications to finite-sum minimization, IMA J. Numerical Analysis, 2020, drz076, https://doi.org/10.1093/imanum/drz076
[4] Bellavia, S., Gurioli, G., Morini, B., Toint, Ph.L., Adaptive regularization algorithms with inexact evaluations for nonconvex optimization, SIAM Journal on Optimization, 29(4), pp. 2881–2915, 2019.
[5] Bellavia, S., Krejić, N., Krklec Jerinkić, N., Subsampled Inexact Newton methods for minimizing large sums of convex function, IMA Journal of Numerical Analysis, 2019, https://doi.org/10.1093/imanum/drz027
[6] Berahas A. S., Bollapragada R., Nocedal J., An Investigation of Newton-Sketch and Subsampled Newton Methods, Optimization Methods and Software, 2020, https://doi.org/10.1080/10556788.2020.1725751
[7] Birgin, G.E., Krejić, N., Martínez, J.M., On the employment of Inexact Restoration for the minimization of functions whose evaluation is subject to programming errors, Mathematics of Computation 87(311), 1307-1326, 2018.
[8] Birgin, G.E., Krejić, N., Martínez, J.M., Iteration and evaluation complexity on the minimization of functions whose computation is intrinsically inexact, Mathematics of Computation, 89, 253-278, 2020.