Inexact restoration with subsampled trust-region methods for finite-sum minimization
Stefania Bellavia, Natasa Krejic, Benedetta Morini

TL;DR
This paper introduces a new trust-region method for finite-sum minimization that uses deterministic subsampling for approximating functions, gradients, and Hessians, improving efficiency over standard methods.
Contribution
It proposes a novel inexact restoration-based trust-region approach with deterministic sample size control for better computational efficiency.
Findings
More efficient than standard trust-region methods with subsampled Hessians.
Provides local and global convergence properties for approximate optimal points.
Achieves favorable function evaluation complexity results.
Abstract
Convex and nonconvex finite-sum minimization arises in many scientific computing and machine learning applications. Recently, first-order and second-order methods where objective functions, gradients and Hessians are approximated by randomly sampling components of the sum have received great attention. We propose a new trust-region method which employs suitable approximations of the objective function, gradient and Hessian built via random subsampling techniques. The choice of the sample size is deterministic and ruled by the inexact restoration approach. We discuss local and global properties for finding approximate first- and second-order optimal points and function evaluation complexity results. Numerical experience shows that the new procedure is more efficient, in terms of overall computational cost, than the standard trust-region scheme with subsampled Hessians.
| Data set | nfe | nfe(save) | ||
|---|---|---|---|---|
| iretr_d | iretr_gg | statr_sh | statr_fh | |
| Mushrooms | 27 | 30 (10%) | 51 (47%) | 108 (75%) |
| Cina0 | 88 | 99 (11%) | 96 (08%) | 416 (78%) |
| Gisette | 346 | 362 (04%) | 432 (20%) | 594 (42%) |
| A9a | 22 | 25 (12%) | 45 (51%) | 445 (95%) |
| Covertype | 17 | 23 (26%) | 48 (65%) | 698 (98%) |
| Ijcnn1 | 20 | 25 (20%) | 36 (44%) | 128 (84%) |
| Mnist | 46 | 50 (08%) | 58 (20%) | 955 (95%) |
| Htru2 | 38 | 37 ( -3%) | 43 (12%) | 87 (56%) |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Inexact restoration with subsampled trust-region methods for finite-sum minimization333The work of Bellavia and Morini was supported by Gruppo Nazionale per
il Calcolo Scientifico (GNCS-INdAM) of Italy. The work of the second author was supported by Serbian Ministry of Education, Science and Technological Development, grant no. 174030. Part of the research was conducted during a visit of the second author at Dipartimento di Ingegneria Industriale supported by Piano di Internazionalizzazione, Università degli Studi di Firenze.
Stefania Bellavia111Dipartimento di Ingegneria Industriale, Università degli Studi di Firenze, Viale G.B. Morgagni 40, 50134 Firenze, Italia. Members of the INdAM Research Group GNCS. Emails: [email protected], [email protected], Nataa Krejić222 Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, Trg Dositeja Obradovića 4, 21000 Novi Sad, Serbia, Email: [email protected]. , Benedetta Morini11footnotemark: 1
Abstract
Convex and nonconvex finite-sum minimization arises in many scientific computing and machine learning applications. Recently, first-order and second-order methods where objective functions, gradients and Hessians are approximated by randomly sampling components of the sum have received great attention.
We propose a new trust-region method which employs suitable approximations of the objective function, gradient and Hessian built via random subsampling techniques. The choice of the sample size is deterministic and ruled by the inexact restoration approach. We discuss local and global properties for finding approximate first- and second-order optimal points and function evaluation complexity results. Numerical experience shows that the new procedure is more efficient, in terms of overall computational cost, than the standard trust-region scheme with subsampled Hessians.
**Keywords: inexact restoration, trust-region methods, subsampling, local and global convergence, worst-case evaluation complexity. **
1 Introduction
The problem we consider in this paper is the following
[TABLE]
where is very large and finite and . A number of important problems can be stated in this form, to start with problems in machine learning like classification problems, data fitting problems, sample average approximation of the objective function given in the form of mathematical expectation and so on.
The practical relevance of (1) resulted in a number of methods that are adjusted to this particular form of the objective function. In fact, for very large the cost of evaluating might be really high and the same is true for the gradient and even more for the Hessian evaluation. Therefore a number of methods that use approximate objective functions and/or first and second order derivatives, formed by partial sums, is proposed and analysed in literature, see e.g., [3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 19, 20, 25, 36, 37, 38].
Concerning the approximation of the objective function, one of the possible approaches is to use relatively rough approximations at early stages of the optimization procedure and gradually increase the accuracy to arrive at full precision at the late stage of the iterative procedure; the gradient is approximated accordingly. This way one hopes to save computational effort and yet to solve the original problem eventually. Very often the term scheduling is used to describe the approximation of the objective function by means of a partial sum. There is a number of algorithms proposed for the scheduling problem, ranging from simple heuristics that increase the number of terms in the partial sum that approximates the objective function by a certain percentage in each iteration, [5, 10, 20, 36] to more elaborate schemes that connect the progress achieved during the optimization procedure to the number of terms in the partial sum [1, 2, 3, 4, 5, 7, 8, 9, 13, 17, 27, 28, 29, 33, 35].
Besides the problem of scheduling, one has to decide between first- and second-order optimization method to be employed. A detailed survey is presented in [11]. A number of first-order methods has been proposed and analysed in the literature. Given that the main cost comes from large one might be tempted to conclude that computing Hessians, or some other second order information might be prohibitively costly and thus opt for a first order method, especially if the problem (1) should be solved with limited precision. However, recently there has been reported in several papers that careful adjustment and implementation of second order methods might be worth considering if the true Hessian is approximated by a partial sum of Hessians consisting of a significantly smaller number of terms than . This way one can generate useful information with significantly smaller cost than the true Hessian and get enough advantage over first-order methods in terms of resilience to problem ill-conditioning and low sensitivity to parameter tuning, [6, 5, 10, 13, 12, 19, 34, 37, 38, 36].
The method we present here combines the Inexact Restoration (IR) framework with the trust-region optimization method [16] to simultaneously design the scheduling and the optimization procedure for solving (1) and represents a new approach for the problem under consideration.
The Inexact Restoration method, introduced in [31], is a constrained optimization tool particularly suitable for problems where one does not want to enforce feasibility in all iterations. The key idea of the IR approach is to treat feasibility and optimality in a modular way and to improve each one in separate procedures; the combination of feasibility and optimality is then monitored through a suitable merit function. Each iteration ensures the sufficient decrease of a suitable merit function and therefore, under certain assumption, convergence to a feasible optimal point. In [30, 31] the combination of the IR strategy with trust-region methods is proposed and analysed for general constrained problems.
The application of IR strategy to the unconstrained optimization problem (1) requires its reformulation as a constrained problem. Letting be an arbitrary nonempty subset of of cardinality equal to , we reformulate problem (1) as
[TABLE]
Evaluating infeasibility in (2) is cheap while computing the objective function is expensive whenever is large. Thus, using the reasoning from [30, 31] we define a new algorithm that exploits the structure of the problem considered and takes advantage of the modular structure of IR and the trust-region optimization method at the same time. Specifically, the trust-region mechanism is applied to model at each iteration and the IR framework is applied to test for the acceptance of the iterates and to determine the scheduling sequence, i.e. the value of through the iterations. The test acceptance of the new iterate allows us to deal with inaccuracy in function and derivatives. In particular, the number of terms in the partial sum is fixed at the beginning of each iteration in the restoration phase and possibly changed in the optimality phase where the trial iterate is computed.
Clearly, the higher feasibility is the more accurate is with respect to . The new procedure has two important properties: partial sums, possibly consisting of small sets of ’s, can be used in the early stage of the iterative procedure to decrease the computational cost; the original objective function in (1) is recovered for all iteration indices large enough, thus allowing for the solution of the given problem. Clearly, when full precision of the objective function and the gradient is reached, one can rely on the theory and machinery of standard trust-region methods [16].
The scheme presented here applies to both first- and second-order trust-region models. If a linear model is used, the resulting procedure is a subsampled gradient method with variable stepsize. When second-order models are used, the Hessian can be approximated using a subset of the sample used to approximate function and gradient. The error in such Hessian approximation plays an important role in the asymptotic convergence rate. In the case of strongly convex problems, the analysis for local linear convergence rate is presented, both in deterministic and probabilistic settings, and an adaptive choice of the sample for Hessian approximation is proposed.
We also provide a function evaluation complexity result which resembles the classical result for the trust-region methods for (1) and the results obtained in [8]. It is shown that at most evaluations of the possibly subsampled function , , and its derivatives are needed to compute a first-order approximate critical point. Then the worst-case complexity of the standard trust-region is recovered with expected significant computational savings due to scheduling.
Our approach considerably differs from the IR procedure and trust-region method in [30, 31] since the objective function in our formulation changes with through the iterations. It also differs from IR approaches in [29, 7, 8] that employ approximate objective function and its derivatives and have been successfully applied to constrained and unconstrained problems, including problem (1); in papers [29, 7] the IR is combined with a line search strategy, while in [8] the considered problem is constrained and regularization techniques are used in the optimization phase. The approach presented here relays on [8] in terms of general idea but the problem is more specific being a finite-sum rather than a general objective function computed approximately and being unconstrained. These specifications allow us to design an efficient sample update rule which is connected with the trust-region size.
The value of is fixed via a deterministic rule while the trust-region schemes in [25, 38, 9], approximating either functions, gradients and Hessians [25, 9] or Hessians only [38], are designed using sample sets whose cardinality is determined by high probability and nonasymptotic convergence analysis.
The nature of IR allows changes in the feasibility through iterations and the change is not necessarily monotone, i.e., the cardinality of the subset that defines the approximate objective can both increase and decrease, depending on the feedback from the trust-region progress made in each iteration. The case where is increased by a prefixed percentage at each iteration is a particular case of our strategy. In this latter case our method differs from a straightforward subsampled trust-region procedure with increasing sample size in both the merit function and the acceptance criterion. Remarkably, their employment allow to prove optimal complexity results that otherwise require adaptive accuracy requirements [9].
This paper is organized as follows. In Section 2 we present our method and prove that it is well defined. Furthermore, we prove that full accuracy is eventually reached and that the set of standard assumptions yield first-order stationary points. Some issues concerning the realization of the procedure are considered in Section 3; the scheduling rule is modified to avoid unproductive decrease in precision and a discussion on first and second order trust-region models is provided. Section 4 deals with strongly convex problems; we prove q-linear convergence as well as q-linear convergence in expectation under probabilistic bounds for Hessian subsampling. Section 5 provides worst-case function evaluation complexity. The numerical performance of the proposed method is tested on a set of classification problems and the results are reported in Section 6.
2 The Algorithm
Let be an arbitrary nonempty subset of of cardinality equal to
[TABLE]
and reformulate (1) as the constrained problem (2). We measure the level of infeasibility with respect to the constraint by the function with the following properties.
Assumption 2.1
Let be a monotone, strictly decreasing function such that , .
This assumption implies
[TABLE]
for and and . One possible choice for is .
Suppose , , be continuously differentiable and let denote the 2-norm.
The method introduced in this section combines the Inexact Restoration, an approach for optimization of functions evaluated inexactly, with the trust-region methods. We will refer to it as iretr. It employs the merit function
[TABLE]
with and aims to minimize both and the infeasibility . Since the reductions in the values of and may not be achieved simultaneously, a weight is used and a trust-region method is employed to generate a sequence such that . The main theoretical properties of the new method, shown in the next section, are: the sequence is nonicreasing and uniformly bounded away from zero, for all sufficiently large and as .
Concerning the trust-region problem, suppose that is given. Then, a trial sample size is selected, is chosen and the model for around of the form
[TABLE]
is built. Here denotes the gradient of and is a symmetric approximation to the Hessian in case , , are twice continuously differentiable. Trivially and the smaller , the larger becomes the accuracy in the approximation to and . Then, letting denote the trust-region radius and be the trust-region, the trust-region problem is
[TABLE]
As in the standard trust-region schemes, problem (6) is solved approximately and the computed step is required to provide a sufficient reduction in the model in terms of the Cauchy step , i.e., the minimizer of the model along the steepest descent within
[TABLE]
Then, if a sufficient reduction in the function is achieved, the step is accepted and the new iterate is set equal to . Otherwise, the step is rejected and the trust-region radius is reduced. The specific form of the predicted and actual reduction used in the acceptance criterion will be given below, after detailing the Algorithm’s steps.
Now we present the new Algorithm iretr which aims at finding an –accurate first-order optimality point defined as follows
[TABLE]
and comment on it, see Algorithm 1.
Given , and we describe the th iteration. In Step 1 the feasibility is improved. If , we predict the cardinality such that the value is smaller than and at most equal to a prefixed fraction of . In case , taking into account that and are integers it can be shown that condition (11) holds if and only if provided that .
In Step 2, an attempt is made to reduce the computational effort i.e. to enlarge infesibility; is chosen such that and the bounded deterioration (12) on the value of with respect to is imposed. In principal such control allows us to reduce below both and . On the other hand, the upper bound in (12) depends on the trust-region radius and will be equal to whenever is small enough. If , the stopping criterion is checked. This is supported by the fact that, when , we may expect be close to and be close to in a probabilistic sense; we will further discuss this issue in Section 3. If (10) is not met, using , the trust-region model is built and (6) is approximately solved. The computed step is required to provide the sufficient reduction (13) in the model in terms of the Cauchy step .
The acceptance rule for in Step 5 depends on the predicted and actual reduction defined as follows:
[TABLE]
where the last equality follows from (4). We observe that uses the last accepted values and and is a linear combination of two predicted values: the predicted model decrease and the predicted infeasibility decrease . As for , given , it measures the actual reduction of .
The new penalty parameter computed in Step 4 is the largest value that ensures
[TABLE]
as by (11). In case such condition implies strictly positive. In case , reduces to and from (13) it follows whenever . On the other hand, in case and , Step 3 is necessary to enforce positivity of as . In fact, follows from taking a step such that . We further notice that attempting when is meaningful if the model is a good approximation of around and thus one can expect some progress, or at least a limited deterioration in the value of the full objective function . Enforcing is a minimal requirement on the agreement between at and the model at the trial step.
Finally, in Step 5 the step is accepted if the ratio between the predicted reduction and the actual reduction is larger than a prefixed scalar , otherwise the trust-region radius is reduced and the procedure is repeated starting from Step 2.
Notice that the trust-region size can be reduced several times during one iteration, i.e., only successful iterations yield to the increment of the iteration counter . To emphasize this fact, within each iteration, we introduce an additional counter for the number of decreases of the trust-region size. The feasibility measure might be modified several times within one iteration as well, but changes due to (12) and (14) do not necessarily correspond to the number of reductions of the trust-region size. The penalty parameter has an analogous behaviour. For this reason and to avoid notation clustering, we do not introduce additional counters for and within the same iteration.
We start the analysis of the new method proving that the th iteration of Algorithm iretr is well defined since appropriate values of and will be reached in a finite number of attempts. Here and in Section 5, can be the null matrix and our analysis covers the use of both first-order and second-order models.
Lemma 2.1
Steps 2 and 3 of Algorithm iretr are well-defined.
Proof. For any positive inequality (12) trivially holds in the limit case . Analogously, Step 3 can not be repeated infinitely many times as for large enough, will be small enough to yield .
We now make the following assumption.
Assumption 2.2
* where is a compact set in .*
Lemma 2.2
Let Assumptions 2.1 and 2.2 hold. Suppose that , , are continuous in Then the sequence built in Algorithm iretr is positive, nonincreasing and bounded away from zero, with independent of and (19) holds.
Proof. We have and proceed by induction assuming that is positive. First consider the case where (equivalently ). Then and, due to Step 3, for any positive . Thus and (19) holds.
Let now suppose . If inequality holds then satisfies (19). Otherwise, we have
[TABLE]
and since the right hand-side is negative by construction, it follows
[TABLE]
Consequently, is satisfied if
[TABLE]
i.e., if
[TABLE]
Hence is the largest value satisfying (19) and
Let us now prove that Using Assumptions 2.2 and continuity of , , let
[TABLE]
Then, using (3), for such that there holds
[TABLE]
and therefore for any integer ,
[TABLE]
Also note that by (11) and (3)
[TABLE]
Moreover,
[TABLE]
and in (15) satisfies
[TABLE]
and the proof is completed.
To establish the well-definiteness of Steps 4 and 5, we make the following assumptions.
Assumption 2.3
The gradients , , are Lipschitz continuous on the segments , for all and for all generated in the repetition of Steps 2–5.
Assumption 2.4
There exists positive such that for all
[TABLE]
By Assumption 2.3 there is a such that
[TABLE]
[18, Lemma 4.1.2]. Consequently, using Assumptions 2.2–2.4 we have
[TABLE]
with and depending on the Lipschitz constants of , .
In the next result we use the key inequality
[TABLE]
with , see [16, Theorem 6.3.1].
Lemma 2.3
Let Assumptions 2.1– 2.4 hold. Assume and as in (15). Then, Steps 4 and 5 of Algorithm IRETR are well defined.
Proof. Let us prove that is strictly positive if is small enough, i.e., after a finite number of reductions of the trust-region radius. Let be computed at Step 4 for some By (17) and (18), we have
[TABLE]
We now distinguish three cases.
If then using (19) we get
[TABLE]
The first term in the above right hand-side is strictly positive and uniformly bounded from below due to (22). On the other hand, by (23) and (12)
[TABLE]
Therefore, for small enough we have and the iteration finishes.
If (equivalently ) and then using (17) and (18) we have
[TABLE]
Thus, by (13), (23) and (24), if is small enough we get
[TABLE]
and the last bound is positive for some finite .
Finally, suppose (equivalently ) and then using (17) and (18) we have
[TABLE]
Thus, by Step 3 of Algorithm 2.1, (23) and (24), if is small enough we get
[TABLE]
and the last bound is positive for some finite .
The analysis presented in the rest of this section concerns the case where Algorithm iretr is invoked with and does not terminate in a finite number of steps. Each iteration of the Algorithm ends up with the accepted iterate and the final sample size In the following statements we are going to prove that and therefore the full sample is eventually reached and maintained.
Theorem 2.4
Let Assumptions 2.1–2.4 hold. Then .
Proof. Inequalities (11) and (19) imply
[TABLE]
We prove by contradiction that .
Taking into account that at termination of iteration we have and , using (16) and (18) we have
[TABLE]
Using (20) we can rewrite the above inequality as
[TABLE]
Then using recurrence, and we get
[TABLE]
Repeating this argument, using from Lemma 2.2 and (3) we obtain
[TABLE]
[TABLE]
and therefore
[TABLE]
where
[TABLE]
is independent of .
Noting that , we can conclude that if is not tending to zero, then is diverging and this implies that is unbounded below in . This contradicts the compactness of .
Corollary 2.5
Let Assumptions 2.1–2.4 hold. Then for all sufficiently large.
Proof. By Theorem 2.4 and Assumption 2.1, it follows for all sufficiently large. This implies .
Corollary 2.6
Let Assumptions 2.1–2.4 hold. Then, for sufficiently large, the iterations are generated by a (standard) trust-region scheme on and
i) .
ii) , provided that is Lipschitz continuous in .
Proof. By Corollary 2.5 we know that at termination of iteration we have for all sufficiently large. Thus eventually, with satisfying (16) which now takes the form of the standard acceptance rule of the trial point in trust-region methods, i.e,
[TABLE]
As a consequence, Theorem 4.6 in [32] yields item . Item is guaranteed by [32, Theorem 4.7].
3 On the realization of the algorithm
The realization of Algorithm iretr raises many issues and in this section we discuss two important aspects: the form of the model used and related properties, and a computationally convenient adaptation of the rule for choosing eventually. We will further address implementation issues in Section 6.
Various models of the form (5) can be built. One possibility is the linear model
[TABLE]
which gives rise to a gradient method and step
[TABLE]
Namely, Algorithm iretr becomes a subsampled gradient method with variable stepsize determined accordingly to the trust-region strategy.
Another possibility is to use quadratic models of the form
[TABLE]
and fully exploit the advantages of the trust-region framework. If all functions are twice continuously differentiable one can build the quadratic model
[TABLE]
with and . In fact, the Hessian matrix is approximated via subsampling by
[TABLE]
The cardinality of now controls the precision of Hessian approximation and allows for trade-off between precision and computational costs. This particular form of Hessian approximation will be analysed in details for strongly convex functions in the next section.
The use of quadratic models is crucial for the computation of -approximate second order critical point of nonconvex problems (1), i.e., a point such that
[TABLE]
Supposing that full precision is reached, , the trust-region problem (6) has to be solved approximately finding such that
[TABLE]
where is the Cauchy point (9) and is a negative curvature direction such that for some , [16, §6.6].
We refer to [38, Theorem 1] for results on the computation of approximated second-order optimal solutions using trust-region methods with full function and gradient and subsampled Hessian.
Let us now address the choice of the stopping criterion in Algorithm iretr. Notice that the Algorithm may stop even if full precision at iteration is not achieved (i.e. ), provided that . This choice is supported by observing that suitable sample sizes provide an accurate approximation to . In fact, by [4, Theorem 6.2] is sufficiently accurate with fixed probability at least , i.e.,
[TABLE]
if the cardinality satisfies
[TABLE]
with and , and is sampled uniformly in .
We conclude this section observing that, in the current form of the algorithm, at each iteration an attempt is made to use (see Step 2). By Corollary 2.5 we know that, for sufficiently large, such a value will be rejected and this fact implies useless repetitions of Steps 2–5. To overcome this drawback, we replace (12) with
[TABLE]
Then, the following result holds.
Corollary 3.1
Suppose (37) and (38) hold. For sufficiently large, the use of sets of cardinality smaller than is not attempted.
Proof. By Corollary 2.5 and Corollary 2.6, we know that for all sufficiently large and tends to zero. Thus, letting be the iteration index such that , , it follows , .
4 Strongly convex problems
In this section we assume that is strongly convex with strongly convex functions , , and analyze the local behaviour of iretr method when full precision for the function and the gradient has been reached and a quadratic model of the following form is used:
[TABLE]
with , . Thus, we are focusing on the local behaviour of the trust-region method employing second order models with exact function and gradient and subsampled Hessian. Such a method has been investigated in [38] with respect to iteration complexity but not with respect to local convergence.
The additional assumptions used in this section are stated below.
Assumption 4.1
The functions , are twice continuously differentiable and strongly convex in ,
[TABLE]
where, given two matrices and , means that is positive semidefinite.
Trivially, is strongly convex and admits an unique minimizer . Moreover, is as in (33), both and hold and Corollary 2.6 implies .
The following theorem analyzes the behaviour of denoting
[TABLE]
the error between and . We also invoke the assumption below.
Assumption 4.2
The Hessian is Lipschitz continuous on with Lipschitz constant .
Theorem 4.1
Suppose that Assumptions 2.1, 2.2, 4.1, 4.2 hold. Let be generated by Algorithm iretr, as in (10), as in (24), as in the Algorithm iretr and given by (33).
i) Let and such that
[TABLE]
Then, if is sufficiently large, is accepted in the first pass in Step 5 and
*ii) There exist sufficiently small and sufficiently large such that, for all and the error reduces linearly, i.e., for some . *
Proof. Let us consider sufficiently large such that at termination of iteration . Lemma 6.5.1 in [16] gives
[TABLE]
Let us consider the step returned by iteration . Combining (42) with (24) and (13) we obtain
[TABLE]
with .
At Step 5 of the Algorithm, (16) has the form . By Assumption 4.2 and (40), it follows
[TABLE]
where is some scalar in [16, Theorem 3.1.2]. Now, given and as in (41), (42) and Corollary 2.6 imply for large enough, say , and (41) implies the acceptance of the step. Then, is not reduced and for any .
Using (42), Corollary 2.6 and item we can conclude that the trust-region bound becomes inactive for sufficiently large, i.e., the step
[TABLE]
is accepted eventually. Consequently, using multivariate calculus results [18, Lemma 4.1.12] and Assumption 4.1
[TABLE]
Thus, the claim follows if and are such that and satisfies (41).
Item above may require a rather large value which is adverse for practical computation. A more stringent condition on of the form yields quadratic convergence but again such might be very close to . We next investigate on the more realistic situation where the Hessian accuracy requirement in (41) is guaranteed only with high-probability and provide a linear convergence result in expectation.
Let us now suppose that, given an accuracy requirement , the probability of being smaller than is larger than :
[TABLE]
for . If the subsample is chosen randomly and uniformly, then the lower bound on the sample size ensuring (45) takes the form
[TABLE]
The above bound is derived in [5, Lemma 3.1] and a similar bound is given in [3, Lemma 4].
We now provide a linear convergence result in expectation; the step taken is the global minimizer of (6), i.e.,
[TABLE]
for some , see [16, Theorem 7.2.1].
Theorem 4.2
Suppose that Assumptions 2.1, 2.2, 4.1, 4.2 hold. Let be generated by Algorithm iretr invoked with in (10), as in (33) and being the global minimizer of (6). If (45) holds and there exists a such that for all
[TABLE]
then there exist , , sufficiently small such that
[TABLE]
for all large enough and some
Proof. Take , , such that
[TABLE]
for some Let large enough such that .
Denote by the event
[TABLE]
Then and where denotes the event does not occur. If happens then using multivariate calculus results [18, Lemma 4.1.12], Assumption 4.1, (47) and (49)
[TABLE]
Otherwise, if is realized then by (42) we have
[TABLE]
Therefore,
[TABLE]
where we have used (50) and .
5 Worst-case iteration and evaluation complexity to first-order critical points
In this section we provide an upper bound on the number of iterations and function-evaluations needed to find an -accurate first-order optimality point (10). The number of function-evaluations is intended as the number of evaluations of functions of the form , for some . We recall that a standard trust-region approach shows worst-case iteration and full function complexity for first-order optimality [22].
Recalling that is equivalent to , consider the following partition of iteration indices :
- •
,
- •
,
- •
.
The value of may change within iteration before acceptance of the iterate; above is the value at the end of iteration , i.e., the value used for building the accepted iterate .
Our analysis is carried out fixing in Algorithm iretr and the first result provides a lower bound on the trust-region radius at termination of iteration .
Lemma 5.1
Let Assumptions 2.1–2.4 hold. Suppose furthermore in Algorithm iretr. Then,
i) for any
[TABLE]
ii) for any ,
[TABLE]
for some positive and as in the Algorithm.
Proof. The initial may be reduced in Steps 3 and 5 of the Algorithm. Step 3 is performed only if .
Let us consider case . Since equation (26) becomes
[TABLE]
From (25), inequality (16) is satisfied whenever
[TABLE]
Thus, using (11), if
[TABLE]
then (16) holds and the claim follows from the rule for decreasing in Step 5 of Algorithm iretr.
Let us consider case . Concerning Step 3, it is performed as long as . Then, (12) ensures that at termination of the loop in Steps 2–3
[TABLE]
Concerning Step 5, first suppose and with as in (24). Using (27) we can conclude that if
[TABLE]
then (16) is satisfied.
Suppose now and . Using , equation (29) becomes
[TABLE]
and if
[TABLE]
then (16) is satisfied.
The upper bound on for is sharper than the one obtained for . Then, due to the rule used to decrease in Step 5, we can conclude that, at iteration , condition (16) is satisfied if
[TABLE]
and the claim follows.
Theorem 5.2
Let Assumptions 2.1–2.4 hold. Suppose furthermore in Algorithm iretr and let the lower bound of in . Then,
i) the cardinality satisfies
[TABLE]
with , as in (32), as in Lemma 2.2, and as in the Algorithm iretr.
ii) the cardinality satisfies
[TABLE]
with positive , .
Proof. Let us denote with the last iterate of Algorithm iretr and note that by definition of the algorithm. From (31) it follows
[TABLE]
and consequently (30) yields
[TABLE]
Then the number of indices such that is bounded above by
[TABLE]
and follows.
Let us consider the case . Note that by (18), (16), (17), (21) and (13), we have
[TABLE]
Then, by using (53) and (24) it follows
[TABLE]
Moreover, note that due to the definition of and inequalities (19) and (16), the following inequality holds at termination of each iteration :
[TABLE]
Then, since is positive,
[TABLE]
and this implies
[TABLE]
This implies
[TABLE]
Then, (58), (55), (56) and yield
[TABLE]
and claim follows.
Considering that is an optimality measure and is expected to be small, it is reasonable to suppose that
[TABLE]
Under this condition, Theorem 5.2 gives the iteration complexity
[TABLE]
As a consequence, for suitable values of , the worst-case iteration complexity of the standard trust-region method is retained, despite inaccuracy in functions and gradients. This result is stated below, where we count the number of iterations needed to satisfy or and , i.e., iterations in and iteration .
Corollary 5.3
Let Assumptions 2.1–2.4 hold. Assume furthermore in Algorithm iretr. Then, there exists a constant such that Algorithm iretr needs at most
[TABLE]
iterations, provided that and (59) holds.
In case , it holds and implies . In case is larger, the number of iterations taken before full-accuracy is reached may deteriorate the complexity of the standard trust-region approach.
In order to derive the worst-case function evaluation complexity we need to bound the total number of trust-region reductions as each trust-region reduction calls for one (possibly subsampled) function evaluation at trial point .
Theorem 5.4
Let Assumptions 2.1–2.4 hold. Assume furthermore in Algorithm iretr and let be the number of trust-region reductions at a generic iteration of the algorithm. Then, for any ,
[TABLE]
where
[TABLE]
Proof. Let us proceed by induction. By the updating rules of the trust-region radius in Step 5 of Algorithm iretr, at termination of the iteration we have
[TABLE]
Then, assume that at iteration
[TABLE]
with . At the end of iteration , after reductions of the trust-region radius we have
[TABLE]
and consequently,
[TABLE]
i.e., (60) holds for any . Taking into account that Lemma 5.1 ensures that iteration terminates with , in the adverse case where the initial is given by (see (60)), at termination of iteration we are ensured that
[TABLE]
This yields the thesis, taking into account that . Using the previous results we can now state our function evaluation complexity result.
Corollary 5.5
Let Assumptions 2.1–2.4 hold. Assume furthermore in Algorithm iretr. Then, if and satisfies (59) and it is independently of , there exists a constant such that Algorithm iretr needs at most
[TABLE]
function evaluations, where is given in Corollary 5.3.
Proof. Assumption , (59) and independent of ensure , for some positive . Then Corollary 5.3 and Theorem 5.4 yield the thesis.
6 Numerical experiments
In this section we report on our numerical experience with Algorithm iretr employing the second order model (5) and equal to a fixed fraction of . Our aim is to show that our adaptive and deterministic strategy for choosing the sample size and the use of subsampled functions, gradients and Hessians is effective and provides a gain in the overall computational cost with respect to a standard trust-region approach. To this end, we compare our method with “standard” trust-region implementations, i.e. implementations where functions and gradients are computed at full accuracy too. Specifically, we compare with the implementation, named statr_sh, employing full functions and gradients and subsampled Hessian as in (33) with , and with the implementation, named statr_fh, where functions, first and second order derivatives are computed at full accuracy.
All the results have been obtained running a Matlab R2019b code on an Intel Core i5-6600K CPU 3.50 GHz x 4, 16.0GB RAM.
6.1 Test problems
We tested our method both on convex and nonconvex problems arising in binary classification problems. Let denote the pairs forming the data set with being the vector containing the entries of the -th example and being its label. The data set we employed are displayed in Table 1. In the table for each data set we report the number of training examples and the dimension of each instance. Moreover we report the number of elements in the testing set .
We performed a logistic regression to solve classification problems associated to the data sets Mushrooms, Cina0 and Gisette. In this case and the strongly convex objective function is given by the logistic loss with -regularization
[TABLE]
Classification problems associated with the remaining data sets were solved using the sigmoid function and least-squares loss. Here and the non-convex objective function has the form
[TABLE]
6.2 Implementation issues
The trust-region parameters of the procedures under comparison are fixed as
[TABLE]
The trust-region problem is solved approximately using CG-Steihaug method [16]. The Conjugate Gradient (CG) method is applied without preconditioning and the procedure is stopped when the relative residual becomes smaller than or a maximum of iterations is performed. In Step 5, in case of successful iterations, we update the trust-region radius as follows. If we set , otherwise we set .
Focusing on Algorithms iretr, we tested two rules for choosing the sample size. In the first implementation, later referred to as iretr_d, the sample size varies dynamically. The infeasibility measure and the initialization parameters for inexact restoration are:
[TABLE]
The parameters are used in (12). The updating rules for choosing , in Steps 1 and 2 are the following:
[TABLE]
We note that the choice of falls into (11) with .
In the second implementation, we set again
[TABLE]
Then, the sample size is increased according the geometric growth:
[TABLE]
We will refer to this implementation as iretr_gg. We note that this choice of amount to choosing in (12).
In both implementations iretr_d and iretr_gg the first time that occurs, then the value of the trust-region radius is set to . Moreover, the Hessian matrix is formed via (33) with
[TABLE]
Thus, the Hessian sample size changes dynamically until the full sample for function and gradient is reached. The sets and are generated using the Matlab function randsample with no replacement. When the sample size is increased, the new sample set can be computed from scratch or can be obtained randomly adding new samples to the previous sample set. Despite this latter choice produces computational savings, in view of a truly random process we generate each from scratch.
Concerning the stopping criteria, for all the algorithms under comparison, we imposed a maximum of iterations and we declared a successful termination when one of the two following conditions is met
[TABLE]
with . We underline that for iretr_d and iretr_gg the above checks are on possibly subsampled functions and gradients and we allow for termination before full precision is reached.
The initial guess is for all runs.
6.3 Numerical results
The first set of results presented shows the performance of Algorithms iretr_d, iretr_gg, statr_sh and statr_fh. In our test problems, the main cost in the computation of for any is the scalar product . Once this product is evaluated, it can be reused for computing and . In particular, computing times a vector at each CG iteration requires a scalar product i.e., it is as expensive as evaluating . Therefore, if one full function evaluation is denoted as nfe, computing costs nfe while each CG iteration costs nfe. Since the selection of sets and in Algorithms iretr_d, iretr_gg and statr_sh is random, the cost associated to such algorithms is measured on average over 50 runs.
In Table 2 for each method and for each data set we report the number nfe of full function evaluations performed and the percentage of saving obtained by Algorithm iretr_d with respect to iretr_gg, statr_sh and to statr_fh. First, we can observe that Algorithm iretr_d is in general less costly than the variant iretr_gg; this indicates that the dynamic choice of the sample size, aiming to make slow progress to full precision, is effective and does not deteriorate the performance of iretr when the geometrical growth of the sample size is the most effective (see the results for Htru2). Second, we observe a remarkable saving of both iretr_d and iretr_gg with respect to the full standard trust-region for all the data sets used; compared to statr_sh the saving is lower, as expected, but still considerable overall.
To give more insight into the two implementations iretr_d, in Figures 1 and 2 we plot the sample size versus the iterations for Mushrooms and A9a problems. The dashed line plots versus iterations, that is the sample size corresponding to the geometric growth used in iretr_gg. The increase of along iterations in iretr_d is considerably slower than that provided by the geometric growth; in two runs, the cardinality in iretr_d reaches the value , as expected from the theory, but in the first phase of the iterative process it is a small fraction of and decreases at some iterations. In the other two runs, iretr_d does not reach full precision, iterations terminate with a cardinality , corresponding to the 56% of the training set and , corresponding to the 72% of the training set, respectively. In fact, despite the adaptive strategy of iretr yields for sufficiently large, our stopping rule (62) is applied on possibly subsampled functions and gradients. This feature is in accordance with the motivations for using subsampling: data in a training set show redundancy and in general using subsets of the sample data is enough to provide a small testing error. At this regard, consider Figure 3 related to the data set Mushrooms, . At each iteration and for three runs corresponding to different sample sizes at termination, we plot the training loss versus the value of ; at termination: =1941 (dashed line), = 4241 (dash-dotted line), (solid line). We also display the testing loss at termination. Although in two runs the final sample size is approximately 39% and 85% of the data in the training set, interestingly the testing loss is in between and in all runs. Thus, monitoring the values of subsampled functions and gradients in (62) is effective.
The previous discussion is supported by further observations. In Figure 4, we plot the value of the training loss versus the number of function evaluations required to solve Mushrooms and Htru2 problems with iretr_d, statr_sh and statr_fh. In these runs, iretr_d terminates with in Mushrooms problem while terminates with (74% of the samples) in Htru2 problem. At termination, the values of both the training loss and the testing loss provided by the three methods are similar and this feature further supports both termination before full precision is reached and the inexact restoration approach for handling subsampled functions and derivatives.
Finally, Figure 5 refers to the dataset Cina0 and displays the values of the training and testing logistic loss along the iterations of iretr_d using the tolerance in (62). In the progress of the iterations the loss values settle and performing the last thirteen iterations is pointless.
Acknowledgement Dedicated with friendship to José Mario Martínez for his outstanding scientific contributions.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Bastin F., Cirillo C., Toint P.L., An adaptive Monte Carlo algorithm for computing mixed logit estimators, Computational Management Science 3(1), 55-79, 2006.
- 2[2] Bastin F., Cirillo C., Toint P.L., Convergence theory for nonconvex stochastic programming with an application to mixed logit, Mathematical Programming, 108, 207-234, 2006.
- 3[3] Bellavia, S., Gurioli, G., Morini, B., Adaptive cubic regularization methods with dynamic inexact Hessian information and applications to finite-sum minimization, IMA J. Numerical Analysis, 2020, drz 076, https://doi.org/10.1093/imanum/drz 076
- 4[4] Bellavia, S., Gurioli, G., Morini, B., Toint, Ph.L., Adaptive regularization algorithms with inexact evaluations for nonconvex optimization, SIAM Journal on Optimization, 29(4), pp. 2281–2915, 2019.
- 5[5] Bellavia, S., Krejić, N., Krklec Jerinkić, N., Subsampled Inexact Newton methods for minimizing large sums of convex function, IMA Journal of Numerical Analysis, 2019, https://doi.org/10.1093/imanum/drz 027
- 6[6] Berahas A. S., Bollapragada R., Nocedal J., An Investigation of Newton-Sketch and Subsampled Newton Methods, Optimization Methods and Software, 2020, https://doi.org/10.1080/10556788.2020.1725751
- 7[7] Birgin, G.E., Krejić, N., Martínez, J.M., On the employment of Inexact Restoration for the minimization of functions whose evaluation is subject to programming errors, Mathematics of Computation 87(311), 1307-1326, 2018.
- 8[8] Birgin, G.E., Krejić, N., Martínez, J.M., Iteration and evaluation complexity on the minimization of functions whose computation is intrinsically inexact, Mathematics of Computation, 89, 253-278, 2020.
