Adaptive Sequential Machine Learning
Craig Wilson, Yuheng Bu, Venugopal Veeravalli

TL;DR
This paper extends a framework for adaptive sequential optimization to machine learning tasks, proposing methods to select sample sizes dynamically to control excess risk, validated through experiments on synthetic and real data.
Contribution
It introduces an adaptive sampling method based on minimizer change estimates for machine learning, enhancing efficiency in stochastic optimization.
Findings
The proposed method effectively controls excess risk.
Adaptive sampling reduces computational costs.
Experimental results validate the approach.
Craig Wilson, Yuheng Bu and Venugopal Veeravalli
University of Illinois at Urbana-Champaign
{wilson60, bu3, vvv}@illinois.edu
Research reported in the paper was supported by the NSF under award CCF 11-11342, and by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196, through the University of Illinois at Urbana-Champaign. Part of this work was presented in ICASSP 2016 [1] and Asilomar Conference 2016 [2]. Craig Wilson is now at Google.
Abstract
A framework previously introduced in [3] for solving a sequence of stochastic optimization problems with bounded changes in the minimizers is extended and applied to machine learning problems such as regression and classification. The stochastic optimization problems arising in these machine learning problems are solved using algorithms such as stochastic gradient descent (SGD). A method based on estimates of the change in the minimizers and properties of the optimization algorithm is introduced for adaptively selecting the number of samples at each time step to ensure that the excess risk, i.e., the expected gap between the loss achieved by the approximate minimizer produced by the optimization algorithm and the exact minimizer, does not exceed a target level. A bound is developed to show that the estimate of the change in the minimizers is non-trivial provided that the excess risk is small enough. Extensions relevant to the machine learning setting are considered, including a cost-based approach to select the number of samples with a cost budget over a fixed horizon, and an approach to applying cross-validation for model selection. Finally, experiments with synthetic and real data are used to validate the algorithms.
I Introduction
Consider solving a sequence of machine learning problems by minimizing the risk, i.e., expected value of a fixed loss function at each time :
[TABLE]
where denotes the underlying (unknown) probabilistic model for the data at time . For regression, corresponds to the {predictors, response} pair at time and parameterizes the regression model. For classification, corresponds to the {features, label} pair at time , and parameterizes the classifier. Although motivated by regression and classification, our framework works for any loss function that satisfies certain properties discussed in Section II-B.
We assume that the change in the problems is bounded by imposing a condition on the minimizers of the function . We assume that the problems change at a bounded but unknown rate:
[TABLE]
The value of is unknown to us.
Under this model, we find approximate minimizers of each function by drawing samples at time . We do not make any assumptions about the particular optimization algorithm that may be used to find the approximate minimizers. As an example, we could use these samples in an optimization algorithm such as SGD. We evaluate the quality of our approximate minimizers through an excess risk criterion , i.e.,
[TABLE]
which is a standard criterion for optimization and learning problems [4]. Our goal is to determine adaptively the number of samples required to achieve a desired excess risk for large enough with unknown. As is unknown, we will first construct an estimate of . Given an estimate of , we determine selection rules for the number of samples to achieve a target excess risk .
This paper is a continuation of the work initiated in [3]. We specialize the results in [3], which were given for general functions , to the specific form in (1), and provide new results that are specifically relevant to machine learning problems. We develop a bound to show that our estimate is non-trivial provided that the excess risk is small enough. We also consider extensions relevant to the machine learning setting, including a cost-based approach to select the number of samples with a cost budget over a fixed horizon, and an approach to applying cross-validation for model selection. Some of the results in this paper have been reported in conference publications [1] and [2], which do not contain proofs of the key results due to space limitations. Moreover, we provide substantially more detailed numerical results and simulations in this paper than those given in [1] and [2].
I-A Related Work
Our problem has connections with multi-task learning (MTL) and transfer learning. In multi-task learning, one tries to learn several tasks simultaneously, as in [5], [6], and [7], by exploiting the relationships between the tasks. In transfer learning, knowledge from one source task is transferred to another target task either with or without additional training data for the target task [8]. For multi-task and transfer learning, there are theoretical guarantees on regret for some algorithms [9]. Multi-task learning could be applied to our problem by running an MTL algorithm each time a new task arrives, while remembering all prior tasks. However, this approach incurs a memory and computational burden. Transfer learning lacks the sequential nature of our problem.
We can also consider the concept drift problem in which we observe a stream of incoming data that potentially changes over time, and the goal is to predict some property of each piece of data as it arrives. After prediction, we incur a loss that is revealed to us. For example, we could observe a feature and predict the label as in [10]. Some approaches for concept drift use iterative algorithms such as SGD, but without specific models on how the data changes. As a result, only simulation results showing good performance are available.
Another related problem is online optimization, where generally no knowledge is available about the incoming functions other than that all the functions come from a specified class of functions, e.g., linear or convex functions with uniformly bounded gradients. Online optimization models do not include the notion of a desired excess risk bound. Rather, only bounds on the regret over some time horizon have been investigated [11, 12, 13, 14, 15, 16, 17, 18, 19, 20], which is different from the per time-step excess risk guarantee provided in our work.
There has been some work on controlling the variation of the sequence of functions in (1) in [21] and [22]. The work in [22] is the most relevant: there, regret is minimized subject to a bound, say , on the total variation of the gradients over a time interval of interest, i.e.,
[TABLE]
If the functions are strongly convex with the same parameter , then by the optimality conditions (see Theorem 2F.10 in [23]) (4) implies that
[TABLE]
Thus, the work in [22] can be seen as studying the regret with a constraint on the total variation in the minimizers over time instants. In contrast, we control the variation of the minimizers at each time instant with (2) and then seek to maintain an excess risk criterion such as (3) at each time instant.
Another relevant model is sequential supervised learning (see [24]) in which we observe a stream of data consisting of feature/label pairs at time , with being the feature vector and being the label. At time , we want to predict given . One approach to this problem, studied in [25] and [26], is to look at consecutive pairs and develop a predictor at time by applying a supervised learning algorithm to this training data. Another approach is to assume that there is an underlying hidden Markov model (HMM) governing the data [27]. The label represents the hidden state and the pair represents the observation with being a noisy version of . HMM inference techniques are used to estimate .
The adaptation that we discuss in the paper is similar in spirit to that in prior work in adaptive signal processing (see, e.g., [28, 29, 30]), but the techniques that we use are substantively different.
To summarize, none of the prior work discussed in this section involves choosing the number of samples at each time to control the excess risk. Most approaches instead focus on bounding the regret or provide no guarantees.
I-B Paper Outline
The rest of this paper is outlined as follows. In Section II, we specialize the work in [3] to the machine learning problem stated in (1). In Section II-B, we consider the problem of minimizing the sequence of functions in (1) with from (2) known. In Section II-D, we introduce a method to estimate . In Section II-E, we consider solving the sequence of learning problems in (1) with unknown. In Section III, we develop an upper bound on the size of the overshoot of our estimate of above the true value of . In Section IV, we consider a cost based approach to select the number of samples based on the analysis in Section II, and a cross-validation approach. Finally, in Section V, we apply our framework to a variety of machine learning problems on both synthetic and real data.
II Adaptive Sequential Optimization
We summarize our previous work in [3], and apply it to the machine learning problem stated in (1).
II-A Assumptions
We make several assumptions to proceed. First, let be closed and convex with . Define the -algebra
[TABLE]
which is the smallest -algebra such that the random variables in the set are measurable. By convention is the trivial -algebra.
We suppose that the following conditions hold:
A.1 For each , is twice continuously differentiable with respect to .
A.2 For each , is strongly convex with a parameter , i.e.,
[TABLE]
where is the Euclidean inner product between and .
A.3 Given an optimization algorithm that generates an approximate minimizer
[TABLE]
using samples , there exists a function such that the following conditions hold:
1. If and are both -measurable random variables, it holds that
[TABLE]
2. If and are constants, it holds that
[TABLE]
3. The bound is non-decreasing in and non-increasing in .
A.4 Initial approximate minimizers and satisfy
[TABLE]
with and known.
Remarks: For assumption A.3, we assume that the bound depends on the number of samples and not the number of iterations. For the basic version of SGD, generally the number of iterations equals , as each sample is used to produce a noisy gradient. See Appendix A of [3] for a discussion of useful bounds. For some bounds , we may need to know parameters such as the strong convexity parameter. Estimating these parameters is discussed in Appendix C of [3]. Finally, for assumption A.4, we can fix and set for .
II-B Change in Minimizers Known
Following [3], we examine the case when the change in minimizers, in (2), is known. Suppose that bounds the excess risk at time . Using the triangle inequality, strong convexity, Jensen’s inequality, and (2), we have
[TABLE]
Now, by using the bound from assumption A.3, we set
[TABLE]
yielding a sequence of bounds on the excess risk. Note that this recursion only relies on the immediate past at time through . To achieve for all , we set
[TABLE]
and for with
[TABLE]
In comparison, if we did not exploit the fact that the change is bounded by , we would use the estimate to bound and select . If the bound in (9) is smaller than , then we would need significantly fewer samples to guarantee a desired excess risk.
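The sample-size rule of this section can be sketched in code. This is an illustrative sketch only, not the paper's implementation. The names `eps` (target excess risk), `rho` (bound on the change in minimizers), `m` (strong convexity parameter), and the function `example_bound` are assumptions introduced here. The sketch assumes that strong convexity gives E||x - x*||^2 <= (2/m) * (excess risk), so that, after the triangle inequality and Jensen's inequality, the squared distance to the next minimizer is at most (sqrt(2*eps/m) + rho)^2. It then searches for the smallest number of samples at which a bound that is non-increasing in the number of samples drops to `eps`.

```python
import math
from typing import Callable


def initial_distance_sq(eps_prev: float, rho: float, m: float) -> float:
    """Assumed bound on the squared distance from the previous approximate
    minimizer to the current exact minimizer."""
    return (math.sqrt(2.0 * eps_prev / m) + rho) ** 2


def example_bound(d0_sq: float, k: int, a: float = 1.0, c: float = 1.0) -> float:
    """Hypothetical SGD-style bound: an asymptotic term plus an
    initial-distance term. The exact form in the paper is not reproduced."""
    return a / k + c * d0_sq / k


def min_samples(eps: float, rho: float, m: float,
                bound: Callable[[float, int], float],
                k_max: int = 1 << 40) -> int:
    """Smallest K >= 1 with bound(d0_sq, K) <= eps, assuming the bound is
    non-increasing in K."""
    d0_sq = initial_distance_sq(eps, rho, m)
    hi = 1
    # Double K until the target is met.
    while bound(d0_sq, hi) > eps:
        hi *= 2
        if hi > k_max:
            raise ValueError("target excess risk not reachable within k_max")
    if hi == 1:
        return 1
    lo = hi // 2  # bound(d0_sq, lo) > eps by construction
    # Bisect between the last failing and first passing sample size.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if bound(d0_sq, mid) <= eps:
            hi = mid
        else:
            lo = mid
    return hi


k_star = min_samples(eps=0.1, rho=0.5, m=1.0, bound=example_bound)
```

Under this hypothetical bound, a larger `rho` never decreases the number of samples returned, which matches the monotonicity required of the bound in the assumptions.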
II-C * May Be Too Large*
In this section, we look at a case where can be too large. Suppose that , so the problems are not changing. In this case, we only need to take training samples at the first time instant and then we can stop taking samples, i.e., and for .
Suppose that and . In this case, from the analysis in the previous section, we pick
[TABLE]
For an algorithm like SGD, the bound is roughly of the form (see [3]):
[TABLE]
The first term captures the asymptotic behavior of SGD and the second term accounts for the initial distance . This form of implies that . However, by picking for all , we could achieve for all .
This shows that the choice of is conservative and can be too large if the initial distance . As a general rule, the choice of is useful if the term that depends on the initial distance, , is comparable to the asymptotic term, , in the bound.
II-D Estimating the Change in the Minimizers
In practice, we do not know , so we must construct an estimate using the samples from each distribution . We introduce an approach to estimate the one time step change, , and methods to combine these estimates to produce an overall estimate of . First, we work with the assumption that
[TABLE]
as an intermediate step, and second, under assumption (2). These estimates are from [3]. For appropriately chosen sequences and for all large enough, we have almost surely. With this property, analysis similar to that in Section II-B holds, which is provided in Section II-E.
II-D1 Estimating One Step Change
First, we develop an estimate of the one step changes using a method from [3]. Implicitly, we assume that all one step estimates are bounded by , since trivially .
Using the triangle inequality and variational inequalities from [23] yields
[TABLE]
We then approximate by a sample average approximation to yield the following estimate called the direct estimate:
[TABLE]
II-D2 Combining One Step Estimates For Constant Change
Assuming that from (12), we average the one step estimates to yield an overall estimate
[TABLE]
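The one-step estimate and its average can be sketched as follows. This is a sketch under stated assumptions, not the paper's exact expression: it assumes the direct estimate takes the form ||x_i - x_{i-1}|| + ||g_i||/m + ||g_{i-1}||/m, where g_i is the sample average of the loss gradients at the approximate minimizer x_i and m is the strong convexity parameter. That form follows from the triangle inequality together with the strong convexity fact ||x - x*|| <= ||grad f(x)||/m. The function names and argument layout are hypothetical.

```python
import numpy as np


def direct_estimate(x_prev, x_curr, grads_prev, grads_curr, m):
    """One-step estimate of the change in minimizers (assumed form).

    grads_prev and grads_curr hold one per-sample gradient per row,
    evaluated at x_prev and x_curr respectively.
    """
    g_prev = np.mean(np.asarray(grads_prev, dtype=float), axis=0)
    g_curr = np.mean(np.asarray(grads_curr, dtype=float), axis=0)
    step = np.linalg.norm(np.asarray(x_curr, dtype=float)
                          - np.asarray(x_prev, dtype=float))
    return float(step + np.linalg.norm(g_prev) / m + np.linalg.norm(g_curr) / m)


def averaged_estimate(one_step_estimates):
    """Average of the one-step estimates, for the constant-change case."""
    return float(np.mean(one_step_estimates))


# Toy check with f_i(x) = (m/2) * ||x - c_i||^2, whose gradient is m * (x - c_i).
m = 2.0
c_prev, c_curr = np.array([0.0, 0.0]), np.array([1.0, 0.0])
x_prev, x_curr = np.array([0.1, 0.0]), np.array([0.9, 0.2])
est = direct_estimate(x_prev, x_curr,
                      [m * (x_prev - c_prev)], [m * (x_curr - c_curr)], m)
```

With exact gradients, as in the toy check, the estimate upper bounds the true change ||c_curr - c_prev|| = 1. With noisy sample-average gradients this holds only approximately, which is why the analysis that follows imposes conditions on the gradients.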
To proceed with our analysis, suppose that the following conditions hold:
B.1
For each , we can draw stochastic gradients such that
[TABLE]
holds
B.2
There exist constants such that
[TABLE]
B.3
There exist constants such that
[TABLE]
B.4
It holds that
[TABLE]
and
[TABLE]
B.5
The gradients are bounded in the sense that
[TABLE]
Assumption B.1 guarantees that the gradients are unbiased. Assumption B.2 controls how fast the gradients grow as we move away from the minimizer . Assumption B.3 controls how far apart two independent outputs of the optimization algorithm and are, starting from . Assumption B.4 controls how the gradient grows for two pairs and . Finally, assumption B.5 is reasonable if the space that contains the has finite diameter and the gradients of the loss function are continuous jointly in . In this case, it holds that
[TABLE]
Theorem 1 from [3] guarantees that the direct estimate from (13) bounds .
Theorem 1.
Provided that B.3-B.5 hold and our sequence satisfies
[TABLE]
it holds that for all large enough
[TABLE]
almost surely, with defined in (17).
Footnote 1: Note that a choice of that is no greater than works here.
Proof.
See [3]. ∎
II-D3 Combining One Step Estimates For Bounded Change
We now look at estimating in the case that . We set
[TABLE]
Although it may seem natural to combine the estimates using
[TABLE]
this method has a serious drawback. Since are random variables, if we combine them by taking their maximum, any particular one step estimate that is large will pull up the overall estimate . This would drive , as , resulting in a that is trivial in the limit of large .
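The drawback of taking a maximum can be illustrated numerically. The sketch below is a toy illustration with an assumed noise model (a true change `rho` plus the absolute value of Gaussian noise), which is not taken from the paper. It compares the running maximum of noisy one-step estimates with their running mean. The running mean is shown only for contrast and is not the remedy the paper adopts for the bounded-change case.

```python
import numpy as np


def running_max(estimates):
    """Running maximum of the one-step estimates."""
    return np.maximum.accumulate(np.asarray(estimates, dtype=float))


def running_mean(estimates):
    """Running mean of the one-step estimates."""
    e = np.asarray(estimates, dtype=float)
    return np.cumsum(e) / np.arange(1, e.size + 1)


rng = np.random.default_rng(0)
rho = 1.0  # hypothetical true change in the minimizers
# Hypothetical noisy one-step estimates that never fall below rho.
one_step = rho + np.abs(rng.normal(scale=0.5, size=1000))

max_est = running_max(one_step)
mean_est = running_mean(one_step)
```

The running maximum can only increase as more estimates arrive, so a single large outlier is never forgotten, whereas the running mean settles.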
We introduce an estimate from [3] that overcomes this defect. We need the following assumptions:
B.4
We have estimates that are non-decreasing in their arguments such that
[TABLE]
B.5
There exist absolute constants for any fixed such that
[TABLE]
For example, if , then
[TABLE]
is an estimator of with the required properties. Also, note that the two conditions on the estimator in B.5 imply that
[TABLE]
Given an estimator satisfying assumption B.5, we compute
[TABLE]
and set
[TABLE]
Under assumptions B.3-B.5, we can then show that
[TABLE]
eventually upper bounds , as stated in the following theorem.
Theorem 2.
Provided that B.3-B.5 hold and our sequence satisfies
[TABLE]
it holds that for all large enough
[TABLE]
with from Theorem 1.
Proof.
See [3]. ∎
II-E Change in Minimizers Unknown
We now present an extension of the results in Section II-B, obtained by replacing with its estimate given in Section II-D. Our analysis depends on the following crucial assumption:
C.1
For appropriate sequences , for all sufficiently large it holds that almost surely.
C.2
factors as
We have demonstrated that assumption C.1 holds for the direct estimate of under (12) and (2). Note that whether we assume (12) or (2) does not matter for analysis. We start with a general result showing that for appropriate choices of , we control the excess risk.
Theorem 3.
Under assumptions C.1- C.2, with for all large enough, where is defined in (11), we have
[TABLE]
almost surely
Proof.
See [3]. ∎
This theorem shows that for any choice of samples such that for large enough, it follows that the excess risk can be controlled in the sense of (22).
II-E1 Update Past Excess Risk Bounds
We first consider updating all past excess risk bounds as we go. At time , we plug-in in place of and follow the analysis of Section II-B. Define for
[TABLE]
If it holds that , then for . Assumption C.1 guarantees that this holds for all large enough almost surely. We can thus set equal to the smallest such that
[TABLE]
for all to achieve excess risk . The maximum in this definition ensures that when , with from (11). We can therefore apply Theorem 3.
II-E2 Do Not Update Past Excess Risk Bounds
Updating all past estimates of the excess risk bounds from time up to imposes a computational and memory burden. Suppose that for all we set
[TABLE]
This is the same form as the choice in (11) with in place of . Due to assumption C.1, for all large enough it holds that almost surely. Then by the monotonicity condition in assumption A.3, for all large enough we pick almost surely. We can therefore apply Theorem 3.
III Bound on -Estimate Overshoot
Since we assume that the solution space has bounded diameter, we always have the trivial bound
[TABLE]
An estimate of the change in minimizers, , is only interesting if the bound is non-trivial, i.e., when . In prior work [3], we have proved that for sufficiently large , almost surely. In this section, we look at proving an upper bound on how much can overshoot to show that this estimate is non-trivial.
When we proved that eventually upper bounds , we did not use the fact that the points at which we are evaluating the one-step estimates are approximate minimizers. In particular, that proof would still hold even if we selected the randomly from the solution space without using the samples at all. In contrast, controlling the overshoot depends critically on the fact that the points at which we evaluate the one-step estimates are approximate minimizers. The solution quality of the approximate minimizers measured by in (3) will control the size of the overshoot, as seen in the following theorem.
Theorem 4.
Suppose that the following conditions hold:
1. The sequence of excess risks achieved, , , satisfies
[TABLE]
2. The loss function has Lipschitz continuous gradients with parameter , i.e.,
[TABLE]
3. For all large enough, we have that for a constant .
Then it follows that
[TABLE]
where
[TABLE]
Proof.
First, we look at the one step estimates. It holds that
[TABLE]
By the Lipschitz gradient assumption, we have
[TABLE]
Then it follows by strong convexity that
[TABLE]
and therefore we have
[TABLE]
Since the square-root is concave, by Jensen’s inequality, we have
[TABLE]
This in turn implies that
[TABLE]
Next, we look at bounding . Define
[TABLE]
Then we have
[TABLE]
Using the direct estimate lower bound analysis from [3] it follows that
[TABLE]
This shows that
[TABLE]
Then plugging in the definition of it follows that
[TABLE]
∎
This shows that the direct estimate is a non-trivial upper bound for sufficiently small . Note that in practice, the will be a function of , since we can pick with defined in (11). Note that is itself a function of . This means the term in (28), which is a function of is also a function of . Thus the entire overshoot term is a function of , and in fact by inspection, it goes to zero as if as (as defined in (11) does).
IV Extensions Relevant to Machine Learning Applications
IV-A Cost Approach
A natural way to assess the usefulness of our approach is to choose a number of samples over a horizon of length using the choice in (24) and (25), and compare against taking
[TABLE]
samples at time and no samples at the other time instants. See Section V for such a comparison.
In this section, we consider a different type of comparison based on assuming that there is a cost of taking samples. For example, we could have
[TABLE]
This implies we pay a fixed cost of any time we take at least one sample and a marginal cost of per sample. We want to control the excess risk by deciding when to take samples, and how many samples to take with a total budget over a horizon of length , i.e.,
[TABLE]
For the option of taking all samples up front:
[TABLE]
Another option is to sample every time instants and divide the cost budget evenly over the times that we take samples using
[TABLE]
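The cost model and the two baseline allocations can be sketched as follows. This is a minimal sketch under assumptions: the cost is taken to be a fixed charge whenever at least one sample is drawn plus a per-sample marginal charge, the periodic schedule is assumed to start at the first time instant, and all names (`sampling_cost`, `upfront_allocation`, `periodic_allocation`, `fixed`, `marginal`, `period`) are hypothetical.

```python
def sampling_cost(k, fixed, marginal):
    """Cost of drawing k samples at one time instant (assumed form)."""
    return fixed + marginal * k if k > 0 else 0.0


def total_cost(allocation, fixed, marginal):
    """Total cost of an allocation over the horizon."""
    return sum(sampling_cost(k, fixed, marginal) for k in allocation)


def upfront_allocation(budget, horizon, fixed, marginal):
    """Spend the whole budget at the first time instant."""
    k_first = max(int((budget - fixed) // marginal), 0)
    return [k_first] + [0] * (horizon - 1)


def periodic_allocation(budget, horizon, period, fixed, marginal):
    """Sample every `period` time instants, splitting the budget evenly."""
    num_sampling_times = len(range(0, horizon, period))
    per_time_budget = budget / num_sampling_times
    k_each = max(int((per_time_budget - fixed) // marginal), 0)
    return [k_each if n % period == 0 else 0 for n in range(horizon)]


up_front = upfront_allocation(budget=100.0, horizon=10, fixed=5.0, marginal=1.0)
periodic = periodic_allocation(budget=100.0, horizon=10, period=5,
                               fixed=5.0, marginal=1.0)
```

Because each sampling time pays the fixed charge, the periodic schedule buys fewer samples in total than the up-front schedule for the same budget (90 versus 95 in this example).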
For analysis, we need Assumption C.1 and the following additional assumptions:
D.1
There exists a function such that
[TABLE]
For example, suppose that the functions have Lipschitz continuous gradients with modulus and for all , where is the interior of . By the descent lemma [31], we have
[TABLE]
Thus, we can set
[TABLE]
Since we need to consider the possibility that for some in but still provide estimates of the excess risk, we need an alternate version of the bound in (23). Define
[TABLE]
where is the last time no later than at which samples were taken. If no samples have been taken so far, then by convention . We construct the recursively defined function by considering the following four cases:
1. No samples have been taken by time :
[TABLE]
2. Samples taken at time for the first time:
[TABLE]
3. No samples taken at time but samples have been taken previously:
[TABLE]
4. Samples taken at time and samples have been taken previously:
[TABLE]
where is the bound on the excess risk at time .
Suppose that over a time horizon of length we have a total cost budget with respect to the number of samples as in (30). Define the excess risk gaps
[TABLE]
with . The variable is the extent to which the target excess risk of is violated upwards. If our excess risk is below our target level , then we set . Our goal is to minimize the size of the , while taking into account the cost constraint in (30). To control the size of , suppose that we have a function that describes the cumulative loss of the excess risk gaps .
We now provide some possible choices for :
[TABLE]
[TABLE]
[TABLE]
with
[TABLE]
The choices given in (33) and (34) penalize the average and maximum excess risk gaps, respectively. In practice, with these choices, we will stop taking samples before the horizon resulting in relatively poor performance towards the end of the horizon. The third choice gets around this problem by penalizing large increasing runs of excess risk gaps, and tends to favor a more uniform choice of the number of samples .
We first consider the case when is known to us and plan over the horizon of length by solving the following optimization problem:
[TABLE]
The idea of this problem is to satisfy the excess risk bound with minimal violation .
To estimate , we need samples from consecutive time instants. Therefore, we impose the constraint that if we take samples at time , then we must take samples at either time or time through the constraint
[TABLE]
The problem in (36) is a mixed integer non-linear programming problem (MINLP). There are no general methods to efficiently solve this MINLP, and we therefore consider a relaxation of this problem later.
In the case that we know , we can plan the number of samples ahead of time before any samples have been taken. When is unknown, we cannot plan over the entire horizon. Instead, at each time instant we have to plan over the remaining time horizon of length , while using the estimate in place of and the remaining cost budget
[TABLE]
We then consider the cost-to-go problem
[TABLE]
This is the same form as (36), except that it is over the time horizon from taking into account the portion of the cost budget that has been expended. In this problem, we only optimize over . This problem is again a MINLP.
Next, we look at approximate solutions to (36) and (37). The major difficulties in solving these programs are that the decision variables are integer-valued and the cost function may be discontinuous at zero due to fixed costs. We consider relaxing to be real-valued and introduce a piecewise approximation of the cost functions :
[TABLE]
Generally, we pick . We consider the relaxed program
[TABLE]
We also relax the indicator constraints to inequality to encourage taking samples at consecutive times. In practice, this forces more gradual changes in samples and makes it easier to solve these problems. This problem can be readily solved by gradient based solvers such as IPOPT [32].
When is unknown, we can repeatedly solve this problem using the latest estimate of by solving the following sequence of problems:
[TABLE]
IV-B Cross Validation
We can also apply cross-validation for model selection. Suppose we have loss functions parameterized by , which controls the model complexity. For example, we could have a quadratic penalty term
[TABLE]
The value of corresponds to the true loss function that we want to minimize. Suppose we have different values of under consideration. For each , we generate an approximate minimizer of
[TABLE]
We want to select the value and corresponding that achieves the smallest loss
[TABLE]
We generate an approximate minimizer for each problem in (40) starting from . To select the best choice of in terms of minimizing (41), we apply cross-validation and set [33].
The idea behind cross-validation is to divide the training samples into equal sized pieces. For every out of pieces, we use the pieces of the training set to generate an approximate solution to (40). We use the remaining piece of the training set to evaluate the empirical test loss achieved by using a sample average approximation. We do this for every possible choice of out of pieces and average the empirical test loss estimates. We then select the value that achieves the smallest empirical test loss.
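The cross-validation procedure described above can be sketched as follows. This is a sketch with a swapped-in solver: instead of an approximate minimizer produced by SGD, it uses the closed-form ridge regression solution for a quadratic loss with a quadratic penalty. The held-out loss is evaluated without the penalty, on the assumption that the unpenalized loss is the true loss of interest. The function names, the loss scaling, and the synthetic data are all hypothetical.

```python
import numpy as np


def ridge_fit(W, y, lam):
    """Closed-form minimizer of the penalized quadratic loss (swapped in
    for an SGD-based approximate minimizer)."""
    n, d = W.shape
    return np.linalg.solve(W.T @ W / n + lam * np.eye(d), W.T @ y / n)


def held_out_loss(W, y, x):
    """Sample average of the unpenalized quadratic loss."""
    return float(np.mean(0.5 * (y - W @ x) ** 2))


def cross_validate(W, y, lambdas, num_folds=5):
    """Return the penalty with the smallest average held-out loss,
    together with the average loss for every candidate."""
    indices = np.arange(len(y))
    folds = np.array_split(indices, num_folds)
    avg_losses = []
    for lam in lambdas:
        fold_losses = []
        for held_out in folds:
            train = np.setdiff1d(indices, held_out)
            x_hat = ridge_fit(W[train], y[train], lam)
            fold_losses.append(held_out_loss(W[held_out], y[held_out], x_hat))
        avg_losses.append(float(np.mean(fold_losses)))
    best = int(np.argmin(avg_losses))
    return lambdas[best], avg_losses


rng = np.random.default_rng(0)
W = rng.normal(size=(200, 3))
y = W @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
candidates = [0.0, 0.1, 10.0]
best_lam, losses = cross_validate(W, y, candidates)
```

On this low-noise synthetic data, a heavy penalty shrinks the solution far from the generating parameters, so the largest candidate is not selected.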
To apply cross-validation to our framework, we run parallel versions of our approach and at time we generate different choices for the number of samples . We then choose
[TABLE]
After choosing , we apply the usual cross-validation approach to select for time . Fig. 1 shows this approach for two values of .
V Experiments
We provide two regression examples for synthetic and real data as well as a classification example for synthetic data. For the synthetic regression problem, we can explicitly compute and and exactly evaluate the performance of our method. It is straightforward to check that all requirements in A.1-A.4 are satisfied for the problems considered in this section. We apply the “do not update past excess risk” choice of here.
V-A Synthetic Regression
Consider a regression problem with synthetic data using the penalized quadratic loss
[TABLE]
with . We further assume that
[TABLE]
Under these assumptions, we can analytically compute minimizers of . We change only and appropriately to ensure that holds for all . We find approximate minimizers using SGD with . We estimate using the direct estimate.
We let range from to with , a target excess risk , and from (25). We average over twenty runs of our algorithm. Figure 2 shows , our estimate of , which is above in general. Figure 3 shows the number of samples , which settles down. We can exactly compute , and so by averaging over the twenty runs of our algorithm, we can estimate the excess risk (denoted “sample average estimate”). We average over the time horizon from to to yield the sample average estimate of the excess risk given by . Therefore, we see that we achieve our desired excess risk.
V-A1 Cost Approach
We consider applying the cost approach in Section IV-A to the synthetic regression problem with the cost in (29). We compare the optimal cost approach introduced in (38) of Section IV-A to the approach in (25), taking all samples at time as in (31), and taking samples every five time instants as in (32). Note that the method from (25) does not satisfy the cost budget. Fig. 4 shows the test loss of these approaches. We achieve similar test loss to the method in (25) and better than the other two methods. Fig. 5 shows the number of samples selected for both methods. At some time instants, our optimal cost approach does not take samples.
This problem is an example of one where the initial distance term in per the discussion from Section III matters. This is evidenced by the fact that when we do not take samples after the first time instant the test loss can grow large quickly as shown in Fig. 4.
V-B Synthetic Classification
Consider a binary classification problem using
[TABLE]
with and . This is a smoothed version of the hinge loss used in support vector machines (SVM) [33]. We suppose that at time , the two classes have features drawn from a Gaussian distribution with covariance matrix but different means and , i.e.,
[TABLE]
The class means move slowly over uniformly spaced points on a unit sphere in as in Figure 6 to ensure that the constant Euclidean norm condition defined in (12) holds. We find approximate minimizers using SGD with . We estimate using the direct estimate.
We let range from to and target an excess risk . We average over twenty runs of our algorithm. As a comparison, if our algorithm takes samples, then we consider taking samples up front at . This is what we would do if we assumed that our problem is not time varying. Figure 7 shows , our estimate of . Figure 8 shows the average test loss for both sampling strategies. To compute test loss we draw additional samples from and compute . We see that our approach achieves substantially smaller test loss than taking all samples up front. We do not draw the error bars on this plot as it makes it difficult to see the actual losses achieved.
To further evaluate our approach we look at the receiver operating characteristic (ROC) of our classifiers. The ROC is a plot of the probability of a true positive against the probability of a false positive. The area under the curve (AUC) of the ROC equals the probability that a randomly chosen positive instance () will be rated higher than a negative instance () [34]. Thus, a large AUC is desirable. Figure 9 plots the AUC of our approach against taking all samples up front. Our sampling approach achieves a substantially larger AUC.
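The probabilistic interpretation of the AUC can be computed directly by comparing every positive score with every negative score. The sketch below is illustrative: the function name `auc_pairwise` is hypothetical, and ties are counted as one half, which is a common convention and an assumption here rather than something stated in the paper.

```python
import numpy as np


def auc_pairwise(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive
    instance receives the higher score; ties count as one half."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return float(wins / (pos.size * neg.size))


# Three of the four (positive, negative) pairs are ranked correctly.
auc = auc_pairwise([0.9, 0.8], [0.1, 0.85])
```

This pairwise form costs time proportional to the number of pairs, so it is suited to small evaluation sets; rank-based formulas give the same value more cheaply on large ones.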
V-C Panel Study of Income Dynamics - Regression
The Panel Study of Income Dynamics (PSID) surveyed individuals annually from 1974 to 2012 to gather demographic and income data [35]. We want to predict an individual’s annual income () from several demographic features () including age, education, work experience, etc. chosen based on previous economic studies in [36]. Conceptually, the idea is to rerun the survey process and determine how many samples we would need if we wanted to solve this regression problem to within a desired excess risk criterion .
We use the same loss function, direct estimate for , and minimization algorithm as the synthetic regression problem. We average over twenty runs of our algorithm by resampling without replacement [33]. For the sake of comparison, given a choice of samples produced by our approach, we compare against taking samples at time and none afterwards. Note that this is what we would do if we believed that the regression model does not change over time. We are aware of no other methods to select the number of samples to control the excess risk against which we could compare our approach.
Figure 10 shows the number of samples , which settles down quickly. Figure 11 shows . Figure 12 shows the test losses over time evaluated over twenty percent of the available samples. The test loss for our approach is substantially less than that obtained by taking the same number of samples up front.
VI Conclusion
We introduced a framework for adaptively solving a sequence of learning problems. We developed estimates of the change in the minimizers used to determine the number of training samples needed to achieve a target excess risk . We introduced a cost based approach to select the number of samples and an approach to apply cross-validation. Experiments with synthetic and real data demonstrate that this approach is effective.
[1] C. Wilson and V.V. Veeravalli, “Adaptive sequential optimization with applications to machine learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, Mar. 2016, pp. 2642–2646.
[2] C. Wilson and V. Veeravalli, “Adaptive sequential learning,” in 2016 50th Asilomar Conference on Signals, Systems and Computers, Nov. 2016, pp. 326–330.
[3] C. Wilson, V.V. Veeravalli, and A. Nedić, “Adaptive sequential stochastic optimization,” arXiv:1610.01970, Oct. 2016. (To appear in IEEE Transactions on Automatic Control, March 2019).
[4] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, The MIT Press, 2012.
[5] A. Agarwal, H. Daumé, and S. Gerber, “Learning multiple tasks using manifold regularization,” in Advances in Neural Information Processing Systems (NIPS), 2011, pp. 46–54.
[6] T. Evgeniou and M. Pontil, “Regularized multi–task learning,” in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2004, KDD ’04, pp. 109–117, ACM.
[7] Y. Zhang and D. Yeung, “A convex formulation for learning task relationships in multi-task learning,” CoRR, vol. abs/1203.3536, 2012.
[8] S. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
