Adaptive Sequential Machine Learning

Craig Wilson; Yuheng Bu; Venugopal Veeravalli

arXiv:1904.02773·cs.LG·April 8, 2019


TL;DR

This paper extends a framework for adaptive sequential optimization to machine learning tasks, proposing methods to select sample sizes dynamically to control excess risk, validated through experiments on synthetic and real data.

Contribution

It introduces an adaptive sampling method based on minimizer change estimates for machine learning, enhancing efficiency in stochastic optimization.

Findings

1. The proposed method effectively controls excess risk.

2. Adaptive sampling reduces computational costs.

3. Experimental results validate the approach.


Equations (206)

$\min_{\bm{w}\in\mathcal{X}}\left\{f_{n}(\bm{w})\triangleq\mathbb{E}_{\bm{z}_{n}\sim p_{n}}\left[\ell(\bm{w},\bm{z}_{n})\right]\right\}\quad\forall n\geq 1$

$\|\bm{w}_{n}^{*}-\bm{w}_{n-1}^{*}\|\leq\rho,\quad\forall n\geq 2.$

$\mathbb{E}\left[f_{n}(\bm{w}_{n})\right]-f_{n}(\bm{w}_{n}^{*})\leq\epsilon$

$\sum_{n=2}^{T}\max_{\bm{w}\in\mathcal{X}}\|\nabla f_{n+1}(\bm{w})-\nabla f_{n}(\bm{w})\|_{2}^{2}\leq G_{b}.$

$\sum_{n=2}^{T}\|\bm{w}_{n+1}^{*}-\bm{w}_{n}^{*}\|_{2}^{2}\leq\frac{G_{b}}{m^{2}}.$

$\mathcal{F}_{i}\triangleq\sigma\left(\left\{\bm{z}_{j}(k):\;j=1,\ldots,i;\;k=1,\ldots,K_{j}\right\}\right)$

$f_{n}(\tilde{\bm{w}})\geq f_{n}(\bm{w})+\left\langle\nabla_{\bm{w}}f_{n}(\bm{w}),\tilde{\bm{w}}-\bm{w}\right\rangle+\frac{1}{2}m\|\tilde{\bm{w}}-\bm{w}\|^{2}.$

$\bm{w}_{n}\triangleq\mathcal{A}\left(\bm{w}_{n-1},\{\bm{z}_{n}(k)\}_{k=1}^{K_{n}}\right)$

$\|\bm{w}_{n-1}-\bm{w}_{n}^{*}\|^{2}\leq d_{0}^{2}\;\Rightarrow\;\mathbb{E}\left[f_{n}(\bm{w}_{n})\mid\mathcal{F}_{n-1}\right]-f_{n}(\bm{w}_{n}^{*})\leq b(d_{0},K_{n}).$

$\mathbb{E}\|\bm{w}_{n-1}-\bm{w}_{n}^{*}\|^{2}\leq\gamma^{2}\;\Rightarrow\;\mathbb{E}\left[f_{n}(\bm{w}_{n})\right]-f_{n}(\bm{w}_{n}^{*})\leq b(\gamma,\tilde{K}_{n}).$

$f_{i}(\bm{w}_{i})-f_{i}(\bm{w}_{i}^{*})\leq\epsilon_{i},\quad i=1,2$

$\mathbb{E}\|\bm{w}_{n-1}-\bm{w}_{n}^{*}\|^{2}$

$\epsilon_{n}$

$K_{1}=\min\left\{K\geq 1\;\middle|\;b\left(\text{diam}(\mathcal{X}),K\right)\leq\epsilon\right\}$

$K^{*}=\min\left\{K\geq 1\;\middle|\;b\left(\sqrt{\frac{2\epsilon}{m}}+\rho,K\right)\leq\epsilon\right\}$

$K^{*}=\min\left\{K\geq 1\;\middle|\;b\left(\sqrt{\frac{2\epsilon}{m}},K\right)\leq\epsilon\right\}$

$b(d_{0},K)\approx\frac{1}{K}+\frac{d_{0}^{2}}{K^{2}}$

$\|\bm{w}_{i}^{*}-\bm{w}_{i-1}^{*}\|=\rho$

$\|\bm{w}_{i}^{*}-\bm{w}_{i-1}^{*}\|\leq\|\bm{w}_{i}-\bm{w}_{i-1}\|+\frac{1}{m}\|\nabla_{\bm{w}}f_{i}(\bm{w}_{i})\|+\frac{1}{m}\|\nabla_{\bm{w}}f_{i-1}(\bm{w}_{i-1})\|$

$\tilde{\rho}_{i}$

$+\frac{1}{m}\Bigg\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\nabla_{\bm{w}}\ell(\bm{w}_{i-1},\bm{z}_{i-1}(k))\Bigg\|$

$\hat{\rho}_{n}=\frac{1}{n-1}\sum_{i=2}^{n}\tilde{\rho}_{i}$

$\mathbb{E}\left[\nabla_{\bm{w}}\ell_{n}(\bm{w},\bm{z}_{n})\right]=\nabla f_{n}(\bm{w}).$

$\mathbb{E}\left[\|\nabla_{\bm{w}}\ell_{n}(\bm{w},\bm{z}_{n})\|_{2}^{2}\right]\leq A+B\|\bm{w}-\bm{w}_{n}^{*}\|_{2}^{2}$

$\mathbb{E}\left[\|\bm{w}_{i}-\tilde{\bm{w}}_{i}\|^{2}\mid\mathcal{F}_{i-1}\right]\leq C_{i}^{2}(K_{i})$

$\mathbb{E}\left[\|\nabla_{\bm{w}}\ell_{i}(\bm{w},\bm{z}_{i})-\nabla f_{i}(\bm{w})\|^{2}\mid\mathcal{F}_{i-1}\right]\leq\sigma^{2}$

$\|\nabla_{\bm{w}}\ell_{n}(\bm{w},\bm{z})\|\leq G\quad\forall\bm{w}\in\mathcal{X},\;\bm{z}\in\mathcal{Z}$

$\sup_{\bm{w}\in\mathcal{X},\,\bm{z}\in\mathcal{Z}}\|\nabla_{\bm{w}}\ell_{n}(\bm{w},\bm{z})\|<\infty$


Full text

Craig Wilson, Yuheng Bu and Venugopal Veeravalli

University of Illinois at Urbana-Champaign

Research reported in the paper was supported by the NSF under award CCF 11-11342, and by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196, through the University of Illinois at Urbana-Champaign. Part of this work was presented in ICASSP 2016 [1] and Asilomar Conference 2016 [2]. Craig Wilson is now at Google.

{wilson60, bu3, vvv}@illinois.edu

Abstract

A framework previously introduced in [3] for solving a sequence of stochastic optimization problems with bounded changes in the minimizers is extended and applied to machine learning problems such as regression and classification. The stochastic optimization problems arising in these machine learning problems are solved using algorithms such as stochastic gradient descent (SGD). A method based on estimates of the change in the minimizers and properties of the optimization algorithm is introduced for adaptively selecting the number of samples at each time step to ensure that the excess risk, i.e., the expected gap between the loss achieved by the approximate minimizer produced by the optimization algorithm and the exact minimizer, does not exceed a target level. A bound is developed to show that the estimate of the change in the minimizers is non-trivial provided that the excess risk is small enough. Extensions relevant to the machine learning setting are considered, including a cost-based approach to select the number of samples with a cost budget over a fixed horizon, and an approach to applying cross-validation for model selection. Finally, experiments with synthetic and real data are used to validate the algorithms.

I Introduction

Consider solving a sequence of machine learning problems by minimizing the risk, i.e., expected value of a fixed loss function $\ell(\bm{w},\bm{z})$ at each time $n$ :

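$$\min_{\bm{w}\in\mathcal{X}}\left\{f_{n}(\bm{w})\triangleq\mathbb{E}_{\bm{z}_{n}\sim p_{n}}\left[\ell(\bm{w},\bm{z}_{n})\right]\right\}\quad\forall n\geq 1\tag{1}$$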

where $p_{n}$ denotes the underlying (unknown) probabilistic model for the data $\bm{z}_{n}$ at time $n$. For regression, $\bm{z}_{n}=\{\bm{x}_{n},y_{n}\}$ corresponds to the {predictors, response} pair at time $n$ and $\bm{w}$ parameterizes the regression model. For classification, $\bm{z}_{n}=\{\bm{x}_{n},y_{n}\}$ corresponds to the {features, label} pair at time $n$, and $\bm{w}$ parameterizes the classifier. Although motivated by regression and classification, our framework works for any loss function $\ell(\bm{w},\bm{z})$ that satisfies certain properties discussed in Section II-B.

We bound the change between consecutive problems by imposing a condition on the minimizers $\bm{w}_{n}^{*}$ of the functions $f_{n}(\bm{w})$. Specifically, we assume that the minimizers move at a bounded but unknown rate:

$$\|\bm{w}_{n}^{*}-\bm{w}_{n-1}^{*}\|\leq\rho,\quad\forall n\geq 2.\tag{2}$$

The value of $\rho$ is unknown to us.

Under this model, we find approximate minimizers $\bm{w}_{n}$ of each function $f_{n}(\bm{w})$ by drawing $K_{n}$ samples $\{\bm{z}_{n}(k)\}_{k=1}^{K_{n}}\overset{\text{iid}}{\sim}p_{n}$ at time $n$ . We do not make any assumptions about the particular optimization algorithm that may be used to find the approximate minimizers. As an example, we could use these samples in an optimization algorithm such as SGD. We evaluate the quality of our approximate minimizers $\bm{w}_{n}$ through an excess risk criterion $\epsilon$ , i.e.,

$$\mathbb{E}\left[f_{n}(\bm{w}_{n})\right]-f_{n}(\bm{w}_{n}^{*})\leq\epsilon,\tag{3}$$

which is a standard criterion for optimization and learning problems [4]. Our goal is to determine adaptively the number of samples $K_{n}$ required to achieve a desired excess risk $\epsilon$ for large enough $n$ with $\rho$ unknown. As $\rho$ is unknown, we will first construct an estimate of $\rho$ . Given an estimate of $\rho$ , we determine selection rules for the number of samples $K_{n}$ to achieve a target excess risk $\epsilon$ .

This paper is a continuation of the work initiated in [3]. We specialize the results in [3], which were given for general functions $f_{n}(\bm{w})$, to the specific form in (1), and provide new results that are specifically relevant to machine learning problems. We develop a bound to show that our estimate of $\rho$ is non-trivial provided that the excess risk is small enough. We also consider extensions relevant to the machine learning setting, including a cost-based approach to select the number of samples with a cost budget over a fixed horizon, and an approach to applying cross-validation for model selection. Some of the results in this paper have been reported in conference publications [1] and [2], which do not contain proofs of the key results due to space limitations. Moreover, we provide substantially more detailed numerical results and simulations in this paper than those given in [1] and [2].

I-A Related Work

Our problem has connections with multi-task learning (MTL) and transfer learning. In multi-task learning, one tries to learn several tasks simultaneously, as in [5], [6], and [7], by exploiting the relationships between the tasks. In transfer learning, knowledge from one source task is transferred to another target task either with or without additional training data for the target task [8]. For multi-task and transfer learning, there are theoretical guarantees on regret for some algorithms [9]. Multi-task learning could be applied to our problem by running an MTL algorithm each time a new task arrives, while remembering all prior tasks. However, this approach incurs a memory and computational burden. Transfer learning lacks the sequential nature of our problem.

We can also consider the concept drift problem in which we observe a stream of incoming data that potentially changes over time, and the goal is to predict some property of each piece of data as it arrives. After prediction, we incur a loss that is revealed to us. For example, we could observe a feature $\bm{x}_{n}$ and predict the label $y_{n}$ as in [10]. Some approaches for concept drift use iterative algorithms such as SGD, but without specific models on how the data changes. As a result, only simulation results showing good performance are available.

Another related problem is online optimization, where generally no knowledge is available about the incoming functions other than that all the functions come from a specified class of functions, e.g., linear or convex functions with uniformly bounded gradients. Online optimization models do not include the notion of a desired excess risk bound. Rather, only bounds on the regret over some time horizon have been investigated [11, 12, 13, 14, 15, 16, 17, 18, 19, 20], which is different from the per-time-step excess risk guarantee provided in our work.

There has been some work on controlling the variation of the sequence of functions $f_{n}(\bm{w})$ in (1) in [21] and [22]. The work in [22] is the most relevant: there, regret is minimized subject to a bound, say $G_{b}$, on the total variation of the gradients over a time interval $T$ of interest, i.e.,

$$\sum_{n=2}^{T}\max_{\bm{w}\in\mathcal{X}}\|\nabla f_{n+1}(\bm{w})-\nabla f_{n}(\bm{w})\|_{2}^{2}\leq G_{b}.\tag{4}$$

If the functions $\{f_{n}(\bm{w})\}$ are strongly convex with the same parameter $m$, then by the optimality conditions (see Theorem 2F.10 in [23]), (4) implies that

$$\sum_{n=2}^{T}\|\bm{w}_{n+1}^{*}-\bm{w}_{n}^{*}\|_{2}^{2}\leq\frac{G_{b}}{m^{2}}.\tag{5}$$

Thus, the work in [22] can be seen as studying the regret with a constraint on the total variation in the minimizers over $T$ time instants. In contrast, we control the variation of the minimizers at each time instant with (2) and then seek to maintain an excess risk criterion such as (3) at each time instant.

Another relevant model is sequential supervised learning (see [24]) in which we observe a stream of data consisting of feature/label pairs $(\bm{x}_{n},y_{n})$ at time $n$, with $\bm{x}_{n}$ being the feature vector and $y_{n}$ being the label. At time $n$, we want to predict $y_{n}$ given $\bm{x}_{n}$. One approach to this problem, studied in [25] and [26], is to look at $L$ consecutive pairs $\{(\bm{x}_{n-i},y_{n-i})\}_{i=1}^{L}$ and develop a predictor at time $n$ by applying a supervised learning algorithm to this training data. Another approach is to assume that there is an underlying hidden Markov model (HMM) governing the data [27]. The label $y_{n}$ represents the hidden state and the pair $(\bm{x}_{n},\overline{y}_{n})$ represents the observation, with $\overline{y}_{n}$ being a noisy version of $y_{n}$. HMM inference techniques are used to estimate $y_{n}$.

The adaptation that we discuss in the paper is similar in spirit to that in prior work in adaptive signal processing (see, e.g., [28, 29, 30]), but the techniques that we use are substantively different.

To summarize, none of the prior work discussed in this section involves choosing the number of samples $K_{n}$ at each time $n$ to control the excess risk. Most approaches instead focus on bounding the regret or provide no guarantees.

I-B Paper Outline

The rest of this paper is outlined as follows. In Section II, we specialize the work in [3] to the machine learning problem stated in (1). In Section II-B, we consider the problem of minimizing the sequence of functions in (1) with $\rho$ from (2) known. In Section II-D, we introduce a method to estimate $\rho$ . In Section II-E, we consider solving the sequence of learning problems in (1) with $\rho$ unknown. In Section III, we develop an upper bound on the size of the overshoot of our estimate of $\rho$ above the true value of $\rho$ . In Section IV, we consider a cost based approach to select the number of samples based on the analysis in Section II, and a cross-validation approach. Finally, in Section V, we apply our framework to a variety of machine learning problems on both synthetic and real data.

II Adaptive Sequential Optimization

We summarize our previous work in [3], and apply it to the machine learning problem stated in (1).

II-A Assumptions

We make several assumptions to proceed. First, let $\mathcal{X}$ be closed and convex with $\text{diam}(\mathcal{X})<+\infty$ . Define the $\sigma$ -algebra

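$$\mathcal{F}_{i}\triangleq\sigma\left(\left\{\bm{z}_{j}(k):\;j=1,\ldots,i;\;k=1,\ldots,K_{j}\right\}\right),$$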

which is the smallest $\sigma$ -algebra such that the random variables in the set $\left\{\bm{z}_{j}(k):\;j=1,\ldots,i;\;k=1,\ldots,K_{j}\right\}$ are measurable. By convention $\mathcal{F}_{0}$ is the trivial $\sigma$ -algebra.

We suppose that the following conditions hold:

A.1 For each $n$ , $f_{n}(\bm{w})$ is twice continuously differentiable with respect to $\bm{w}$ .

A.2 For each $n$ , $f_{n}(\bm{w})$ is strongly convex with a parameter $m>0$ , i.e.,

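$$f_{n}(\tilde{\bm{w}})\geq f_{n}(\bm{w})+\left\langle\nabla_{\bm{w}}f_{n}(\bm{w}),\tilde{\bm{w}}-\bm{w}\right\rangle+\frac{1}{2}m\|\tilde{\bm{w}}-\bm{w}\|^{2},$$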

where $\left\langle\bm{w},\tilde{\bm{w}}\right\rangle$ is the Euclidean inner product between $\bm{w}$ and $\tilde{\bm{w}}$ .

A.3 Given an optimization algorithm that generates an approximate minimizer

$$\bm{w}_{n}\triangleq\mathcal{A}\left(\bm{w}_{n-1},\{\bm{z}_{n}(k)\}_{k=1}^{K_{n}}\right)$$

using $K_{n}$ samples $\{\bm{z}_{n}(k)\}_{k=1}^{K_{n}}$ , there exists a function $b(d_{0},K_{n})$ such that the following conditions hold:

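1. If $K_{n}$ and $d_{0}$ are both $\mathcal{F}_{n-1}$-measurable random variables, it holds that

$$\|\bm{w}_{n-1}-\bm{w}_{n}^{*}\|^{2}\leq d_{0}^{2}\;\Rightarrow\;\mathbb{E}\left[f_{n}(\bm{w}_{n})\mid\mathcal{F}_{n-1}\right]-f_{n}(\bm{w}_{n}^{*})\leq b(d_{0},K_{n}).$$

2. If $\tilde{K}_{n}$ and $\gamma$ are constants, it holds that

$$\mathbb{E}\|\bm{w}_{n-1}-\bm{w}_{n}^{*}\|^{2}\leq\gamma^{2}\;\Rightarrow\;\mathbb{E}\left[f_{n}(\bm{w}_{n})\right]-f_{n}(\bm{w}_{n}^{*})\leq b(\gamma,\tilde{K}_{n}).$$

3. The bound $b(d_{0},K_{n})$ is non-decreasing in $d_{0}$ and non-increasing in $K_{n}$.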

A.4 Initial approximate minimizers $\bm{w}_{1}$ and $\bm{w}_{2}$ satisfy

$$f_{i}(\bm{w}_{i})-f_{i}(\bm{w}_{i}^{*})\leq\epsilon_{i},\quad i=1,2,$$

with $\epsilon_{1}$ and $\epsilon_{2}$ known.

Remarks: For assumption A.3, we assume that the bound $b(d_{0},K_{n})$ depends on the number of samples $K_{n}$ and not the number of iterations. For the basic version of SGD, the number of iterations generally equals $K_{n}$, as each sample is used to produce a noisy gradient. See Appendix A of [3] for a discussion of useful $b(d_{0},K_{n})$ bounds. For some bounds $b(d_{0},K)$, we may need to know parameters such as the strong convexity parameter. Estimating these parameters is discussed in Appendix C of [3]. Finally, for assumption A.4, we can fix $K_{i}$ and set $\epsilon_{i}=b(\text{diam}(\mathcal{X}),K_{i})$ for $i=1,2$.

II-B Change in Minimizers Known

Following [3], we examine the case when the change in minimizers, $\rho$ in (2), is known. Suppose that $\epsilon_{n-1}$ bounds the excess risk at time $n-1$ . Using the triangle inequality, strong convexity, Jensen’s inequality, and (2), we have

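$$\mathbb{E}\|\bm{w}_{n-1}-\bm{w}_{n}^{*}\|^{2}\leq\left(\sqrt{\frac{2\epsilon_{n-1}}{m}}+\rho\right)^{2}.\tag{9}$$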
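For readers who want the intermediate steps, the chain of inequalities can be sketched as follows. This is a reconstruction from the ingredients named in the text (triangle inequality, strong convexity with parameter $m$, Jensen's inequality, and (2)), under the assumption that $\epsilon_{n-1}$ bounds $\mathbb{E}[f_{n-1}(\bm{w}_{n-1})]-f_{n-1}(\bm{w}_{n-1}^{*})$; it is not copied from [3]. Here $\delta_{n-1}$ is shorthand introduced only for this sketch.

```latex
\begin{align*}
\delta_{n-1} &\triangleq \|\bm{w}_{n-1}-\bm{w}_{n-1}^{*}\| \\
\mathbb{E}\|\bm{w}_{n-1}-\bm{w}_{n}^{*}\|^{2}
  &\leq \mathbb{E}\left(\delta_{n-1}+\|\bm{w}_{n-1}^{*}-\bm{w}_{n}^{*}\|\right)^{2}
      && \text{(triangle inequality)} \\
  &\leq \mathbb{E}\,\delta_{n-1}^{2}+2\rho\,\mathbb{E}\,\delta_{n-1}+\rho^{2}
      && \text{(bounded change, (2))} \\
  &\leq \mathbb{E}\,\delta_{n-1}^{2}+2\rho\sqrt{\mathbb{E}\,\delta_{n-1}^{2}}+\rho^{2}
      && \text{(Jensen's inequality)} \\
  &\leq \frac{2\epsilon_{n-1}}{m}+2\rho\sqrt{\frac{2\epsilon_{n-1}}{m}}+\rho^{2}
      && \text{(strong convexity)} \\
  &= \left(\sqrt{\frac{2\epsilon_{n-1}}{m}}+\rho\right)^{2}.
\end{align*}
```

The strong convexity step uses $\frac{m}{2}\|\bm{w}-\bm{w}_{n-1}^{*}\|^{2}\leq f_{n-1}(\bm{w})-f_{n-1}(\bm{w}_{n-1}^{*})$, which follows from A.2 together with the first-order optimality of $\bm{w}_{n-1}^{*}$ over the convex set $\mathcal{X}$.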

Now, by using the bound $b(d_{0},K_{n})$ from assumption A.3, we set

$$\epsilon_{n}=b\left(\sqrt{\frac{2\epsilon_{n-1}}{m}}+\rho,K_{n}\right),$$

yielding a sequence of bounds on the excess risk. Note that this recursion only relies on the immediate past at time $n-1$ through $\epsilon_{n-1}$ . To achieve $\epsilon_{n}\leq\epsilon$ for all $n$ , we set

$$K_{1}=\min\left\{K\geq 1\;\middle|\;b\left(\text{diam}(\mathcal{X}),K\right)\leq\epsilon\right\}$$

and $K_{n}=K^{*}$ for $n\geq 2$ with

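$$K^{*}=\min\left\{K\geq 1\;\middle|\;b\left(\sqrt{\frac{2\epsilon}{m}}+\rho,K\right)\leq\epsilon\right\}.\tag{11}$$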

In comparison, if we did not exploit the fact that the change is bounded by $\rho$ , we would use the estimate $\text{diam}^{2}(\mathcal{X})$ to bound $\mathbb{E}\|\bm{w}_{n-1}-\bm{w}_{n}^{*}\|^{2}$ and select $K_{n}$ . If the bound in (9) is smaller than $\text{diam}^{2}(\mathcal{X})$ , then we would need significantly fewer samples $K_{n}$ to guarantee a desired excess risk.
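The selection rule for $K_{1}$ and $K^{*}$ can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: `sgd_like_bound` is a hypothetical stand-in for $b(d_{0},K)$ with made-up constants, and `min_samples`, `k_star`, and `k_first` are illustrative helper names. The search relies only on $b$ being non-increasing in $K$.

```python
import math


def sgd_like_bound(d0, K, c_asym=1.0, c_init=1.0):
    """Hypothetical bound b(d0, K) = c_asym / K + c_init * d0**2 / K**2.

    The constants are placeholders for illustration, not values from the paper.
    """
    return c_asym / K + c_init * d0 ** 2 / K ** 2


def min_samples(bound, d0, eps, k_max=10 ** 9):
    """Return the smallest K >= 1 with bound(d0, K) <= eps.

    Assumes bound is non-increasing in K, so doubling plus bisection is valid.
    """
    if bound(d0, 1) <= eps:
        return 1
    hi = 1
    while bound(d0, hi) > eps:  # doubling phase: find some feasible K
        hi *= 2
        if hi > k_max:
            raise ValueError("no K <= k_max meets the target excess risk")
    lo = hi // 2  # invariant: bound(d0, lo) > eps >= bound(d0, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if bound(d0, mid) <= eps:
            hi = mid
        else:
            lo = mid
    return hi


def k_star(bound, eps, m, rho):
    """K* = min{K >= 1 : b(sqrt(2 * eps / m) + rho, K) <= eps}."""
    return min_samples(bound, math.sqrt(2.0 * eps / m) + rho, eps)


def k_first(bound, eps, diam):
    """K_1 = min{K >= 1 : b(diam(X), K) <= eps}, used when no prior iterate helps."""
    return min_samples(bound, diam, eps)


if __name__ == "__main__":
    # Illustrative numbers only: eps = 0.01, m = 1, rho = 0.1, diam(X) = 10.
    print("K* :", k_star(sgd_like_bound, eps=0.01, m=1.0, rho=0.1))
    print("K_1:", k_first(sgd_like_bound, eps=0.01, diam=10.0))
```

With these illustrative numbers the distance fed into the bound drops from $\text{diam}(\mathcal{X})=10$ to roughly $0.24$, so the required sample count shrinks accordingly; how large the saving is in practice depends on the actual form of $b$.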

II-C $K^{*}$ May Be Too Large

In this section, we look at a case where $K^{*}$ can be too large. Suppose that $\rho=0$ , so the problems are not changing. In this case, we only need to take training samples at the first time instant and then we can stop taking samples, i.e., $K_{1}>0$ and $K_{n}=0$ for $n>1$ .

Suppose that $\epsilon_{1}\leq\epsilon$ and $\rho=0$ . In this case, from the analysis in the previous section, we pick

$$K^{*}=\min\left\{K\geq 1\;\middle|\;b\left(\sqrt{\frac{2\epsilon}{m}},K\right)\leq\epsilon\right\}.$$

For an algorithm like SGD, the bound $b(d_{0},K)$ is roughly of the form (see [3]):

$$b(d_{0},K)\approx\frac{1}{K}+\frac{d_{0}^{2}}{K^{2}}.$$

The first term captures the asymptotic behavior of SGD and the second term accounts for the initial distance $d_{0}$ . This form of $b(d_{0},K)$ implies that $K^{*}>0$ . However, by picking $K_{n}=0$ for all $n\geq 2$ , we could achieve $\epsilon_{n}=\epsilon_{1}\leq\epsilon$ for all $n\geq 2$ .

This shows that the choice of $K^{*}$ is conservative and can be too large if the initial distance $d_{0}=0$ . As a general rule, the choice of $K^{*}$ is useful if the term that depends on the initial distance, $d_{0}^{2}/K^{2}$ , is comparable to the asymptotic term, $1/K$ , in the $b(d_{0},K)$ bound.
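A small numerical check makes the conservativeness visible. The sketch below assumes the rough shape $b(d_{0},K)\approx 1/K+d_{0}^{2}/K^{2}$ with both constants set to one, and uses illustrative values $\epsilon=0.01$ and $m=1$; `rough_bound` and `k_star_no_change` are hypothetical names.

```python
import math


def rough_bound(d0, K):
    """Rough SGD-style shape b(d0, K) ~ 1/K + d0**2 / K**2 (constants set to 1)."""
    return 1.0 / K + d0 ** 2 / K ** 2


def k_star_no_change(eps, m):
    """K* for rho = 0, found by a plain linear search over K = 1, 2, ..."""
    d0 = math.sqrt(2.0 * eps / m)
    K = 1
    while rough_bound(d0, K) > eps:
        K += 1
    return K, d0


K, d0 = k_star_no_change(eps=0.01, m=1.0)
asymptotic_term = 1.0 / K          # part of the bound that ignores the warm start
initial_term = d0 ** 2 / K ** 2    # part of the bound that depends on d0
print(K, asymptotic_term, initial_term)
```

Under these assumptions the rule still asks for on the order of $1/\epsilon$ samples at every step, and almost all of the bound comes from the asymptotic $1/K$ term rather than from the initial distance, even though taking no further samples would already keep the excess risk at $\epsilon_{1}\leq\epsilon$.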

II-D Estimating the Change in the Minimizers

In practice, we do not know $\rho$, so we must construct an estimate $\hat{\rho}_{n}$ using the samples $\{\bm{z}_{n}(k)\}_{k=1}^{K_{n}}$ from each distribution $p_{n}$. We introduce an approach to estimate the one time step change, $\|\bm{w}_{i}^{*}-\bm{w}_{i-1}^{*}\|$, and methods to combine these estimates to produce an overall estimate of $\rho$. First, we work with the assumption that

$$\|\bm{w}_{i}^{*}-\bm{w}_{i-1}^{*}\|=\rho\tag{12}$$

as an intermediate step, and second, under assumption (2). These estimates are from [3]. For appropriately chosen sequences $\{t_{n}\}$ and for all $n$ large enough, we have $\hat{\rho}_{n}+t_{n}\geq\rho$ almost surely. With this property, analysis similar to that in Section II-B holds, which is provided in Section II-E.

II-D1 Estimating One Step Change

First, we develop an estimate $\tilde{\rho}_{i}$ of the one step changes $\|\bm{w}_{i}^{*}-\bm{w}_{i-1}^{*}\|$ using a method from [3]. Implicitly, we assume that all one step estimates are bounded by $\text{diam}(\mathcal{X})$ , since trivially $\|\bm{w}_{n}^{*}-\bm{w}_{n-1}^{*}\|\leq\text{diam}(\mathcal{X})$ .

Using the triangle inequality and variational inequalities from [23] yields

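$$\|\bm{w}_{i}^{*}-\bm{w}_{i-1}^{*}\|\leq\|\bm{w}_{i}-\bm{w}_{i-1}\|+\frac{1}{m}\|\nabla_{\bm{w}}f_{i}(\bm{w}_{i})\|+\frac{1}{m}\|\nabla_{\bm{w}}f_{i-1}(\bm{w}_{i-1})\|.$$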

We then approximate $\|\nabla_{\bm{w}}f_{i}(\bm{w}_{i})\|=\|\mathbb{E}_{\bm{z}_{i}\sim p_{i}}\left[\nabla_{\bm{w}}\ell(\bm{w}_{i},\bm{z}_{i})\right]\|$ by a sample average approximation to yield the following estimate called the direct estimate:

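$$\tilde{\rho}_{i}\triangleq\|\bm{w}_{i}-\bm{w}_{i-1}\|+\frac{1}{m}\Bigg\|\frac{1}{K_{i}}\sum_{k=1}^{K_{i}}\nabla_{\bm{w}}\ell(\bm{w}_{i},\bm{z}_{i}(k))\Bigg\|+\frac{1}{m}\Bigg\|\frac{1}{K_{i-1}}\sum_{k=1}^{K_{i-1}}\nabla_{\bm{w}}\ell(\bm{w}_{i-1},\bm{z}_{i-1}(k))\Bigg\|\tag{13}$$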
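As a rough illustration of the direct estimate, the sketch below evaluates it for a toy quadratic loss $\ell(\bm{w},\bm{z})=\frac{1}{2}\|\bm{w}-\bm{z}\|^{2}$, for which $m=1$ and the minimizer is the mean of $\bm{z}$. The loss, the sample values, and all function names are assumptions chosen for the example; the estimate itself is taken to be the iterate displacement plus the two sample-average gradient norms scaled by $1/m$.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def mean_vector(vectors):
    """Coordinate-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(col) / count for col in zip(*vectors)]


def direct_estimate(w_prev, w_curr, grads_prev, grads_curr, m):
    """One-step estimate: ||w_i - w_{i-1}|| plus (1/m) times each averaged gradient norm.

    grads_prev holds per-sample gradients at w_prev using time i-1 samples;
    grads_curr holds per-sample gradients at w_curr using time i samples.
    """
    step = norm([c - p for c, p in zip(w_curr, w_prev)])
    return step + norm(mean_vector(grads_curr)) / m + norm(mean_vector(grads_prev)) / m


def quadratic_gradients(w, samples):
    """Per-sample gradients of the toy loss 0.5 * ||w - z||^2, i.e. w - z."""
    return [[wi - zi for wi, zi in zip(w, z)] for z in samples]


if __name__ == "__main__":
    # Toy data: sample means are (0, 0) and (0.5, 0), so the true change is 0.5.
    z_prev = [[-1.0, 0.0], [1.0, 0.0]]
    z_curr = [[-0.5, 0.0], [1.5, 0.0]]
    w_prev, w_curr = [0.1, 0.0], [0.4, 0.0]
    estimate = direct_estimate(
        w_prev, w_curr,
        quadratic_gradients(w_prev, z_prev),
        quadratic_gradients(w_curr, z_curr),
        m=1.0,
    )
    print(estimate)
```

In this hand-built example the approximate minimizers lie on the segment between the two true minimizers, so the estimate coincides with the true change; in general it can only be expected to be an upper-biased proxy.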

II-D2 Combining One Step Estimates For Constant Change

Assuming that $\|\bm{w}_{i}^{*}-\bm{w}_{i-1}^{*}\|=\rho$ from (12), we average the one step estimates $\tilde{\rho}_{i}$ to yield an overall estimate

$$\hat{\rho}_{n}=\frac{1}{n-1}\sum_{i=2}^{n}\tilde{\rho}_{i}.$$
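The averaged estimate can be maintained incrementally, without storing past one-step estimates. The class below is a minimal sketch under two assumptions that are not spelled out in the source: one-step estimates are clipped to $[0,\text{diam}(\mathcal{X})]$ (one reading of the remark that they are implicitly bounded by the diameter), and the name `RunningChangeEstimate` is hypothetical.

```python
class RunningChangeEstimate:
    """Running average of one-step change estimates, clipped to [0, diam]."""

    def __init__(self, diam):
        self.diam = diam      # diameter of the feasible set X
        self.count = 0        # number of one-step estimates seen so far
        self.value = 0.0      # current average, i.e. the overall estimate

    def update(self, one_step_estimate):
        """Fold in a new one-step estimate and return the updated average."""
        clipped = min(max(one_step_estimate, 0.0), self.diam)
        self.count += 1
        self.value += (clipped - self.value) / self.count
        return self.value


if __name__ == "__main__":
    estimator = RunningChangeEstimate(diam=2.0)
    for rho_tilde in [0.5, 1.5, 3.0]:   # the last value is clipped to 2.0
        print(estimator.update(rho_tilde))
```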

To proceed with our analysis, suppose that the following conditions hold:

B.1

For each $n$, we can draw stochastic gradients $\nabla_{\bm{w}}\ell_{n}\left(\bm{w},\bm{z}_{n}\right)$ such that

$$\mathbb{E}\left[\nabla_{\bm{w}}\ell_{n}(\bm{w},\bm{z}_{n})\right]=\nabla f_{n}(\bm{w})$$

holds.

B.2

There exist constants $A,B\geq 0$ such that

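$$\mathbb{E}\left[\|\nabla_{\bm{w}}\ell_{n}(\bm{w},\bm{z}_{n})\|_{2}^{2}\right]\leq A+B\|\bm{w}-\bm{w}_{n}^{*}\|_{2}^{2}.$$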

B.3

There exist constants $C_{i}(K_{i})$ such that

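$$\mathbb{E}\left[\|\bm{w}_{i}-\tilde{\bm{w}}_{i}\|^{2}\mid\mathcal{F}_{i-1}\right]\leq C_{i}^{2}(K_{i}).$$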

B.4

It holds that

[TABLE]

and

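$$\mathbb{E}\left[\|\nabla_{\bm{w}}\ell_{i}(\bm{w},\bm{z}_{i})-\nabla f_{i}(\bm{w})\|^{2}\mid\mathcal{F}_{i-1}\right]\leq\sigma^{2}.$$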

B.5

The gradients are bounded in the sense that

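$$\|\nabla_{\bm{w}}\ell_{n}(\bm{w},\bm{z})\|\leq G\quad\forall\bm{w}\in\mathcal{X},\;\bm{z}\in\mathcal{Z}.$$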

Assumption B.1 guarantees that the gradients are unbiased. Assumption B.2 controls how fast the gradients grow as we move away from the minimizer $\bm{w}_{n}^{*}$. Assumption B.3 controls how far apart two independent outputs of the optimization algorithm $\bm{w}_{i}$ and $\tilde{\bm{w}}_{i}$ are, starting from $\bm{w}_{i-1}$. Assumption B.4 controls how the gradient grows for two pairs $(\bm{w},\bm{z}_{i})$ and $(\tilde{\bm{w}},\bm{z}_{i})$. Finally, assumption B.5 is reasonable if the space $\mathcal{Z}$ that contains the $\bm{z}_{n}$ has finite diameter and the gradients of the loss function are continuous jointly in $(\bm{w},\bm{z})$. In this case, it holds that

$$\sup_{\bm{w}\in\mathcal{X},\,\bm{z}\in\mathcal{Z}}\|\nabla_{\bm{w}}\ell_{n}(\bm{w},\bm{z})\|<\infty.$$

Theorem 1 from [3] guarantees that the direct estimate from (13) bounds $\rho$ .

Theorem 1.

Provided that B.3-B.5 hold and our sequence $\{t_{n}\}$ (note that a choice of $t_{n}$ that is no greater than $1/\sqrt{n-1}$ works here) satisfies

[TABLE]

it holds that for all $n$ large enough

[TABLE]

almost surely, with $D_{n}$ defined in (17).

Proof.

See [3]. ∎

II-D3 Combining One Step Estimates For Bounded Change

We now look at estimating $\rho$ in the case that $\|\bm{w}_{n}^{*}-\bm{w}_{n-1}^{*}\|\leq\rho$ . We set

[TABLE]

Although it may seem natural to combine the estimates using

[TABLE]

this method has a serious drawback. Since $\{\tilde{\rho}_{i}\}$ are random variables, if we combine them by taking their maximum, any particular one step estimate $\tilde{\rho}_{i}$ that is large will pull up the overall estimate $\hat{\rho}_{n}$ . This would drive $\hat{\rho}_{n}\to\textrm{diam}(\mathcal{X})$ , as $n\to\infty$ , resulting in a $\hat{\rho}_{n}$ that is trivial in the limit of large $n$ .
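The drawback of the running maximum can be seen in a toy simulation. The sketch below is only an illustration under a simplifying assumption: each one-step estimate is modeled as the true $\rho$ plus zero-mean Gaussian noise, clipped to $[0,\text{diam}(\mathcal{X})]$, which is not the noise model of the paper. The function name and parameter values are hypothetical.

```python
import random


def compare_combiners(n_steps, rho, noise_std, diam, seed=0):
    """Track the running maximum and running mean of noisy one-step estimates.

    Returns two lists of length n_steps: the max-combined and the
    mean-combined overall estimates after each step.
    """
    rng = random.Random(seed)
    running_max = 0.0
    running_sum = 0.0
    max_path, mean_path = [], []
    for i in range(1, n_steps + 1):
        raw = rho + rng.gauss(0.0, noise_std)   # toy noisy one-step estimate
        estimate = min(max(raw, 0.0), diam)     # keep it inside [0, diam]
        running_max = max(running_max, estimate)
        running_sum += estimate
        max_path.append(running_max)
        mean_path.append(running_sum / i)
    return max_path, mean_path


if __name__ == "__main__":
    max_path, mean_path = compare_combiners(
        n_steps=5000, rho=1.0, noise_std=0.5, diam=10.0
    )
    for n in (10, 100, 1000, 5000):
        print(n, round(max_path[n - 1], 3), round(mean_path[n - 1], 3))
```

The running maximum can only move upward, so a single large draw is never forgotten, whereas the average settles near the true value in this toy model.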

We introduce an estimate from [3] that overcomes this defect. We need the following assumptions:

B.4

We have estimates $\hat{h}_{W}:\mathbb{R}^{W}\to\mathbb{R}$ that are non-decreasing in their arguments such that

[TABLE]

B.5

There exist absolute constants $\{b_{i}\}_{i=1}^{W}$ for any fixed $W$ such that $\forall\bm{p},\bm{q}\in\mathbb{R}^{W}_{\geq 0}$

[TABLE]

For example, if $\rho_{i}\overset{\text{iid}}{\sim}\text{Unif}[0,\rho]$ , then

[TABLE]

is an estimator of $\rho$ with the required properties. Also, note that the two conditions on the estimator in B.5 imply that

[TABLE]

Given an estimator satisfying assumption B.5, we compute

[TABLE]

and set

[TABLE]

Under assumptions B.3-B.5, we can then show that

[TABLE]

eventually upper bounds $\rho$ , as stated in the following theorem.

Theorem 2.

Provided that B.3-B.5 hold and our sequence $\{t_{n}\}$ satisfies

[TABLE]

it holds that for all $n$ large enough

[TABLE]

with $D_{n}$ from Theorem 1.

Proof.

See [3]. ∎

II-E Change in Minimizers Unknown

We now present an extension of the results in Section II-B, obtained by replacing $\rho$ with its estimate given in Section II-D. Our analysis depends on the following crucial assumption:

C.1

For appropriate sequences $\{t_{n}\}$ , for all $n$ sufficiently large it holds that $\hat{\rho}_{n}+t_{n}\geq\rho$ almost surely.

C.2

$b(d_{0},K_{n})$ factors as $b(d_{0},K_{n})=\alpha(K_{n})d_{0}^{2}+\beta(K_{n})$

We have demonstrated that assumption C.1 holds for the direct estimate of $\rho$ under (12) and (2). Note that whether we assume (12) or (2) does not matter for analysis. We start with a general result showing that for appropriate choices of $K_{n}$ , we control the excess risk.

Theorem 3.

Under assumptions C.1-C.2, with $K_{n}\geq K^{*}$ for all $n$ large enough, where $K^{*}$ is defined in (11), we have

[TABLE]

almost surely.

Proof.

See [3]. ∎

This theorem shows that for any choice of samples $K_{n}$ such that $K_{n}\geq K^{*}$ for $n$ large enough, it follows that the excess risk can be controlled in the sense of (22).

II-E1 Update Past Excess Risk Bounds

We first consider updating all past excess risk bounds as we go. At time $n$, we plug in $\hat{\rho}_{n-1}+t_{n-1}$ in place of $\rho$ and follow the analysis of Section II-B. Define for $i=1,\ldots,n$

[TABLE]

If it holds that $\hat{\rho}_{n-1}+t_{n-1}\geq\rho$ , then ${\mathbb{E}\left[f_{n}(\bm{w}_{n})\right]-f_{n}(\bm{w}_{n}^{*})\leq\hat{\epsilon}_{n}^{(i)}}$ for ${i=1,\ldots,n}$ . Assumption C.1 guarantees that this holds for all $n$ large enough almost surely. We can thus set $K_{n}$ equal to the smallest $K$ such that

[TABLE]

for all $n\geq 3$ to achieve excess risk $\epsilon$ . The maximum in this definition ensures that when $\hat{\rho}_{n-1}+t_{n-1}\geq\rho$ , $K_{n}\geq K^{*}$ with $K^{*}$ from (11). We can therefore apply Theorem 3.

II-E2 Do Not Update Past Excess Risk Bounds

Updating all past estimates of the excess risk bounds from time $1$ up to $n$ imposes a computational and memory burden. Suppose that for all $n\geq 3$ we set

[TABLE]

This is the same form as the choice in (11) with $\hat{\rho}_{n-1}+t_{n-1}$ in place of $\rho$. Due to assumption C.1, for all $n$ large enough it holds that $\hat{\rho}_{n}+t_{n}\geq\rho$ almost surely. Then by the monotonicity condition in assumption A.3, for all $n$ large enough we pick $K_{n}\geq K^{*}$ almost surely. We can therefore apply Theorem 3.
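The plug-in rule can be sketched as a short loop. The code below assumes a hypothetical bound $b(d_{0},K)=1/K+d_{0}^{2}/K^{2}$ and takes the sequences of estimates $\hat{\rho}_{n-1}$ and slacks $t_{n-1}$ as given inputs; the function names and numerical values are illustrative only.

```python
import math


def bound(d0, K):
    """Hypothetical b(d0, K): non-decreasing in d0 and non-increasing in K."""
    return 1.0 / K + d0 ** 2 / K ** 2


def select_samples(rho_plug_in, eps, m, k_max=10 ** 6):
    """Smallest K with b(sqrt(2 * eps / m) + rho_plug_in, K) <= eps."""
    d0 = math.sqrt(2.0 * eps / m) + rho_plug_in
    for K in range(1, k_max + 1):
        if bound(d0, K) <= eps:
            return K
    raise ValueError("no K <= k_max meets the target excess risk")


def sample_schedule(rho_hats, slacks, eps, m):
    """Sample sizes obtained by plugging rho_hat + t into the selection rule.

    rho_hats[j] and slacks[j] play the roles of rho_hat_{n-1} and t_{n-1}
    for consecutive time steps n.
    """
    return [select_samples(r + t, eps, m) for r, t in zip(rho_hats, slacks)]


if __name__ == "__main__":
    eps, m, true_rho = 0.01, 1.0, 0.1
    print("K* with rho known:", select_samples(true_rho, eps, m))
    # Illustrative estimates that start loose and tighten toward the true value.
    print(sample_schedule([2.0, 0.5, 0.1], [0.5, 0.2, 0.0], eps, m))
```

Because the bound is non-decreasing in its first argument, any plug-in value that is at least $\rho$ yields a sample size of at least $K^{*}$, which is the property the argument above relies on.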

III Bound on $\rho$ -Estimate Overshoot

Since we assume that the solution space $\mathcal{X}$ has bounded diameter, we always have the trivial bound

[TABLE]

An estimate of the change in minimizers, $\hat{\rho}_{n}$, is only interesting if the bound is non-trivial, i.e., $\hat{\rho}_{n}<\textrm{diam}(\mathcal{X})$ when $\rho<\textrm{diam}(\mathcal{X})$. In prior work [3], we proved that for sufficiently large $n$, $\hat{\rho}_{n}+t_{n}\geq\rho$ almost surely. In this section, we prove an upper bound on how much $\hat{\rho}_{n}$ can overshoot $\rho$ to show that this estimate is non-trivial.

When we proved that $\hat{\rho}_{n}$ eventually upper bounds $\rho$ , we did not use the fact that the points $\bm{w}_{n}$ at which we are evaluating the one-step estimates are approximate minimizers. In particular, that proof would still hold even if we selected the $\bm{w}_{n}$ randomly from the solution space $\mathcal{X}$ without using the samples $\{\bm{z}_{n}(k)\}_{k=1}^{K_{n}}$ at all. In contrast, controlling the overshoot depends critically on the fact that the points at which we evaluate the one-step estimates are approximate minimizers. The solution quality of the approximate minimizers measured by $\epsilon$ in (3) will control the size of the overshoot, as seen in the following theorem.

Theorem 4.

Suppose that the following conditions hold:

1. The sequence of excess risks achieved, $\epsilon_{i}$, $i=1,2,\ldots$, satisfies

[TABLE]

2. The loss function $f_{n}(\bm{w})$ has Lipschitz continuous gradients with parameter $M$, i.e.,

[TABLE]

3. For all $i$ large enough, we have that $K_{i}\geq\tilde{K}$ for a constant $\tilde{K}$.

Then it follows that

[TABLE]

where

[TABLE]

Proof.

First, we look at the one step estimates. It holds that

[TABLE]

By the Lipschitz gradient assumption, we have

[TABLE]

Then it follows by strong convexity that

[TABLE]

and therefore we have

[TABLE]

Since the square-root is concave, by Jensen’s inequality, we have

[TABLE]

This in turn implies that

[TABLE]

Next, we look at bounding $\mathbb{E}\|\nabla f_{i}(\bm{w}_{i})-\hat{G}_{i}\|$ . Define

[TABLE]

Then we have

[TABLE]

Using the direct estimate lower bound analysis from [3] it follows that

[TABLE]

This shows that

[TABLE]

Then plugging in the definition of $\hat{\rho}_{n}$ it follows that

[TABLE]

∎

This shows that the direct estimate is a non-trivial upper bound for sufficiently small $\epsilon$. In practice, $\tilde{K}$ will be a function of $\epsilon$, since we can pick $\tilde{K}=K^{*}$ with $K^{*}$ defined in (11), and $K^{*}$ is itself a function of $\epsilon$. Consequently, the $G$ term in (28), which is a function of $\tilde{K}$, is also a function of $\epsilon$. Thus the entire overshoot term is a function of $\epsilon$, and by inspection it goes to zero as $\epsilon\to 0$ if $K^{*}\to\infty$ as $\epsilon\to 0$ (as $K^{*}$ defined in (11) does).

IV Extensions Relevant to Machine Learning Applications

IV-A Cost Approach

A natural way to assess the usefulness of our approach is to choose a number of samples $\{K_{n}\}_{n=1}^{T}$ over a horizon of length $T$ using the choice in (24) and (25), and compare against taking

[TABLE]

samples at time $n=1$ and no samples at the other $T-1$ time instants. See Section V for such a comparison.

In this section, we consider a different type of comparison based on assuming that there is a cost $p(K_{n})$ of taking $K_{n}$ samples. For example, we could have

[TABLE]

This implies we pay a fixed cost of $P_{0}$ any time we take at least one sample and a marginal cost of $P_{1}$ per sample. We want to control the excess risk by deciding when to take samples, and how many samples to take with a total budget $P$ over a horizon of length $T$ , i.e.,

[TABLE]

For the option of taking all samples up front:

[TABLE]

Another option is to sample every $\Delta T$ time instants and divide the cost budget evenly over the times that we take samples using

[TABLE]
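The two baseline schedules can be sketched in code. The cost model follows the description above (a fixed cost $P_{0}$ whenever at least one sample is taken, plus $P_{1}$ per sample). The exact formula for the evenly divided schedule is not recoverable from the text here, so the rounding and the choice of sampling times $1,1+\Delta T,\ldots$ in `periodic_schedule` are assumptions, as are all function names and numbers.

```python
def sampling_cost(K, P0, P1):
    """Cost of taking K samples: nothing if K == 0, else fixed plus marginal cost."""
    return 0.0 if K == 0 else P0 + P1 * K


def upfront_samples(P, P0, P1):
    """Largest K whose cost fits in the whole budget P (all samples at time 1)."""
    return max(0, int((P - P0) // P1))


def periodic_schedule(T, delta_T, P, P0, P1):
    """Sample every delta_T steps and split the budget evenly (assumed rounding).

    Returns a list of length T with the number of samples at times 1..T.
    """
    sampling_times = set(range(1, T + 1, delta_T))
    per_time_budget = P / len(sampling_times)
    K = max(0, int((per_time_budget - P0) // P1))
    return [K if n in sampling_times else 0 for n in range(1, T + 1)]


if __name__ == "__main__":
    T, P, P0, P1 = 10, 100.0, 10.0, 1.0
    schedule = periodic_schedule(T, delta_T=5, P=P, P0=P0, P1=P1)
    total = sum(sampling_cost(K, P0, P1) for K in schedule)
    print("up front:", upfront_samples(P, P0, P1))
    print("periodic:", schedule, "total cost:", total)
```

The fixed cost is paid once per sampling time, so spreading the budget over more time instants leaves fewer samples overall; this trade-off is what the cost-based planning problem has to balance.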

For analysis, we need Assumption C.1 and the following additional assumptions:

D.1

There exists a function $e(\|\bm{w}-\bm{w}_{n}^{*}\|_{2}^{2})$ such that

[TABLE]

For example, suppose that the functions $f_{n}(\bm{w})$ have Lipschitz continuous gradients with modulus $M$ and $\bm{w}_{n}^{*}\in\text{int}(\mathcal{X})$ for all $n\geq 1$ , where $\text{int}(\mathcal{X})$ is the interior of $\mathcal{X}$ . By the descent lemma [31], we have

[TABLE]

Thus, we can set

[TABLE]

Since we need to consider the possibility that $K_{n}=0$ for some $n$ in $\{1,\ldots,T\}$ but still provide estimates of the excess risk, we need an alternate version of the bound in (23). Define

[TABLE]

where $t_{s}(n)$ is the last time no later than $n$ at which samples were taken. If no samples have been taken so far, then by convention $t_{s}(n)=+\infty$ . We construct the recursively defined function $\tilde{b}_{n}(\rho,K_{n})$ by considering the following four cases:

1. No samples have been taken by time $n$:

[TABLE]

2. Samples taken at time $n$ for the first time:

[TABLE]

3. No samples taken at time $n$ but samples have been taken previously:

[TABLE]

4. Samples taken at time $n$ and samples have been taken previously:

[TABLE]

where $\tilde{b}_{t_{s}(n-1)}$ is the bound on the excess risk at time $t_{s}(n-1)$ .
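The bookkeeping behind the four cases can be sketched as below: computing $t_{s}(n)$ for a given schedule and deciding which case applies at each time. The function names and the example schedule are hypothetical; the bound $\tilde{b}_{n}$ itself is not implemented here.

```python
import math

def last_sample_time(schedule, n):
    """t_s(n): last time m <= n (1-based) with schedule[m-1] > 0, else +inf."""
    for m in range(n, 0, -1):
        if schedule[m - 1] > 0:
            return m
    return math.inf

def bound_case(schedule, n):
    """Return which of the four cases (1 to 4) applies at time n."""
    sampled_now = schedule[n - 1] > 0
    sampled_before = n > 1 and last_sample_time(schedule, n - 1) != math.inf
    if not sampled_now and not sampled_before:
        return 1  # no samples taken by time n
    if sampled_now and not sampled_before:
        return 2  # samples taken at time n for the first time
    if not sampled_now and sampled_before:
        return 3  # no samples now, but samples taken previously
    return 4      # samples now and previously

# Hypothetical schedule K_1, ..., K_5.
schedule = [0, 10, 0, 0, 7]
t_s = [last_sample_time(schedule, n) for n in range(1, 6)]
cases = [bound_case(schedule, n) for n in range(1, 6)]
```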

Suppose that over a time horizon of length $T$ we have a total cost budget $P$ with respect to the number of samples $\{K_{n}\}_{n=1}^{T}$ as in (30). Define the excess risk gaps

[TABLE]

with $(x)_{+}=\max\{x,0\}$ . The variable $\xi_{n}$ is the extent to which the target excess risk of $\epsilon$ is violated upwards. If our excess risk is below our target level $\epsilon$ , then we set $\xi_{n}=0$ . Our goal is to minimize the size of the $\xi_{n}$ , while taking into account the cost constraint in (30). To control the size of $\xi_{n}$ , suppose that we have a function $\phi:\mathbb{R}^{T}\to\mathbb{R}$ that describes the cumulative loss of the excess risk gaps $\xi_{1},\ldots,\xi_{T}$ .

We now provide some possible choices for $\phi(\xi_{1},\ldots,\xi_{T})$ :

[TABLE]

with

[TABLE]

The choices given in (33) and (34) penalize the average and maximum excess risk gaps, respectively. In practice, with these choices, we will stop taking samples before the horizon $T$ resulting in relatively poor performance towards the end of the horizon. The third choice gets around this problem by penalizing large increasing runs of excess risk gaps, and tends to favor a more uniform choice of the number of samples $K_{n}$ .
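The first two penalties can be illustrated with a short sketch. It assumes that $\xi_{n}$ is the positive part of the excess risk bound at time $n$ minus $\epsilon$; the function names and the numerical values are hypothetical, and the third (run-based) penalty is not reproduced.

```python
def excess_risk_gaps(risk_bounds, eps):
    """xi_n = (bound_n - eps)_+ for each time n (assumed form of the gaps)."""
    return [max(b - eps, 0.0) for b in risk_bounds]

def phi_average(gaps):
    """Penalize the average excess risk gap."""
    return sum(gaps) / len(gaps)

def phi_max(gaps):
    """Penalize the maximum excess risk gap."""
    return max(gaps)

# Hypothetical excess risk bounds over a horizon of length 4.
gaps = excess_risk_gaps([0.5, 1.0, 2.0, 3.0], eps=1.0)
avg_penalty = phi_average(gaps)
max_penalty = phi_max(gaps)
```

In this example the gaps grow toward the end of the horizon, which is the pattern the text attributes to schedules that stop sampling early.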

We first consider the case when $\rho$ is known to us and plan over the horizon of length $T$ by solving the following optimization problem:

[TABLE]

The idea of this problem is to satisfy the excess risk bound $\epsilon$ with minimal violation $\phi(\xi_{1},\ldots,\xi_{T})$ .

To estimate $\rho$ , we need samples from consecutive time instants. Therefore, we impose the constraint that if we take samples at time $n$ , then we must take samples at either time $n-1$ or time $n+1$ through the constraint

[TABLE]

The problem in (36) is a mixed integer non-linear programming problem (MINLP). There are no general methods to efficiently solve this MINLP, and we therefore consider a relaxation of this problem later.

In the case that we know $\rho$ , we can plan the number of samples ahead of time before any samples have been taken. When $\rho$ is unknown, we cannot plan over the entire horizon. Instead, at each time instant $m$ we have to plan over the remaining time horizon of length $T-m+1$ , while using the estimate $\hat{\rho}_{m-1}+t_{m-1}$ in place of $\rho$ and the remaining cost budget

[TABLE]

We then consider the cost-to-go problem

[TABLE]

This is the same form as (36), except that it is over the time horizon from $n=m,\ldots,T$ taking into account the portion of the cost budget that has been expended. In this problem, we only optimize over $K_{m},\ldots,K_{T}$ . This problem is again a MINLP.

Next, we look at approximate solutions to (36) and (37). The major difficulties in solving these programs are that the decision variables $\{K_{n}\}_{n=1}^{T}$ are integer-valued and the cost function $p(K)$ may be discontinuous at zero due to fixed costs. We consider relaxing $K_{n}$ to be real-valued and introduce a piecewise approximation $\hat{p}(K)$ of the cost functions $p(K)$ :

[TABLE]

Generally, we pick $0<K_{0}<1$ . We consider the relaxed program

[TABLE]

We also relax the indicator constraints to inequalities to encourage taking samples at consecutive times. In practice, this forces more gradual changes in the number of samples $K_{n}$ and makes these problems easier to solve. The relaxed problem can be readily solved by gradient-based solvers such as IPOPT [32].
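To make the role of $K_{0}$ concrete, here is one possible continuous surrogate for a fixed-plus-marginal cost. The specific form (a linear ramp on $[0,K_{0}]$ that meets the true cost at $K_{0}$) is an assumption chosen for illustration and may differ from the paper's $\hat{p}(K)$.

```python
def relaxed_cost(k, p0, p1, k0=0.5):
    """Continuous surrogate for the cost with a jump of p0 at zero.

    Assumed form: linear from 0 at k = 0 up to p0 + p1 * k0 at k = k0,
    and equal to the original cost p0 + p1 * k for k >= k0.
    """
    if k <= 0:
        return 0.0
    if k < k0:
        return (p0 + p1 * k0) * (k / k0)
    return p0 + p1 * k

# Hypothetical cost parameters.
at_zero = relaxed_cost(0.0, 5.0, 0.5)
mid_ramp = relaxed_cost(0.25, 5.0, 0.5)
at_k0 = relaxed_cost(0.5, 5.0, 0.5)
beyond = relaxed_cost(10.0, 5.0, 0.5)
```

A surrogate of this kind removes the discontinuity at zero, so a real-valued $K_{n}$ can be handled by a gradient-based solver, while agreeing with the original cost for every integer $K_{n}\geq 1$ when $0<K_{0}<1$.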

When $\rho$ is unknown, we can repeatedly solve this problem using the latest estimate of $\rho$ by solving the following sequence of problems:

[TABLE]
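The re-planning loop for unknown $\rho$ can be sketched as follows. The planner below is a deliberately trivial stand-in (it splits the remaining budget evenly and ignores the estimate of $\rho$); in the approach described above it would be replaced by a solver for the relaxed program. All names and values are hypothetical.

```python
def sample_cost(k, p0, p1):
    """Fixed cost p0 when k > 0, plus marginal cost p1 per sample."""
    return 0.0 if k == 0 else p0 + p1 * k

def remaining_budget(budget, taken, p0, p1):
    """Budget left after paying for the samples already taken."""
    return budget - sum(sample_cost(k, p0, p1) for k in taken)

def even_split_planner(horizon, budget, rho_hat, p0, p1):
    """Hypothetical stand-in for the relaxed program: spreads the budget
    evenly over the remaining horizon and ignores rho_hat."""
    per_step = budget / horizon
    k = int((per_step - p0) / p1) if per_step >= p0 + p1 else 0
    return [k] * horizon

def replan(T, budget, rho_estimates, p0, p1, planner=even_split_planner):
    """At each time m, plan over the remaining T - m + 1 steps with the
    latest estimate of rho, then commit only the first decision."""
    taken = []
    for m in range(1, T + 1):
        left = remaining_budget(budget, taken, p0, p1)
        plan = planner(T - m + 1, left, rho_estimates[m - 1], p0, p1)
        taken.append(plan[0])
    return taken

# Hypothetical inputs: horizon 4, budget 40, constant rho estimate.
taken = replan(4, 40.0, [1.0] * 4, p0=2.0, p1=1.0)
left_over = remaining_budget(40.0, taken, 2.0, 1.0)
```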

IV-B Cross Validation

We can also apply cross-validation for model selection. Suppose we have loss functions $\ell_{\lambda}(\bm{w},\bm{z})$ parameterized by $\lambda$ , which controls the model complexity. For example, we could have a quadratic penalty term

[TABLE]

The value of $\lambda=0$ corresponds to the true loss function that we want to minimize. Suppose we have $C$ different values $\lambda^{(1)},\lambda^{(2)},\ldots,\lambda^{(C)}$ of $\lambda$ under consideration. For each $\lambda^{(i)}$ , we generate an approximate minimizer $\bm{w}_{n}^{(i)}$ of

[TABLE]

We want to select the value $\lambda^{(i)}$ and corresponding $\bm{w}_{n}^{(i)}$ that achieves the smallest loss

[TABLE]

We generate an approximate minimizer $\bm{w}_{n}^{(i)}$ for each problem in (40) starting from $\bm{w}_{n-1}^{(i)}$ . To select the best choice of $\lambda^{(i^{*})}$ in terms of minimizing (41), we apply cross-validation and set $\bm{w}_{n}=\bm{w}_{n}^{(i^{*})}$ [33].

The idea behind cross-validation is to divide the training samples $\{\bm{z}_{n}(k)\}_{k=1}^{K_{n}}$ into $P$ equal-sized pieces. For each choice of $P-1$ out of the $P$ pieces, we use those $P-1$ pieces of the training set to generate an approximate solution $\tilde{\bm{w}}_{n}^{(i)}$ to (40). We use the remaining piece of the training set to evaluate the empirical test loss achieved by $\tilde{\bm{w}}_{n}^{(i)}$ using a sample average approximation. We repeat this for every possible choice of $P-1$ out of $P$ pieces and average the empirical test loss estimates. We then select the value $\lambda^{(i^{*})}$ that achieves the smallest empirical test loss.
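The procedure just described can be sketched for a one-dimensional regression problem. This sketch assumes a quadratic loss with a quadratic penalty and, for brevity, uses the closed-form penalized minimizer in place of an SGD-based approximate solution; the data, function names, and candidate values of $\lambda$ are hypothetical.

```python
import random

def fit_ridge_1d(data, lam):
    """Closed-form minimizer of the average of 0.5*(y - w*x)^2 + 0.5*lam*w^2."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam * len(data))

def held_out_loss(w, data):
    """Sample-average quadratic loss without the penalty (lambda = 0)."""
    return sum(0.5 * (y - w * x) ** 2 for x, y in data) / len(data)

def cross_validate(data, lambdas, num_folds):
    """Return the lambda with the smallest average held-out loss."""
    folds = [data[i::num_folds] for i in range(num_folds)]
    avg_losses = []
    for lam in lambdas:
        losses = []
        for held_out in range(num_folds):
            # Train on all pieces except the held-out one.
            train = [z for j, f in enumerate(folds) if j != held_out for z in f]
            w = fit_ridge_1d(train, lam)
            losses.append(held_out_loss(w, folds[held_out]))
        avg_losses.append(sum(losses) / num_folds)
    best = min(range(len(lambdas)), key=lambda i: avg_losses[i])
    return lambdas[best], avg_losses

# Hypothetical data: y = 2x + small Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(60):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))

best_lam, losses = cross_validate(data, [0.0, 0.1, 10.0], num_folds=5)
```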

To apply cross-validation to our framework, we run $C$ parallel versions of our approach and at time $n$ we generate $C$ different choices for the number of samples $K_{n}^{(i)}$ . We then choose

[TABLE]

After choosing $K_{n}$ , we apply the usual cross-validation approach to select $\lambda^{(i)}$ for time $n$ . Fig. 1 shows this approach for two values of $\lambda$ .

V Experiments

We provide two regression examples, one with synthetic data and one with real data, as well as a classification example with synthetic data. For the synthetic regression problem, we can explicitly compute $\rho$ and $\bm{w}_{n}^{*}$ and exactly evaluate the performance of our method. It is straightforward to check that all requirements in Section II-A are satisfied for the problems considered in this section. We apply the “do not update past excess risk” choice of $K_{n}$ here.

V-A Synthetic Regression

Consider a regression problem with synthetic data using the penalized quadratic loss

[TABLE]

with $\bm{z}=(\bm{x},y)\in\mathbb{R}^{3}$ . We further assume that

[TABLE]

Under these assumptions, we can analytically compute minimizers $\bm{w}_{n}^{*}$ of $f_{n}(\bm{w})=\mathbb{E}_{\bm{z}_{n}\sim p_{n}}\left[\ell(\bm{w},\bm{z}_{n})\right]$ . We change only $r_{\bm{x}_{n},y_{n}}$ and $\sigma_{y_{n}}^{2}$ appropriately to ensure that $\|\bm{w}_{n}^{*}-\bm{w}_{n-1}^{*}\|_{2}=\rho$ holds for all $n$ . We find approximate minimizers using SGD with $\lambda=0$ . We estimate $\rho$ using the direct estimate.
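A minimal sketch of the SGD step used to find approximate minimizers is given below. It assumes the per-sample loss $\frac{1}{2}(y-\bm{w}^{\top}\bm{x})^{2}+\frac{1}{2}\lambda\|\bm{w}\|_{2}^{2}$ with two-dimensional features; the step size, the data-generating model, and all names are illustrative assumptions rather than the paper's settings.

```python
import random

def sgd_quadratic(samples, w_init, lam=0.0):
    """Run one SGD pass over `samples`, warm-started at `w_init`.

    Assumed per-sample loss: 0.5*(y - <w, x>)^2 + 0.5*lam*||w||^2.
    The step size 1/(k + 10) is an illustrative choice.
    """
    w = list(w_init)
    for k, (x, y) in enumerate(samples, start=1):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        step = 1.0 / (k + 10)
        # Stochastic gradient of the assumed loss at the current sample.
        w = [wi - step * (residual * xi + lam * wi) for wi, xi in zip(w, x)]
    return w

rng = random.Random(0)
w_true = [1.0, -2.0]  # hypothetical minimizer used to generate data

def draw_sample():
    """Draw (x, y) with standard Gaussian features and small label noise."""
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    y = sum(wi * xi for wi, xi in zip(w_true, x)) + rng.gauss(0.0, 0.1)
    return x, y

samples = [draw_sample() for _ in range(2000)]
w_hat = sgd_quadratic(samples, w_init=[0.0, 0.0])
```

In the sequential setting, `w_init` at time $n$ would be the approximate minimizer from time $n-1$, so a small change $\rho$ in the minimizers means fewer samples are needed to reach the target excess risk.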

We let $n$ range from $1$ to $20$ with $\rho=1$, a target excess risk $\epsilon=0.1$, and $K_{n}$ from (25). We average over twenty runs of our algorithm. Figure 2 shows $\hat{\rho}_{n}$, our estimate of $\rho$, which is above $\rho$ in general. Figure 3 shows the number of samples $K_{n}$, which settles down. We can exactly compute $f_{n}(\bm{w}_{n})-f_{n}(\bm{w}_{n}^{*})$, and so by averaging over the twenty runs of our algorithm, we can estimate the excess risk (denoted “sample average estimate”). We also average over the time horizon from $n=1$ to $25$ to yield a sample average estimate of the excess risk of $2.797\times 10^{-2}\pm 1.071\times 10^{-2}$. Since this is below the target $\epsilon=0.1$, we achieve our desired excess risk.

V-A1 Cost Approach

We consider applying the cost approach in Section IV-A to the synthetic regression problem with the cost in (29). We compare the optimal cost approach introduced in (38) of Section IV-A to the approach in (25), taking all samples at time $n=1$ as in (31), and taking samples every five time instants as in (32). Note that the method from (25) does not satisfy the cost budget. Fig. 4 shows the test loss of these approaches. We achieve similar test loss to the method in (25) and better than the other two methods. Fig. 5 shows the number of samples selected for both methods. At some time instants, our optimal cost approach does not take samples.

This problem is an example of one where the initial distance term in $b(d_{0},K)$ per the discussion from Section III matters. This is evidenced by the fact that when we do not take samples after the first time instant the test loss can grow large quickly as shown in Fig. 4.

V-B Synthetic Classification

Consider a binary classification problem using

[TABLE]

with ${\bm{z}=(\bm{x},y)\in\mathbb{R}^{d}\times\mathbb{R}}$ and $(y)_{+}=\max\{y,0\}$ . This is a smoothed version of the hinge loss used in support vector machines (SVM) [33]. We suppose that at time $n$ , the two classes have features drawn from a Gaussian distribution with covariance matrix $\sigma^{2}\bm{I}$ but different means $\mu_{n}^{(1)}$ and $\mu_{n}^{(2)}$ , i.e.,

[TABLE]

The class means move slowly over uniformly spaced points on a unit sphere in $\mathbb{R}^{d}$ as in Figure 6 to ensure that the constant Euclidean norm condition defined in (12) holds. We find approximate minimizers using SGD with $\lambda=0.1$ . We estimate $\rho$ using the direct estimate.

We let $n$ range from $1$ to $25$ and target an excess risk $\epsilon=0.1$. We average over twenty runs of our algorithm. As a comparison, if our algorithm takes $\{K_{n}\}_{n=1}^{25}$ samples, then we consider taking $\sum_{n=1}^{25}K_{n}$ samples up front at $n=1$. This is what we would do if we assumed that our problem is not time varying. Figure 7 shows $\hat{\rho}_{n}$, our estimate of $\rho$. Figure 8 shows the average test loss for both sampling strategies. To compute the test loss, we draw $T_{n}$ additional samples $\{\bm{z}_{n}^{\text{test}}(k)\}_{k=1}^{T_{n}}$ from $p_{n}$ and compute $\frac{1}{T_{n}}\sum_{k=1}^{T_{n}}\ell(\bm{w}_{n},\bm{z}_{n}^{\text{test}}(k))$. We see that our approach achieves substantially smaller test loss than taking all samples up front. We do not draw error bars on this plot, as they make it difficult to see the actual losses achieved.

To further evaluate our approach, we look at the receiver operating characteristic (ROC) of our classifiers. The ROC is a plot of the probability of a true positive against the probability of a false positive. The area under the curve (AUC) of the ROC equals the probability that a randomly chosen positive instance ($y=1$) will be rated higher than a randomly chosen negative instance ($y=-1$) [34]. Thus, a large AUC is desirable. Figure 9 plots the AUC of our approach against taking all samples up front. Our sampling approach achieves a substantially larger AUC.
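The probabilistic interpretation of the AUC translates directly into a pairwise computation, sketched below. Counting ties as one half is a common convention assumed here; the function name and the example scores are hypothetical.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive instance
    receives the higher score; ties count as one half (assumed convention)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores with labels in {+1, -1}.
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, -1, 1, -1])
```

In this example, three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75.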

V-C Panel Study on Income Dynamics Income - Regression

The Panel Study of Income Dynamics (PSID) surveyed individuals annually from 1974 to 2012 to gather demographic and income data [35]. We want to predict an individual’s annual income ($y$) from several demographic features ($\bm{x}$), including age, education, and work experience, among others, chosen based on previous economic studies in [36]. Conceptually, the idea is to rerun the survey process and determine how many samples we would need in order to solve this regression problem to within a desired excess risk criterion $\epsilon$.

We use the same loss function, direct estimate for $\rho$ , and minimization algorithm as the synthetic regression problem. We average over twenty runs of our algorithm by resampling without replacement [33]. For the sake of comparison, given a choice of samples $\{K_{n}\}_{n=1}^{T}$ produced by our approach, we compare against taking $\sum_{n=1}^{T}K_{n}$ samples at time $n=1$ and none afterwards. Note that this is what we would do if we believed that the regression model does not change over time. We are aware of no other methods to select the number of samples $K_{n}$ to control the excess risk against which we could compare our approach.

Figure 10 shows the number of samples $K_{n}$ , which settles down quickly. Figure 11 shows $\hat{\rho}_{n}$ . Figure 12 shows the test losses over time evaluated over twenty percent of the available samples. The test loss for our approach is substantially less than that obtained by taking the same number of samples up front.

VI Conclusion

We introduced a framework for adaptively solving a sequence of learning problems. We developed estimates of the change in the minimizers, which are used to determine the number of training samples $K_{n}$ needed to achieve a target excess risk $\epsilon$. We introduced a cost-based approach to select the number of samples and an approach to apply cross-validation. Experiments with synthetic and real data demonstrate that this approach is effective.

Bibliography

[1] C. Wilson and V. V. Veeravalli, “Adaptive sequential optimization with applications to machine learning,” in IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, Mar. 2016, pp. 2642–2646.
[2] C. Wilson and V. Veeravalli, “Adaptive sequential learning,” in 2016 50th Asilomar Conference on Signals, Systems and Computers, Nov. 2016, pp. 326–330.
[3] C. Wilson, V. V. Veeravalli, and A. Nedić, “Adaptive sequential stochastic optimization,” arXiv:1610.01970, Oct. 2016. (To appear in IEEE Transactions on Automatic Control, March 2019).
[4] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, The MIT Press, 2012.
[5] A. Agarwal, H. Daumé, and S. Gerber, “Learning multiple tasks using manifold regularization,” in Advances in Neural Information Processing Systems (NIPS), 2011, pp. 46–54.
[6] T. Evgeniou and M. Pontil, “Regularized multi-task learning,” in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2004, KDD ’04, pp. 109–117, ACM.
[7] Y. Zhang and D. Yeung, “A convex formulation for learning task relationships in multi-task learning,” CoRR, vol. abs/1203.3536, 2012.
[8] S. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
