Toward Optimal Run Racing: Application to Deep Learning Calibration

Olivier Bousquet; Sylvain Gelly; Karol Kurach; Marc Schoenauer; Michèle Sebag; Olivier Teytaud; Damien Vincent

arXiv:1706.03199·cs.LG·June 21, 2017


TL;DR

This paper introduces a method for efficient neural network calibration through early stopping and multiple hypothesis testing, achieving state-of-the-art results without additional hyper-parameters.

Contribution

It presents a theoretically grounded approach for optimal run selection in deep learning calibration, reducing computational costs and improving performance.

Findings

01

Significant improvement over existing methods on Cifar10, PTB, and Wiki benchmarks.

02

The approach guarantees optimality within a multiple hypothesis testing framework.

03

No extra hyper-parameters required for the calibration process.

Abstract

This paper aims at one-shot learning of deep neural nets, where a highly parallel setting is considered to address the algorithm calibration problem - selecting the best neural architecture and learning hyper-parameter values depending on the dataset at hand. The notoriously expensive calibration problem is optimally reduced by detecting and early stopping non-optimal runs. The theoretical contribution regards the optimality guarantees within the multiple hypothesis testing framework. Experimentations on the Cifar10, PTB and Wiki benchmarks demonstrate the relevance of the approach with a principled and consistent improvement on the state of the art with no extra hyper-parameter.

Tables

Table 1: Comparative results of pruning criteria (a) [7], (c), (e) and (f) (respectively in columns 3, 4, 5 and 6), with confidence $\delta=0.5$.

Testbed	Budget (number of runs)	Computational cost saved by method (a) (equal to Domhan et al. [7])	Computational cost saved by method (c): prediction-halt operator	Computational cost saved by method (e): best-prediction-halt operator	Computational cost saved by method (f): Clever-halt operator
Cifar-adagrad	22	-89.6%	-93.5% FAIL by 1.083 $\to$ 1.164	-96.5% FAIL by 1.083 $\to$ 1.164	-89.6%
Cifar-adam	22	-89.2%	-91.5% FAIL by 0.80 $\to$ 0.92	-96.5% FAIL by 0.80 $\to$ 0.94	-89.2%
Cifar-gradient	22	-87.2%	-87.2%	FAIL by 1.18 $\to$ 1.95	-87.2%
Cifar-momentum	22	-85.5%	-96.2% FAIL by 1.03 $\to$ 1.45	-96.2% FAIL by 1.03 $\to$ 1.45	-85.5%
Miniwiki bits	250	-69.6%	-74.1% FAIL by 2.01 $\to$ 2.06	-76.6% FAIL by 2.01 $\to$ 2.52	-69.6%
Miniwiki bytes	250	-47.6%	-47.6%	-73.2% FAIL by 1.86 $\to$ 1.91	-46.9%
PTB bits	250	-68.4%	-74.5% FAIL by 1.40 $\to$ 1.56	-76.7% FAIL by 1.40 $\to$ 1.66	-69.7%
PTB bytes	250	-65.7%	FAIL by 1.298 $\to$ 1.317	FAIL by 1.298 $\to$ 1.419	-65.7%
PTB words	250	-72.1%	-75.1%	-76.6% FAIL by 1.18 $\to$ 1.24	-72.1%
Miniwiki bits	50	-72.1%	-73.0%	-76.3% FAIL by 2.02 $\to$ 2.54	-72.1%
Miniwiki bytes	50	-61.5%	-72.6% FAIL by 1.88 $\to$ 2.03	-76.7% FAIL by 1.88 $\to$ 2.19	-61.5%
PTB bits	50	-72.1%	-74.5% FAIL by 1.41 $\to$ 1.46	-76.7% FAIL by 1.41 $\to$ 1.70	-72.1%
PTB bytes	50	-62.1%	-72.9% FAIL by 1.31 $\to$ 1.40	-76.7% FAIL by 1.31 $\to$ 1.47	-62.1%
PTB words	50	-65.1%	-70.5% FAIL by 1.20 $\to$ 1.22	-76.5% FAIL by 1.20 $\to$ 1.28	-65.1%
Mini	7	-13.3%	-55.2% FAIL by 1.455 $\to$ 1.149	-76.7% FAIL by 1.455 $\to$ 1.499	-18.1%
Uncoupled bytes	5	-5.33%	-70% FAIL by 1.402 $\to$ 1.195	-76.7% FAIL by 1.402 $\to$ 1.625	-10.7%
Coupled bytes	5	-2.7%	-73.3% FAIL by 1.155 $\to$ 1.176	-73.3% FAIL by 1.155 $\to$ 1.176	-6.7%
Uncoupled words	5	-5.33%	-49.3% FAIL by 1.470 $\to$ 1.471	-76.7% FAIL by 1.470 $\to$ 1.767	-5.33%
Coupled words	5	-2.67%	-74% FAIL by 1.16 $\to$ 1.23	-75.3% FAIL by 1.16 $\to$ 1.23	-9.33%
Maxi	77	-4.91%	-17.6%	-21.93% FAIL by 1.482 $\to$ 1.490	-5.48%
MetaCifar	4x22	-42.5%	-62.7% FAIL by 1.03 $\to$ 1.10	-75.4% FAIL by 1.03 $\to$ 1.10	-42.5%
MetaNorm .N	32x1	-53.5%	-56.5% FAIL by 0.93 $\to$ 0.98	-56.9% FAIL by 0.93 $\to$ 0.98	-53.5%
MetaNorm AN	32x1	-68.9%	-69.6%	-69.6%	-68.9%
MetaNorm anbn	32x1	-72.6%	-73.5% FAIL by 3e-4	-73.5% FAIL by 3e-4	-72.6%
					No failure, best average performance.

Table 2: Comparative results under the same conditions as in Table 1, where each pruning criterion is enriched with three simple conservative rules.

Testbed	Budget (number of runs)	Computational cost saved by method (a) (equal to Domhan et al. [7])	Computational cost saved by method (c): prediction-halt operator	Computational cost saved by method (e): best-prediction-halt operator	Computational cost saved by method (f): Clever-halt operator
Cifar-adagrad	22	-56.5%	-56.5%	-56.5%	-56.5%
Cifar-adam	22	-88.1%	-94.1% FAIL by 0.797 $\to$ 0.923	-95.1% FAIL by 0.797 $\to$ 0.939	-88.1%
Cifar-gradient	22	-83.5%	-83.5%	-83.5%	-83.5%
Cifar-momentum	22	-82.5%	-92.5% FAIL by 1.034 $\to$ 1.451	-92.9% FAIL by 1.034 $\to$ 1.451	-82.5%
Miniwiki bits	250	-48.6%	-54.7% FAIL by 2.011 $\to$ 2.035	-54.7% FAIL by 2.011 $\to$ 2.035	-48.6%
Miniwiki bytes	250	-13.8%	-13.8%	-14.1%	-13.8%
PTB bits	250	-44.0%	-44.8% FAIL by 6e-4	-44.8% FAIL by 6e-4	-44.0%
PTB bytes	250	-20.6%	-21.2%	-21.2%	-20.6%
PTB words	250	-66%	-69.7%	-69.7%	-66%
Miniwiki bits	50	-41.8%	-42.7%	-42.7%	-41.8%
Miniwiki bytes	50	-43.8%	-49.9%	-49.9%	-43.8%
PTB bits	50	-50.1%	-50.2% FAIL by 2e-4	-50.2% FAIL by 2e-4	-50.1%
PTB bytes	50	-22%	-22%	-22%	-22%
PTB words	50	-60.5%	-68.7% FAIL by 1.201 $\to$ 2.215	-73.3% FAIL by 1.201 $\to$ 2.215	-60.5%
Mini	7	-13.3%	-51.9% FAIL by 1.451 $\to$ 1.455	-76.7% FAIL by 1.451 $\to$ 1.499	-18.1%

Table 3: Comparative results under the same conditions as in Table 2, except for confidence $\delta=0.01$.

Testbed	Budget (number of runs)	Computational cost saved by method (a) (equal to Domhan et al. [7])	Computational cost saved by method (c): prediction-halt operator	Computational cost saved by method (e): best-prediction-halt operator	Computational cost saved by method (f): Clever-halt operator
Cifar-adagrad	22	-50.2%	-50.2%	-57.2%	-50.2%
Cifar-adam	22	-85.8%	-85.8%	-94.75% FAIL by 0.80 $\to$ 0.94	-85.9%
Cifar-gradient	22	-70.75%	-70.75%	-70.75%	-70.75%
Cifar-momentum	22	-63.3%	-81.1%	-81.1%	-63.3%
Miniwiki bits	250	-19%	-19%	-20.5%	-19%
Miniwiki bytes	250	-9.5%	-9.5%	-9.5%	-9.1%
PTB bits	250	-17.1%	-17.1%	-19.2%	-17.1%
PTB bytes	250	-17.7%	-17.7%	-17.7%	-17.7%
PTB words	250	-40.5%	-51.0%	-56.4%	-44.4%
Miniwiki bits	50	-32.7%	-32.7%	-32.7%	-32.7%
Miniwiki bytes	50	-30.5%	-30.5%	-57.8%	-32.7%
PTB bits	50	-27.1%	-27.1%	-27.1%	-27.1%
PTB bytes	50	-18%	-18%	-18%	-18%
PTB words	50	-49.6%	-58.1%	-65.7% FAIL by 1.2014 $\to$ 1.23329	-50.3%
			No fail, best mean performance.


Taxonomy

Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Advanced Bandit Algorithms Research

Methods: Early Stopping

Full text

Toward Optimal Run Racing: Application to Deep Learning Calibration

Olivier Bousquet (1), Sylvain Gelly (1), Karol Kurach (1), Marc Schoenauer (2), Michèle Sebag (2), Olivier Teytaud (1), Damien Vincent (1)

1. Google Brain, 2. TAU, Inria Saclay IDF.

Abstract

This paper aims at one-shot learning of deep neural nets, where a highly parallel setting is considered to address the algorithm calibration problem $-$ selecting the best neural architecture and learning hyper-parameter values depending on the dataset at hand. The notoriously expensive calibration problem is optimally reduced by detecting and early stopping non-optimal runs. The theoretical contribution regards the optimality guarantees within the multiple hypothesis testing framework. Experimentations on the Cifar10, PTB and Wiki benchmarks demonstrate the relevance of the approach with a principled and consistent improvement on the state of the art [7] with no extra hyper-parameter.

1 Introduction

The algorithm selection problem $-$ aimed at selecting a priori the learning algorithm best suited to a given dataset, and the algorithm calibration problem $-$ aimed at identifying the best hyper-parameter setting of an algorithm for the dataset at hand, have been acknowledged to be key issues since the late 80s [4, 3, 27, 2, 25, 20]. Several challenges have been organized to further investigate both algorithm selection and calibration issues in the last few years [10, 1].

The algorithm selection issue appears settled as of now, at least in the case where sufficient training data is available: deep learning (DL) consistently delivers dominant performances in many application domains, and is currently considered to be the best learning algorithm in the large data regime [14, 8, 6, 18, 23]. This renders the algorithm calibration an even more critical issue: on the one hand, DL notoriously requires high computational resources; on the other hand, it involves a structured hyper-parameter space, hindering the approximation of the performance model. Automatic algorithm calibration thus is challenged by manual algorithm calibration, as noted by [7]. As the experienced practitioner can easily detect and stop the unpromising runs based on their learning curves in the first epochs, she can afford to consider many more hyper-parameter settings.

How to discard as early as possible runs/solutions that will eventually yield suboptimal results has long and thoroughly been investigated (section 2). The early discard decision problem raises two interdependent questions: uncertainty modelling, as the eventual quality of a run result is unknown until the run is completed; and risk control, as one needs guarantees that the run which would have yielded the best result has not been stopped.

This paper addresses the early discard problem in the context of parallel one-shot deep neural training. Formally, the considered framework, referred to as parallel one-shot run race (PaRR), allocates all available computational resources at the beginning of the period to train deep neural nets; each core runs with its specific hyper-parameter setting, or configuration, with no communication among the cores. The goal is to make DL robust w.r.t. random hazards (e.g. initializations) and bad decisions (e.g. configuration set), eventually delivering the optimal configuration/learned model, with a minimal computational budget. The challenge lies in making the stopping decision with little and censored evidence: as all runs are simultaneous, only prior information about the learning curves behavior is available, as in [7]. Formally, the PaRR problem is a constrained optimization problem: i) the constraint regards the guarantees about eventually delivering the optimal result, i.e. ensuring that the best run lives until the end of the period; ii) the optimization consists of minimizing the computational budget subject to the optimality guarantee, by stopping any run (with no possible resuming of the run, as opposed to [26]) as early as possible.

The present paper, building upon the current best approach [7], makes theoretical and empirical contributions. On the theoretical side and with no additional hyper-parameter, a principled approach is used to set the pruning thresholds; furthermore, guarantees are obtained through a principled treatment of the multiple hypothesis testing issue. On the empirical side, experimentations on the Cifar [13], PTB [17], and MiniWiki [11] benchmarks show a consistent improvement compared to [7], for each and every hyper-parameter setting.

This paper is organized as follows. Section 2 briefly discusses related work. Section 3 formulates the PaRR problem together with different types of statistical risk and associated criteria. The proposed PaRR decision maker, with the different variants associated with the risks, is empirically assessed and compared to the state of the art in section 4.

2 Related work

Performance modelling.

In the domain of algorithm selection and calibration, a usual approach is to build a performance model [22], predicting the eventual performance of the algorithm based on its hyper-parameter configuration and on the description of the problem instance at hand. In the Machine Learning domain however, in contrast with the SAT and CSP domains, to the best of our knowledge there does not yet exist an affordable feature set able to accurately describe a problem instance and to support the prediction of an algorithm's performance on this instance. For this reason, algorithm selection and calibration in Machine Learning (see e.g. [3, 24, 28]) builds online an instance-dependent performance model, learned using Gaussian Processes [3, 24], Random Forests [28] or radius-based functions [12]. In most cases [3, 24, 28] the performance model is used along Bayesian Optimization principles [19] to determine the most promising algorithm configuration. In [12], coordinate-based optimization reports good results, particularly so in high-dimensional hyper-parameter spaces.

By construction, the above approaches are intrinsically sequential, making it difficult to use the above performance models to stop unpromising runs.

A first extension overcoming the sequential issue is proposed by [26]. The instance-dependent Gaussian Process model built from the available learning curves (in the following, learning curve denotes the available evidence about a configuration, reporting the performance w.r.t. the number of epochs so far) is used to decide whether to freeze a run or start another run. Overall, [26] maintains a basket of runs, typically involving 10 alive (non-frozen) runs and 3 new ones, where the decision is based on the maximum "asymptotic" performance reached on this learning curve according to Expected Improvement. In each round, the GP model is updated and the basket of runs is recomposed.

Another approach is that of [7], with two differences compared to [26]. Firstly, the domain knowledge is leveraged to select 11 models best reflecting the usual learning curves (ranging from vapor-pressure to Weibull law; see [7] for more detail), and referred to as basic models in the following. The ensemble of these basic models constitutes a parametric ensemble modelling space, including the parameters of each model and the weight of each model in the ensemble. Each learning curve is exploited using Bayesian inference to derive a posterior distribution on the ensemble modelling space, best accounting for this learning curve. The exploitation of this posterior distribution via MCMC supports an estimation of the performance that might be reached later on this learning curve, and the confidence thereof. Finally, based on a (user-supplied) confidence level $\delta$, a learning curve is halted whenever the probability that its eventual performance improves on the best-so-far performance is less than $\delta$.

Multi-Armed Bandit.

Another approach to parallel online optimization and pruning is based on the Multi-Armed Bandit framework, offering rigorous guarantees about the optimal allocation of trials. In [16], the problem of hyper-parameter optimization is formulated as a pure exploration adaptive resource allocation problem. The approach builds upon the Successive-Halving process proposed by [9], which most simply prunes the 50% of the runs with lowest current performance, until a single configuration remains; each run corresponds to a (uniformly sampled) configuration. Naturally, the overall performance of Successive-Halving critically depends on the initial allocated computational resources. The Hyperband approach [16] addresses this limitation using an infinitely-many-armed bandit approach on the space of number $n$ of configurations to be considered in parallel, times computational time $r$ allocated between two pruning steps, where the instant reward associated to an $(n,r)$ pair is the best learning performance achieved by Successive-Halving $(n,r)$.
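The Successive-Halving step described above can be illustrated with a minimal sketch. It is not the implementation of [9] or [16]: the callback `train_one_round`, the string identifiers for configurations and the parameter `rounds_per_step` are hypothetical names introduced only for this illustration, and the loss is assumed to be minimized.

```python
from typing import Callable, Dict, List


def successive_halving(
    configs: List[str],
    train_one_round: Callable[[str, int], float],
    rounds_per_step: int = 1,
) -> str:
    """Keep the better half of the runs at each step until one remains.

    `train_one_round(config, r)` is a hypothetical callback that trains
    `config` for `r` more resource units and returns its current
    validation loss (lower is better).
    """
    if not configs:
        raise ValueError("at least one configuration is required")
    alive = list(configs)
    while len(alive) > 1:
        # Give every surviving run the same additional budget.
        losses: Dict[str, float] = {
            c: train_one_round(c, rounds_per_step) for c in alive
        }
        alive.sort(key=lambda c: losses[c])
        # Prune the 50% of runs with the worst current performance.
        alive = alive[: max(1, len(alive) // 2)]
    return alive[0]
```

Note that the pruning fraction is fixed and only the current loss is used, which is the behaviour discussed in the next paragraphs.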

Discussion.

Compared to Hyperband, performance modelling offers two significant advantages. Firstly, it makes it possible to prune an arbitrary number of runs whenever an excellent one is found; in contrast, Hyperband does not allow learning across runs; each trial considers a new iid configuration sample. Secondly, Hyperband involves a fixed discarding rate, determining the fraction of pruned runs in each Successive-Halving step. In the Deep Learning context however, validation curves are very noisy at the beginning, and some hyper-parameters (e.g. when the learning rate decay starts) have a delayed impact, making all learning curves very similar in the early steps. In such contexts, early pruning is mostly random.

On the other hand, performance modelling does not allow for an efficient pruning in the parallel setting. Typically, whenever several runs are very similar and close to the best-so-far one, the parallel approach proposed by [7] is bound to keep them all.

Our goal, as said, is to achieve an optimal pruning under the optimality constraint (preserving the optimal performance out of the initial set of configurations). To this aim, the contributions described in next section will focus on how to use the performance model in order to adjust the selection threshold, and how to address the multiple hypothesis testing issue in a consistent way.

3 Overview of PaRR

The presented approach relies on performance modelling and closely follows the approach of [7]. The same 11 basic models are used (as the goal is a minimization one, the model $f_{\theta}(x)$ becomes $\theta^{\prime}-f_{\theta}(x)$ with $\theta^{\prime}$ an additional parameter). All attempts to reduce the number of models resulted in worse performance. Each learning curve (validation-error($t$), for $t=1\ldots$ current epoch) derives a posterior distribution on the ensemble modelling space, using Bayesian inference from the same un-informative prior. This posterior distribution is likewise used by MCMC to derive an estimate of the validation-error for $t^{\prime}>t$, together with the confidence thereof. The overall computational budget is finite, with $t<T$. For simplicity and by abuse of language, we will refer to asymptotic properties to designate properties that are true at epoch $T$.

Criteria for halting a run. Six criteria are presented below, to make the decision of halting a learning curve (halting the run with no later resuming). These criteria are parameterized by a confidence threshold $\delta\in[0,1]$ , as in [7]. Another quantity involved in these criteria is the current best performance noted $y^{*}(t)$ and the predicted asymptotic result of the current best learning curve, noted $\hat{y}^{*}(T)$ .

(a) Default halt operator: a run is halted if the probability that it performs asymptotically better than the current best is less than $\delta$. This is the criterion used by [7].

If we trust the confidence intervals and if the validation error is noise-free, this criterion has a probability at least $1-\delta$ not to halt an optimal curve. At each given time step, the probability of a mishalt (i.e., halt of the optimal run) is therefore bounded by $\delta$.

(b) A run is halted if the probability that it performs asymptotically better than the predicted asymptotic result of the current best is less than $\delta$.

Compared to (a), the bound now is the predicted expected performance of the current best, instead of the current result of the current best. But this criterion is risky, as it does not use the confidence interval of the current best; it will therefore not be mentioned any more here, being subsumed by the next criterion (c).

(c) Prediction-halt operator: a run is halted if the probability that it performs asymptotically better than a conservative estimate of the predicted asymptotic result of the current best is less than $\delta$.

This criterion uses an upper bound on the asymptotic performance of the current best run. The decision hence depends on two curve models. There is thus a probability of at least $1-2\delta$ not to halt an optimal run: the probability of a mishalt is therefore bounded by $2\delta$.

A drawback of both (b) and (c) is how they handle currently poor runs that present a steep improvement, whereas the current best is stagnating. In such a case, the criterion might be too conservative. This leads to proposing the following criteria (d) and (e), counterparts of criteria (b) and (c) but using the overall best conservative prediction instead of the conservative prediction of the current best.

(d) A run is halted if the probability that it performs asymptotically better than the overall best of all predicted asymptotic results is less than $\delta$.

(e) Best-prediction halt operator: a run is halted if the probability that it performs asymptotically better than the overall best conservative estimate of all predicted asymptotic results is less than $\delta$.

There is however a subtle side-effect with this operator: if each curve model fails with probability up to $\delta$, and there are $n$ runs, then the cumulated risk can approach $1$. This is why we propose the following criterion:

(f) Clever-halt operator: a run is halted if the probability that it performs asymptotically better than the $k^{th}$ best predicted asymptotic result is less than $\delta$. We will now discuss the choice of $k$.
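Before turning to the choice of $k$, the comparison thresholds behind criteria (a), (c), (e) and (f) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: `final_samples[i]` is assumed to hold posterior samples (e.g. MCMC draws) of run $i$'s validation loss at epoch $T$, the "conservative estimate" is taken to be the empirical $(1-\delta)$-quantile of those samples, and criterion (f) is read as using the $k^{th}$ smallest conservative estimate. All function and argument names are hypothetical.

```python
from typing import Dict, List, Sequence


def upper_quantile(samples: Sequence[float], delta: float) -> float:
    """Empirical (1 - delta)-quantile: a conservative (high) loss estimate."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int((1.0 - delta) * len(ordered)))
    return ordered[index]


def prob_better(samples: Sequence[float], threshold: float) -> float:
    """Fraction of posterior samples whose final loss beats the threshold."""
    return sum(s < threshold for s in samples) / len(samples)


def thresholds(
    current: Sequence[float],
    final_samples: Sequence[Sequence[float]],
    delta: float,
    k: int,
) -> Dict[str, float]:
    """Comparison threshold of each criterion (losses are minimized)."""
    best_run = min(range(len(current)), key=lambda i: current[i])
    bounds = sorted(upper_quantile(s, delta) for s in final_samples)
    return {
        "a": current[best_run],                               # current best value
        "c": upper_quantile(final_samples[best_run], delta),  # bound for current best
        "e": bounds[0],                                       # best bound over all runs
        "f": bounds[min(k, len(bounds)) - 1],                 # k-th best bound
    }


def runs_to_halt(
    current: Sequence[float],
    final_samples: Sequence[Sequence[float]],
    delta: float,
    criterion: str,
    k: int = 1,
) -> List[int]:
    """Indices of runs whose probability of beating the threshold is < delta."""
    threshold = thresholds(current, final_samples, delta, k)[criterion]
    return [
        i for i, s in enumerate(final_samples)
        if prob_better(s, threshold) < delta
    ]
```

In this sketch the criteria differ only in the threshold: (a) compares against an observed value, while (c), (e) and (f) compare against predicted bounds and are therefore exposed to modelling errors on one, all, or all but $k-1$ of the curves respectively.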

Let us assume that there are $n$ competing runs, and that the probability that one given curve modelling fails in providing an upper or lower bound is at most $\delta$ (we trust our models with confidence $\delta$ ). Then, the probability that at least one curve is poorly modeled is less than $1-(1-\delta)^{n}$ – and one poor modeling can lead to the failure of methods (d) or (e).

Furthermore, this implies that the probability either the best asymptotic run or the current best run is poorly modeled is less than $2\delta$ : this justifies method (c), but not (b).

Finally, the probability that at least $k$ curves are poorly modeled can be made less than $\delta$ by choosing $k$ sufficiently large. Here, using the Gaussian approximation of this probability, we require that $k$ satisfy $P({\cal N}(0,1)\geq\frac{k-n\delta}{\sqrt{n\delta(1-\delta)}})\leq\delta$. We select $k$ numerically as the smallest integer satisfying this inequality. As $k$ depends on $n$, it will vary from one context to the next.
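The numerical selection of $k$ can be sketched in a few lines using only the Python standard library. The function names are hypothetical, the search is assumed to start at $k=1$ (the $k^{th}$ best run must exist), and `familywise_risk` assumes independent modelling failures, as in the bound $1-(1-\delta)^{n}$ above.

```python
from math import sqrt
from statistics import NormalDist


def familywise_risk(n: int, delta: float) -> float:
    """Probability that at least one of n independent curve models fails."""
    return 1.0 - (1.0 - delta) ** n


def clever_halt_k(n: int, delta: float) -> int:
    """Smallest integer k >= 1 such that
    P(N(0,1) >= (k - n*delta) / sqrt(n*delta*(1-delta))) <= delta."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    mean = n * delta
    std = sqrt(n * delta * (1.0 - delta))
    normal = NormalDist()  # standard Gaussian
    k = 1
    # Increase k until the Gaussian tail probability drops to delta or below.
    while 1.0 - normal.cdf((k - mean) / std) > delta:
        k += 1
    return k
```

For instance, with $n=22$ runs and $\delta=0.05$, the risk that at least one curve is poorly modeled is about $0.68$, which illustrates why a single bad prediction should not be allowed to set the threshold.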

4 Experiments

This section presents the experimental setting and the experimental methodology followed to empirically compare the proposed PaRR pruning criteria. The first experimental results (section 4.3) suggest some simple heuristic improvements (section 4.4).

4.1 Experimental setting

Two families of large-size problems are considered, within the domain of language modelling and classification. In the former case, the loss is expressed in terms of perplexity (bits-per-unit, when predicting the next word). In the latter one, the loss is the cross-entropy one unless otherwise stated. Due to the large size of the datasets, only the validation error is considered; the overfitting issue is beyond the scope of the paper.

Five language modelling tasks are considered. PTB [17] aims at language modeling at the bit, byte and word levels. MiniWiki is a subset of the Hutter dataset [11] (also referred to as enwik8.zip); the size of the training set is 6% of the overall size; the modelling task is at the bit and byte level. The considered neural architecture involves 3 stacked LSTMs with 500 units, batch size 50, 30 unrolling steps, with a budget of 30 epochs. The hyper-parameters (uniformly and independently drawn) are the weight init scale (in $[0.02,1]$), the learning rate (in $[5,100]$), the dropout keep probability (in $[0.2,1]$), and the clipping gradient norm (in $[0.05,1]$).
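As an illustration of this sampling scheme, the following sketch draws independent uniform configurations from the four ranges quoted above. The dictionary keys, the function name and the seed handling are hypothetical choices made for the example; only the ranges come from the text.

```python
import random
from typing import Dict, List, Tuple

# Ranges quoted in the text for the language-modelling tasks.
LM_SEARCH_SPACE: Dict[str, Tuple[float, float]] = {
    "weight_init_scale": (0.02, 1.0),
    "learning_rate": (5.0, 100.0),
    "dropout_keep_prob": (0.2, 1.0),
    "clip_gradient_norm": (0.05, 1.0),
}


def sample_configurations(
    n_runs: int,
    space: Dict[str, Tuple[float, float]] = LM_SEARCH_SPACE,
    seed: int = 0,
) -> List[Dict[str, float]]:
    """Draw n_runs configurations, each hyper-parameter uniform and independent."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        for _ in range(n_runs)
    ]
```

Calling `sample_configurations(50)` would produce one configuration per parallel run for a budget of 50 runs.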

Four classification settings are considered, all based on the Cifar10 dataset [13], and involving four learning rate adaptation methods, namely Adagrad, Adam, Gradient, and Momentum. The NN architecture is made of 3 convolutional layers, with filter size 7x7, max-pooling, stride 1, 512 channels; followed by a convolutional layer with filter size 5x5 and 64 channels; followed by two fully connected layers with 384 and 192 units respectively. In all cases, the batch size is 64 with a budget of 200 epochs. The hyper-parameters (uniformly and independently drawn) include the weight init scale (in $[0.001,100]$), the weight init scale for convolutional layers (in $[0.001,0.1]$), the learning rate (in $[0.00001,10]$), the clipping gradient norm (in $[0.01,10]$), the number of epochs before learning rate decay (in $[12,198]$), the learning rate decay (in $[0.9,0.999]$) and the dropout keep probability (in $[0.8,1]$) for non-convolutional layers.

Specific experiments are considered for investigation:

Mini: considers the PTB word prediction task, with a small net with 2 layers of 20 units, 30 unrolling steps, learning rate 35, dropout 0.5, gradient clipping norm 0.143, 30 epochs, cell clipping [5] between $1$ and $10000$ .

Maxi: only differs from Mini as it considers more runs in parallel.

Coupled: considers the PTB bytes or word prediction task, with a larger net involving 2 LSTM with 650 units, optimized by stochastic gradient descent, optionally with coupled input and forget gates, other values as in Mini.

MetaCifar: operates on four experiments, each one considering 22 runs with a different learning rate adaptation method (Adagrad, Adam, Gradient, and Momentum).

Metanormalization: considers 32 runs with different variants of normalization for language modeling with three different toy sequences: (i) rote learning of sequences “.M” made of repetition of identical words; (ii) sequences “AN” repeating identical repetitions of same length words made of a same letter (but with possibly different lengths for different sequences); (iii) sequences of the form “anbn” (aaabbb aaaaaabbbbb…) with varying $n$.

4.2 Experimental methodology

In the remainder of the paper, a failure (FAIL) stands for halting the best run. The main components for the PaRR decisions are: i) the confidence threshold $\delta$ needed to prune a run; ii) the comparison threshold: the pruning decision is taken if the predictive performance of a run is below the comparison threshold with confidence at least $1-\delta$; and iii) the overall computational budget. The sensitivity w.r.t. the confidence value is discussed in section 4.3. The impact of the overall computational budget (here, the number of simultaneous runs) is dramatic, as illustrated on Fig. 1 in the case of the PTB modelling task at the bit level. 50 runs are launched in parallel, with an allowed number of epochs set to 30, 15, and 7. Empirically, circa 30 independent parallel runs are required to eventually deliver a "reasonably optimal" performance. Note that this experimental setting is a typical one, with a very significant computational cost; hence the need for the present work.

Additionally, Fig. 1 empirically demonstrates that the early ranks of the runs can be very misleading.

The experimental observations on a given problem are illustrated on the Mini case (Fig. 2) for $\delta=.5$ .

In the following, the performance of a pruning criterion on a given problem is assessed depending on whether it failed and halted the best run (FAIL); otherwise, it reports the computational savings.
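A minimal sketch of this assessment is given below, assuming an offline replay in which every run has been trained to the full budget so that the truly best run is known. The exact formula for the computational savings is not spelled out in the text; the sketch assumes it is the fraction of epochs that were not executed, and all names are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def evaluate_pruning(
    final_losses: Sequence[float],
    halt_epochs: Sequence[Optional[int]],
    budget_epochs: int,
) -> Tuple[bool, float]:
    """Return (failed, savings) for one pruning criterion on one problem.

    final_losses[i]: loss run i reaches when trained for the full budget.
    halt_epochs[i]:  epoch at which run i was halted, or None if never halted.
    """
    best_run = min(range(len(final_losses)), key=lambda i: final_losses[i])
    # FAIL: the run that would have been the best was halted.
    failed = halt_epochs[best_run] is not None
    used = sum(budget_epochs if h is None else h for h in halt_epochs)
    # Assumed definition: fraction of the total epoch budget left unused.
    savings = 1.0 - used / (budget_epochs * len(halt_epochs))
    return failed, savings
```

Under this assumed definition, a savings value of 0.5 would correspond to an entry of -50% in the tables.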

4.3 Pruning with confidence $\delta=.5$

Table 1 reports the results of criteria (a) (the baseline [7]); (c), (e) and (f) on all experimental cases, for $\delta=.5$ .

Unexpectedly, although the confidence is very low (which might imply at first sight that the probability of FAIL is ca. 50%), these results are very good. A tentative interpretation for this fact is that, although the confidence is low, the comparison threshold is set to the best validation error so far, which is a quite conservative threshold. With a low confidence $\delta$, more aggressive comparison thresholds entail failures. Criterion (c), considering the predicted error threshold, fails. Criterion (e), considering the best predicted validation error, fails even more often. Overall, the baseline method (a) and method (f) do not fail. The computational savings are such that method (f) wins over (a) in 5 experimental cases, and (a) wins over (f) in 1 experimental case.

4.4 Conservative pruning rules

A natural question raised by the empirical results (Table 1) is whether undesirable failures could be prevented using simple heuristic conservative rules, such as: i) never prune the current best; ii) discard all predictions of a negative loss; iii) discard predictions with correlation data/observation less than 0.5. Accordingly, the experiments are repeated by enriching all pruning criteria with the conservative rules (Table 2). Although these simple rules do save some failures for methods (c) and (e), the overall conclusions remain the same as from Table 1: methods (a) and (f) are the only safe ones, with (f) slightly outperforming (a) in terms of computational savings. As expected, the overall gain is eroded as the aggressive (c) and (e) methods no longer fail on 4 out of the 15 problems.
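The three conservative rules can be sketched as a guard placed in front of any pruning criterion. Several points are assumptions made for this illustration rather than details given in the text: "discarding" a prediction is read as refusing to prune on its basis, and the "correlation data/observation" is read as the Pearson correlation between the fitted curve and the observed validation errors. All names are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def may_prune(
    run_is_current_best: bool,
    predicted_final_loss: float,
    fitted_curve: Sequence[float],
    observed_curve: Sequence[float],
    criterion_says_prune: bool,
    min_correlation: float = 0.5,
) -> bool:
    """Apply the conservative rules before honouring a pruning decision."""
    if run_is_current_best:            # rule i: never prune the current best
        return False
    if predicted_final_loss < 0.0:     # rule ii: a negative loss is implausible
        return False
    if pearson(fitted_curve, observed_curve) < min_correlation:
        return False                   # rule iii: the model fits the data poorly
    return criterion_says_prune
```

The guard can only turn a pruning decision into a non-pruning one, which is consistent with the lower savings reported once the rules are enabled.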

4.5 Pruning with low confidence $\delta$ and conservative rules

Assuming that the conservative heuristic rules will prevent aggressive pruning criteria from most failures, a question is whether better savings can be obtained by lowering the confidence threshold. As shown in Table 3 (supplementary material), a lower $\delta=.01$ does significantly improve the results for criterion (c) (though dominated by the results of criterion (f) for $\delta=.5$ ). For criterion (e) however, failures are still observed; the interpretation for this fact (in agreement with the multiple hypothesis testing framework indeed) is that increasing the number of tests is a strong factor of failure. Interestingly, the overall results are globally worse than for $\delta=0.5$ .

Other confidence levels ( $\delta=0.1$ , $\delta=0.3$ , $\delta=0.05$ , $\delta=0.01$ ) have been considered (results in supplementary material). Fig. 3 graphically displays the performance of the pruning criterion (f) compared to the baseline (a) [7].

5 Conclusions

A first contribution of the presented work is to confirm the relevance of the pruning method proposed by [7], with computational savings often above 85% (particularly so for image applications), and applicable at both levels of randomized hyper-parameter optimization and model selection (the model selection problem itself embedding a hyper-parameter selection problem). Cumulative gains from both levels can decrease the computational cost by more than an order of magnitude, as shown on the Cifar and MetaCifar experiments. Interestingly, the sensitivity of the approach w.r.t. the confidence threshold $\delta$ reveals itself to be low. Set to $0.05$ in [7], we show that it can be increased up to $0.5$; this stability is explained by the conservative modelling of the performance in the considered settings. Our second contribution is a new and more principled pruning method, slightly but significantly outperforming the former method for all confidence thresholds $\delta$, with an excellent stability with respect to $\delta$ (see Fig. 3 and supplementary material). The novelty of the proposed approach is to refine the halting threshold using predictions, and to use a quantile of the predictions (as opposed to the best prediction), in an adaptive manner. The robustness of this approach relies on its consistent grounding in the multiple hypothesis testing framework. A first research perspective is to introduce diversity-based pruning for ensemble methods [15], taking inspiration from the clearing methods in multimodal optimization [21]. A second perspective will investigate whether the proposed quantile-based approach can make the inference simpler (as opposed to the Metropolis-Hastings method used in [7]). Finally, combining the proposed approach with mainstream Bayesian Optimization, exploiting both the validation curves and the structure of the hyper-parameter domain, would allow learning across runs.

Appendix A Experiments with other values of $\delta$

A.1 Comparison between criteria with $\delta=0.1$

[TABLE]

A.2 Comparison between criteria with $\delta=0.3$

We also checked the stability of the method at $\delta=0.3$. Our method still gives the best results.

[TABLE]

Bibliography


[1] Challenges in Machine Learning, 2015.
[2] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag. Collaborative hyperparameter tuning. In Proceedings of the 30th International Conference on Machine Learning (ICML), volume 28, pages 199–207, 2013.
[3] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS), volume 24 of Advances in Neural Information Processing Systems, Granada, Spain, 2011.
[4] P. Brazdil and C. Soares. A comparison of ranking methods for classification algorithm selection. In R. L. de Mántaras and E. Plaza, editors, the 11th European Conference on Machine Learning (ECML), volume 1810 of LNCS, pages 63–74. Springer, 2000.
[5] W. Chan and I. Lane. Deep recurrent neural networks for acoustic modelling. CoRR, abs/1504.01482, 2015.
[6] L. Deng, G. E. Hinton, and B. Kingsbury. New types of deep neural network learning for speech recognition and related applications: an overview. In ICASSP, pages 8599–8603. IEEE, 2013.
[7] T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, pages 3460–3468. AAAI Press, 2015.
[8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, volume 32 of JMLR Workshop and Conference Proceedings, pages 647–655. JMLR.org, 2014.
