An analysis of the cost of hyper-parameter selection via split-sample validation, with applications to penalized regression
Jean Feng, Noah Simon

TL;DR
This paper investigates how the generalization error grows with the number of hyper-parameters in model selection, providing finite-sample bounds and analyzing penalized regression with multiple penalties.
Contribution
It establishes finite-sample oracle inequalities for hyper-parameter tuning via split-sample validation and cross-validation, especially for penalized regression with multiple penalties.
Findings
Error from hyper-parameter tuning shrinks at nearly parametric rate for smooth models.
Adding hyper-parameters is akin to adding model parameters in parametric cases.
Lipschitz continuity of penalized models supports multiple penalty parameters.
An analysis of the cost of hyper-parameter selection via split-sample validation, with applications to penalized regression
Jean Feng, Noah Simon
Department of Biostatistics, University of Washington
Abstract: In the regression setting, given a set of hyper-parameters, a model-estimation procedure constructs a model from training data. The optimal hyper-parameters that minimize generalization error of the model are usually unknown. In practice they are often estimated using split-sample validation. Up to now, there is an open question regarding how the generalization error of the selected model grows with the number of hyper-parameters to be estimated. To answer this question, we establish finite-sample oracle inequalities for selection based on a single training/test split and based on cross-validation. We show that if the model-estimation procedures are smoothly parameterized by the hyper-parameters, the error incurred from tuning hyper-parameters shrinks at nearly a parametric rate. Hence for semi- and non-parametric model-estimation procedures with a fixed number of hyper-parameters, this additional error is negligible. For parametric model-estimation procedures, adding a hyper-parameter is roughly equivalent to adding a parameter to the model itself. In addition, we specialize these ideas for penalized regression problems with multiple penalty parameters. We establish that the fitted models are Lipschitz in the penalty parameters and thus our oracle inequalities apply. This result encourages development of regularization methods with many penalty parameters.
Key words and phrases: Cross-validation, Regression, Regularization.
2 Introduction
Per the usual regression framework, suppose we observe response and predictors . Suppose is generated by a true model plus random error with mean zero, e.g. . Our goal is to estimate . Many model-estimation procedures can be formulated as selecting a model from some function class given training data and -dimensional hyper-parameter vector . For example, in penalized regression problems, the fitted model can be expressed as the minimizer of the penalized training criterion
[TABLE]
where are penalty functions and are penalty parameters that serve as hyper-parameters of the model-estimation procedure.
If is a set of possible hyper-parameters, the goal is to find a penalty parameter that minimizes the expected generalization error. Typically one uses a sample-splitting procedure where models are trained on a random partition of the observed data and evaluated on the remaining data. One then chooses the hyper-parameter that minimizes the error on this validation set. For a more complete review of cross-validation, refer to Arlot et al. (2010).
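The selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes a linear model fitted by ridge regression with a single penalty, a finite candidate set, and simulated data; the function names `fit_ridge` and `select_by_validation` and the objective scaling are hypothetical choices made for the example.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge fit (assumed scaling): minimize 0.5*||y - X b||^2 + 0.5*lam*||b||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def select_by_validation(X_tr, y_tr, X_val, y_val, candidates):
    """Return the candidate penalty with the smallest validation error."""
    losses = []
    for lam in candidates:
        beta = fit_ridge(X_tr, y_tr, lam)
        losses.append(np.mean((y_val - X_val @ beta) ** 2))
    best = int(np.argmin(losses))
    return candidates[best], losses

# Simulated data, used only to exercise the procedure.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.concatenate([np.ones(3), np.zeros(p - 3)])
y = X @ beta_true + rng.normal(size=n)

# Random partition into a training half and a validation half.
perm = rng.permutation(n)
tr, val = perm[: n // 2], perm[n // 2 :]
candidates = [10.0 ** k for k in range(-3, 4)]
lam_hat, losses = select_by_validation(X[tr], y[tr], X[val], y[val], candidates)
print("selected penalty:", lam_hat)
```

The validation responses are used only to score the candidates, never to fit the model, which is the property the oracle inequalities in Section 3 rely on.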
The performance of split-sample validation procedures is typically characterized by an oracle inequality that bounds the generalization error of the expected model selected from the validation set procedure. For that are finite, oracle inequalities have been established for a single training/validation split (Györfi et al., 2006) and a general cross-validation framework (Van Der Laan and Dudoit, 2003; van der Laan et al., 2004). To handle over a continuous range, one can use entropy-based approaches (Lecué and Mitchell, 2012).
The goal of this paper is to characterize the performance of models when the hyper-parameters are tuned by some split-sample validation procedure. We are particularly interested in an open question raised in Bengio (2000): what is the “amount of overfitting… when too many hyper-parameters are optimized”? In addition, how many hyper-parameters is “too many”? In this paper we show that a large number of hyper-parameters can in fact be tuned without overfitting. Specifically, if an oracle estimator converges at rate , then the number of hyper-parameters can grow at roughly a rate of up to log terms without affecting the convergence rate. In practice, for penalized regression, this means that one can propose and tune over much more complex models than those commonly used today.
To show these results, we prove that finite-sample oracle inequalities of the form
[TABLE]
are satisfied with high probability for some constant and remainder that depends on the number of tuned hyper-parameters and the number of samples . Under the assumption that the model-estimation procedure is Lipschitz in the hyper-parameters, we find that scales linearly in . For parametric model-estimation procedures, the additional error from tuning hyper-parameters is roughly , which is similar to the typical parametric model-estimation rate where the model parameters are not regularized. For semi- and non-parametric model-estimation procedures, this error is generally dominated by the oracle risk so we can actually grow the number of hyper-parameters without affecting the asymptotic convergence rate.
In addition, we specialize our results to penalized regression models of the form (2.1). The models in our examples are Lipschitz so that our oracle inequalities apply. This suggests that multiple penalty parameters may improve the model estimation and that the recent interest in combining penalty functions (e.g. elastic net and sparse group lasso (Zou and Hastie, 2003; Simon et al., 2013)) may have artificially restricted themselves to two-way combinations.
During our literature search, we found few theoretical results relating the number of hyper-parameters to the generalization error of the selected model. Much of the previous work only considered tuning a one-dimensional hyper-parameter over a finite , proving asymptotic optimality (van der Laan et al., 2004) and finite-sample oracle inequalities (Van Der Laan and Dudoit, 2003; Györfi et al., 2006). Others have addressed split-sample validation for specific penalized regression problems with a single penalty parameter, such as linear model selection (Li, 1987; Shao, 1997; Golub et al., 1979; Chetverikov and Liao, 2016; Chatterjee and Jafarov, 2015). Only the results in Lecué and Mitchell (2012) are relevant to answering our question of interest. A potential reason for this dearth of literature is that, historically, tuning multiple hyper-parameters was computationally difficult. However there have been many recent proposals that address this computational hurdle (Bengio, 2000; Foo et al., 2008; Snoek et al., 2012).
Section 3 presents oracle inequalities for sample-splitting procedures to understand how the number of hyper-parameters affects the model error. Section 4 applies these results to penalized regression models. Section 5 provides a simulation study to support our theoretical results. Oracle inequalities for general model-estimation procedures and proofs are given in the Supplementary Materials.
3 Oracle Inequalities
Here we establish oracle inequalities for models where the hyper-parameters are tuned by a single training/validation split and cross-validation. We are interested in studying model-estimation procedures that vary smoothly in their hyper-parameters; such procedures tend to be easier to use and therefore tend to be more popular.
Let denote a dataset with samples. Given training data , let be some model-estimation procedure that maps a hyper-parameter to a function in . We make the following Lipschitz-like assumption on the model-estimation procedure. In particular, we suppose that for any , the predicted value is Lipschitz in :
Assumption 1.
Suppose there is a set such that for any and dataset , there is a function such that for any , we have for all
[TABLE]
We provide examples of penalized regression models that satisfy this assumption in Section 4.
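To make the Lipschitz requirement concrete, the following minimal Python sketch checks it numerically for a one-covariate ridge fit, where the fitted coefficient has a closed form. The objective scaling, the simulated data, the lower bound `lam_min` on the penalty, and all variable names are assumptions introduced for illustration and are not taken from the paper.

```python
import random

random.seed(0)
n = 50
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

def predict(lam, x_new):
    """Prediction of the ridge fit minimizing 0.5*sum((y - x*b)^2) + 0.5*lam*b^2.

    The minimizer is b(lam) = sxy / (sxx + lam).
    """
    return x_new * sxy / (sxx + lam)

lam_min = 1e-3  # assumed lower bound of the penalty range
x_new = 1.5
# d b / d lam = -sxy / (sxx + lam)^2, so for lam >= lam_min the prediction at
# x_new is Lipschitz in lam with the constant below.
lipschitz_const = abs(x_new) * abs(sxy) / (sxx + lam_min) ** 2

# Compare the bound with difference quotients at random pairs of penalties.
worst_ratio = 0.0
for _ in range(1000):
    l1 = random.uniform(lam_min, 10.0)
    l2 = random.uniform(lam_min, 10.0)
    if l1 != l2:
        ratio = abs(predict(l1, x_new) - predict(l2, x_new)) / abs(l1 - l2)
        worst_ratio = max(worst_ratio, ratio)

print(worst_ratio, lipschitz_const)
```

In this toy case the Lipschitz constant depends on the evaluation point and on the training data through `sxx` and `sxy`, and it worsens as the lower bound on the penalty shrinks.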
3.1 A Single Training/Validation Split
In the training/validation split procedure, the dataset is randomly partitioned into a training set and validation set with and observations, respectively. The selected hyper-parameter is a minimizer of the validation loss
[TABLE]
where for function .
We now present a finite-sample oracle inequality for the single training/validation split assuming Assumption 1 holds. Our oracle inequality is sharp, i.e. in (2.2), unlike most other work (Györfi et al., 2006; Lecué and Mitchell, 2012; Van Der Laan and Dudoit, 2003). Note that the result below is a special case of Theorem 3 in Supplementary Materials A.1, which applies to general model-estimation procedures.
Theorem 1.
Let where . Suppose random variables from the validation set are independent with expectation zero and are uniformly sub-Gaussian with parameters and :
[TABLE]
Let the oracle risk be denoted
[TABLE]
Suppose Assumption 1 is satisfied over the set . Then there is a constant only depending on and such that for all satisfying
[TABLE]
we have
[TABLE]
Theorem 1 states that with high probability, the excess risk, i.e. the error incurred during the hyper-parameter selection process, is no more than . As seen in (3.6), is the maximum of two terms: a near-parametric term and the geometric mean of the near-parametric term and the oracle risk. To see this more clearly, we express Theorem 1 using asymptotic notation.
Corollary 1.
Under the assumptions given in Theorem 1, we have
[TABLE]
Corollary 1 shows that the risk of the selected model is bounded by the oracle risk, the near-parametric term (3.9), and the geometric mean of the two values (3.10). We refer to (3.9) as near-parametric because the error term in (un-regularized) parametric regression models is typically , where is the parameter dimension and is the number of training samples. Analogously, (3.9) is modulo a term in the numerator. The geometric mean (3.10) can be thought of as a consequence of tuning hyper-parameters over
[TABLE]
As does not (or is very unlikely to) contain the true model , tuning the hyper-parameters via a training/validation split amounts to tuning over a misspecified model class. The geometric mean takes into account this misspecification error.
In the semi- and non-parametric regression settings, the oracle error usually shrinks at a rate of where . If the number of hyper-parameters is fixed and is large, the oracle risk will tend to dominate the upper bound. Hence for such problems, we can actually let the number of hyper-parameters grow – the asymptotic convergence rate of the upper bound will be unchanged as long as grows no faster than
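A short numerical sketch can illustrate why a slowly shrinking oracle risk leaves room for more hyper-parameters. The shapes used here are assumptions, not the exact terms of Corollary 1: the tuning cost is taken to be J*log(n)/n and the oracle risk n^(-r) with 0 < r < 1, with all constants dropped; `r = 0.8` is an arbitrary illustrative exponent.

```python
import math

def near_parametric(J, n):
    """Assumed shape of the tuning cost: J * log(n) / n (constants dropped)."""
    return J * math.log(n) / n

def oracle_rate(n, r):
    """Assumed nonparametric oracle rate n^(-r), with 0 < r < 1."""
    return n ** (-r)

r = 0.8  # illustrative exponent, not a value taken from the paper
sample_sizes = (10**3, 10**4, 10**5, 10**6)
ratios_fixed, ratios_growing = [], []
for n in sample_sizes:
    J_fixed = 5
    # Under the assumed shapes, this J keeps the two terms exactly balanced.
    J_growing = n ** (1 - r) / math.log(n)
    ratios_fixed.append(near_parametric(J_fixed, n) / oracle_rate(n, r))
    ratios_growing.append(near_parametric(J_growing, n) / oracle_rate(n, r))

for n, a, b in zip(sample_sizes, ratios_fixed, ratios_growing):
    print(n, round(a, 3), round(b, 3))
```

With a fixed number of hyper-parameters the ratio of tuning cost to oracle risk decays as the sample size grows, while letting J grow like n^(1-r)/log(n) holds the ratio constant, so under these assumptions the overall rate is unchanged.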
3.2 Cross-Validation
Now we give an oracle inequality for -fold cross-validation. Previously, the oracle inequality was with respect to the -norm over the validation covariates. Now we give our result with respect to the functional -norm. We suppose our dataset is composed of independent identically distributed observations where is independent of . The functional -norm is defined as .
For -fold cross-validation, we randomly partition the dataset into sets, which we assume to have equal size for simplicity. Partition will be denoted and its complement will be denoted . We train our model using for and select the hyper-parameter that minimizes the average validation loss
[TABLE]
In traditional cross-validation, the final model is retrained on all the data with . However bounding the generalization error of the retrained model requires additional regularity assumptions (Lecué and Mitchell, 2012). We consider the “averaged version of -fold cross-validation” instead
[TABLE]
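The averaged version of cross-validation can be sketched as follows. This is a minimal illustration under stated assumptions: a ridge-penalized linear model, a finite candidate set, equal-sized folds, and simulated data; the names `fit_ridge` and `averaged_kfold` are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge fit (assumed scaling): minimize 0.5*||y - X b||^2 + 0.5*lam*||b||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def averaged_kfold(X, y, candidates, K, rng):
    """Pick one penalty by average validation loss, then average the K fold models."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)
    train_sets = [np.setdiff1d(np.arange(n), fold) for fold in folds]

    avg_losses = []
    for lam in candidates:
        fold_losses = [
            np.mean((y[val] - X[val] @ fit_ridge(X[tr], y[tr], lam)) ** 2)
            for tr, val in zip(train_sets, folds)
        ]
        avg_losses.append(np.mean(fold_losses))
    lam_hat = candidates[int(np.argmin(avg_losses))]

    # For a linear model, averaging the K fitted functions is the same as
    # averaging their coefficient vectors. No refit on the full data is done.
    beta_bar = np.mean([fit_ridge(X[tr], y[tr], lam_hat) for tr in train_sets], axis=0)
    return lam_hat, beta_bar

rng = np.random.default_rng(1)
n, p, K = 120, 8, 5
X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)
candidates = [10.0 ** k for k in range(-2, 3)]
lam_hat, beta_bar = averaged_kfold(X, y, candidates, K, rng)
print("selected penalty:", lam_hat)
```

The only difference from traditional cross-validation is the last step: the final model is the average of the K models trained on the fold complements, rather than a single model retrained on all the data.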
To bound the generalization error of (3.13), we require an assumption in Lecué and Mitchell (2012) that controls the tail behavior of the fitted models. A classical approach for bounding the tail behavior of random variable is to bound its Orlicz norm (Van Der Vaart and Wellner, 1996).
Assumption 2.
There exist constants and such that for any , dataset , and , we have
[TABLE]
With the above assumption, the following oracle inequality bounds the risk of the averaged version of -fold cross-validation. It is a special case of Theorem 4 in the Supplementary Materials, which extends Theorem 3.5 in Lecué and Mitchell (2012). The notation indicates the expectation over random -sample datasets drawn from the probability distribution .
Theorem 2.
Let where . Suppose random variables are independent with expectation zero, satisfy , and are independent of . Suppose Assumption 1 holds over the set and Assumption 2 holds. Suppose there exists a function and some such that
[TABLE]
Then there exists an absolute constant and a constant such that for any ,
[TABLE]
As in Theorem 1, the remainder term in Theorem 2 includes a near-parametric term . So as before, adding hyper-parameters to parametric model estimation incurs a similar cost as adding parameters to the parametric model itself and adding hyper-parameters to semi- and non-parametric regression settings is relatively “cheap” and negligible asymptotically.
The differences between Theorems 1 and 2 highlight the tradeoffs made to establish an oracle inequality involving the functional -error. The biggest tradeoff is that Theorem 2 adds Assumption 2. Though we can relax Assumption 2 to hold over datasets in some high-probability set, the difficulty lies in controlling the tail behavior of the fitted models over all . For some model estimation procedures, may grow with if shrinks too quickly with . In this case, the remainder term may no longer shrink at a near-parametric rate. Unfortunately requiring to shrink at an appropriate rate seems to defeat the purpose of cross-validation. So even though Theorem 2 helps us better understand cross-validation, it is limited by this assumption. In addition, the Lipschitz assumption must hold over all in Theorem 2, rather than just the observed covariates. Finally, the oracle inequality in Theorem 2 is no longer sharp since the oracle risk is scaled by for .
4 Penalized regression models
Now we apply our results to analyze penalized regression procedures of the form (2.1). Penalty functions encourage particular characteristics in the fitted models (e.g. smoothness or sparsity) and combining multiple penalty functions results in models that exhibit a combination of the desired characteristics. There is much interest in combining multiple penalty functions, but few methods incorporate more than two penalties due to (a) the concern that models may overfit the data when selection of many penalty parameters is required; and (b) computational issues in optimizing multiple penalty parameters. In this section, we evaluate the validity of concern (a) using the results of Section 3. We see that, contrary to popular wisdom, using split-sample validation to select multiple penalty parameters should not result in a drastic increase to the generalization error of the selected model.
In this section, we consider penalty parameter spaces of the form for . This regime works well for two reasons: one, our rates depend only quite weakly on and ; and two, oracle -values are generally for some (van de Geer, 2000; van de Geer and Muro, 2015; Bühlmann and Van De Geer, 2011). So long as , will contain the optimal penalty parameter. We do not consider settings where shrinks faster than a polynomial rate since the fitted models can be ill-behaved.
In the following sections, we do an in-depth study of additive models of the form
[TABLE]
We first consider parametric additive models (with potentially growing numbers of parameters) fitted with smooth and non-smooth penalties and then nonparametric additive models. We find that the Lipschitz function scales with . Applying Theorems 1 and 2, we find that the near-parametric term in the remainder only grows linearly in . We apply these results to various additive model estimation methods. For instance, in the generalized additive model example, we show that under minimal assumptions, the error from tuning penalty parameters is negligible compared to the error from solving the penalized regression problem with oracle penalty parameters.
4.1 Parametric additive models
Parametric additive models with model parameters have the form
[TABLE]
We denote the training criterion for training data as
[TABLE]
Suppose is the unique minimizer of the expected loss .
4.1.1 Parametric regression with smooth penalties
We begin with the simple case where the penalty functions are smooth. The following lemma states that the fitted models are Lipschitz in the penalty parameter vector. Given matrices and , means that is a positive semi-definite matrix.
Lemma 1.
Let where . For a fixed training dataset , suppose for all , has a unique minimizer
[TABLE]
Suppose for all , the parametric class is -Lipschitz in its parameters
[TABLE]
Further suppose for all , and are twice-differentiable with respect to for any fixed . Suppose there exists an such that the Hessian of the penalized training criterion at the minimizer satisfies
[TABLE]
where is a identity matrix. Then for any , Assumption 1 is satisfied over the set with function
[TABLE]
where .
Notice that Lemma 1 requires the training criterion to be strongly convex at its minimizer. This is satisfied in the following example involving multiple ridge penalties. If (4.23) is not satisfied by a penalized regression problem, one can consider a variant of the problem where the penalty functions are replaced with penalty functions for a fixed .
Example 1 (Multiple ridge penalties).
Let us consider fitting a linear model via ridge regression. If we can group covariates based on the similarity of their effects on the response, e.g. where is a vector of length , we can incorporate this prior information by penalizing each group of covariates differently:
[TABLE]
We tune the penalty parameters over the set via a training/validation split with training and validation sets and , respectively. For all the examples in this manuscript, let .
Via some algebra, we can derive (4.24) in Lemma 1; the details are deferred to the Supplementary Materials. Plugging this result into Corollary 1, we find that the parametric term (3.9) in the remainder is on the order of
[TABLE]
where So we have shown in this example that if the lower bound of shrinks at the polynomial rate , the near-parametric term in the remainder of the oracle inequality grows only linearly in its power .
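A minimal Python sketch of ridge regression with one penalty per covariate group is given below. It is an illustration under assumptions rather than the paper's exact formulation: the objective scaling, the two-group split, the simulated data, and the names `fit_group_ridge` and `val_loss` are all hypothetical. A small grid is used only because there are two penalties here.

```python
import numpy as np

def fit_group_ridge(X, y, groups, lams):
    """Minimize 0.5*||y - X b||^2 + 0.5 * sum_j lams[j] * ||b[groups[j]]||^2."""
    penalty = np.zeros(X.shape[1])
    for idx, lam in zip(groups, lams):
        penalty[idx] = lam  # every coefficient in group j gets penalty lams[j]
    return np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)

def val_loss(X_tr, y_tr, X_val, y_val, groups, lams):
    """Mean squared validation error of the model fitted with penalties lams."""
    beta = fit_group_ridge(X_tr, y_tr, groups, lams)
    return float(np.mean((y_val - X_val @ beta) ** 2))

rng = np.random.default_rng(2)
n, p = 200, 20
X = rng.normal(size=(n, p))
groups = [np.arange(0, 10), np.arange(10, 20)]  # hypothetical a priori grouping
beta_true = np.concatenate([np.ones(10), np.zeros(10)])  # only group 1 has signal
y = X @ beta_true + rng.normal(size=n)
X_tr, y_tr, X_val, y_val = X[:100], y[:100], X[100:], y[100:]

grid = [10.0 ** k for k in range(-2, 4)]
# One shared penalty (a single hyper-parameter) versus one penalty per group.
best_shared = min(val_loss(X_tr, y_tr, X_val, y_val, groups, (l, l)) for l in grid)
best_separate = min(
    val_loss(X_tr, y_tr, X_val, y_val, groups, (l1, l2)) for l1 in grid for l2 in grid
)
print("shared:", best_shared, "separate:", best_separate)
```

Because the shared-penalty candidates are a subset of the two-penalty grid, the minimized validation loss can only go down when the penalties are untied; the oracle inequalities address how much of that improvement carries over to the generalization error.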
In the next example, we consider generalized additive models (GAMs) (Hastie and Tibshirani, 1990). Though GAMs are nonparametric models, it is well-known that they are equivalent to solving a finite-dimensional problem (Green and Silverman, 1993; O’Sullivan et al., 1986; Buja et al., 1989). By reformulating GAMs as parametric models instead, we can establish oracle inequalities for tuning the penalty parameters via training/validation split. Here we present an outline of the procedure; the details can be found in the Supplementary Materials.
Example 2 (Multiple Sobolev penalties).
To fit a generalized additive model over the domain where , a typical setup is to solve
[TABLE]
where the penalty function is the 2nd-order Sobolev norm. Let for this example. Using properties of the Sobolev penalty, (4.27) can be re-expressed as a finite-dimensional problem with matrices
[TABLE]
Let be the covariates in the training data stacked together. If is invertible, we can derive the closed-form solution for (4.28). From there, we can directly calculate (4.24) in Lemma 1. Plugging this result into Corollary 1, we find that the parametric term in the remainder is on the order of
[TABLE]
where is the spectral norm and is the smallest distance between observations of the th covariates in the training data .
In particular, for , the smoothing spline estimate (4.27) is shown to attain the minimax optimal rate of if the penalty parameters shrink at the rate of (Sadhanala and Tibshirani, 2017; Horowitz et al., 2006). From Corollary 1, we see that the oracle error (3.8) asymptotically dominates the additional error terms incurred from tuning the penalty parameters. Moreover, as long as we choose for any , the model selected via training/validation split will also attain the minimax rate.
4.1.2 Parametric regression with non-smooth penalties
If the penalty functions are non-smooth, similar results do not necessarily hold. Nonetheless we find that for many popular non-smooth penalty functions, such as the lasso (Tibshirani, 1996) and group lasso (Yuan and Lin, 2006), the fitted functions are still smoothly parameterized by almost everywhere. To characterize such problems, we begin with the following definitions from Feng and Simon (2017):
Definition 1.
The differentiable space of function at is
[TABLE]
Definition 2.
Let be a function with a unique minimizer. is a local optimality space of over if
[TABLE]
Using the definitions above, we can characterize the penalty parameters where the fitted functions are well-behaved.
Condition 1.
For every , there exists a ball with nonzero radius centered at such that
- For all , the training criterion is twice differentiable with respect to at along directions in the product space
[TABLE]
- is a local optimality space for over .
In addition, we need nearly all penalty parameters to be in .
Condition 2.
has Lebesgue measure zero, i.e. .
For instance, in the lasso, is the sections of the lasso-path in between the knots. As the knots in the lasso-path are countable, the set outside has measure zero.
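The role of the knots can be seen in the simplest possible case, a lasso fit with a single covariate, where the solution is given by soft-thresholding. The sketch below is illustrative only: the objective scaling and the summary statistics `sxx` and `sxy` are assumed values, and `lasso_1d` is a hypothetical helper name.

```python
def lasso_1d(sxx, sxy, lam):
    """Minimizer of 0.5*sum((y - x*b)^2) + lam*|b| for a single covariate.

    sxx = sum(x^2) and sxy = sum(x*y); the solution is a soft-thresholded
    least-squares coefficient.
    """
    magnitude = max(abs(sxy) - lam, 0.0) / sxx
    return magnitude if sxy >= 0 else -magnitude

sxx, sxy = 4.0, 6.0   # assumed summary statistics of the training data
knot = abs(sxy)       # penalty value at which the coefficient reaches zero

for lam in (0.0, 2.0, 4.0, 6.0, 8.0):
    print(lam, lasso_1d(sxx, sxy, lam))

# One-sided slopes of the path at the knot: they differ, so the path is not
# differentiable there, yet it is Lipschitz with constant 1/sxx everywhere.
h = 0.5
left_slope = (lasso_1d(sxx, sxy, knot) - lasso_1d(sxx, sxy, knot - h)) / h
right_slope = (lasso_1d(sxx, sxy, knot + h) - lasso_1d(sxx, sxy, knot)) / h
print(left_slope, right_slope)
```

Between knots the path is linear in the penalty and the active set is fixed, which is the behavior Conditions 1 and 2 are designed to capture; the single knot in this example has measure zero.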
Assuming the above conditions hold, the fitted models for non-smooth penalty functions satisfy the same Lipschitz relation as that in Lemma 1.
Lemma 2.
Let where . Suppose that for all , satisfies (4.22) over . Suppose for training data , the penalized loss function has a unique minimizer for every . Let be an orthonormal matrix with columns forming a basis for the differentiable space of at . Suppose there exists a constant such that the Hessian of the penalized training criterion at the minimizer taken with respect to the directions in satisfies
[TABLE]
where is the identity matrix. Suppose Conditions 1 and 2 are satisfied. Then any satisfies Assumption 1 over with defined in (4.24).
As an example, we consider multiple elastic net penalties where the penalty parameters are tuned by training/validation split and cross-validation.
Example 3 (Multiple elastic nets, training/validation split).
Suppose we would like to fit a linear model via the elastic net. If the covariates are grouped a priori, we can penalize each group differently using the following objective
[TABLE]
where is a fixed constant. Here we briefly sketch the process for deriving the oracle inequality when the penalty parameters are tuned via a training/validation split over . Details are given in the Supplementary Materials.
First we check that all the conditions are satisfied. For this problem, the differentiable space is the subspace spanned by the non-zero elements in . Since the elastic net solution paths are piecewise linear (Zou and Hastie, 2003), the differentiable space is also a local optimality space. Then using a similar procedure as in Example 1, we find that the parametric term in the remainder of Corollary 1 is on the order of
[TABLE]
where .
We can compare this additional error term to the risk of using an oracle penalty parameter. For the case of a single penalty parameter (), the convergence rate of using an oracle penalty parameter for the elastic net is on the order of (Bunea et al., 2008; Hebiri et al., 2011). If we split the covariates into groups and tune the penalty parameters via training/validation split, the incurred error (4.35) is on a similar order.
Example 4 (Multiple elastic nets, cross-validation).
Now we establish an oracle inequality for the averaged version of -fold cross-validation using a similar setup as Lecué and Mitchell (2012). Suppose the noise is sub-Gaussian and for simplicity, suppose is drawn uniformly from . In order to satisfy the assumptions in Theorem 2, our fitting procedure for entails a thresholding operation similar to that in Lecué and Mitchell (2012). In particular, we fit parameters where the -th element is
[TABLE]
where is the solution to (4.34) and is some fixed constant. We then find the Lipschitz factor in Lemma 3 and bound its Orlicz norm via exponential concentration inequalities. Let be the fitted parameters using the averaged version of -fold cross-validation. By Theorem 2, there is some constant , such that for any
[TABLE]
The above example is similar to the lasso example in Lecué and Mitchell (2012); the major difference is that we consider the case where the penalty parameters are tuned over a continuous range. We are able to do this since Lemma 2 specifies a Lipschitz relation between the fitted functions and the penalty parameters. This result is relevant when is large and must be tuned via a continuous optimization procedure.
4.2 Nonparametric additive models
We now consider nonparametric additive models of the form
[TABLE]
where are penalty functionals and are linear spaces of univariate functions. Let be the minimizer of the generalization error
[TABLE]
We obtain a similar Lipschitz relation in the nonparametric setting to those before.
Lemma 3.
Let and . Suppose the penalty functions are twice Gateaux differentiable and convex over . Suppose there is a such that the second Gateaux derivative of the training criterion at for all satisfies
[TABLE]
where is the second Gateaux derivative taken in directions . Let For any , we have
[TABLE]
A simple example that satisfies (4.40) is a penalized regression model where we fit values at each of the observed covariates, e.g. , and penalize this fitted value by a ridge penalty. Note that such a penalty is allowed because the response in the validation set is not used by the training procedure.
Note that since Lemma 3 verifies that Assumption 1 is satisfied over the observed covariates, it is suitable to be used in Theorem 1. However (4.41) is not a strong enough statement to be used for Theorem 2.
5 Simulations
We now present a simulation study of the generalized additive model in Example 2 to understand how the performance changes as the number of penalty parameters increases. Corollary 1 suggests that there are two opposing forces that affect the error of the fitted model. On one hand, (3.9) is linear in so increasing can increase the error. On the other hand, (3.8) decreases for larger model spaces, so increasing may decrease the error. We isolate these two behaviors via two simulation setups.
The data is generated as the sum of univariate functions , where are i.i.d. standard Gaussian random variables and is chosen such that the signal-to-noise ratio is two. is drawn from a uniform distribution over . We fit models by minimizing (4.27). To vary the number of free penalty parameters, we constrain certain to be equal while allowing others to be completely free. (For instance, for a single penalty parameter, we constrain for to be the same value.) The penalty parameters are tuned using a training/validation split.
Simulation 1: The true function is the sum of identical sinusoids for . Since the univariate functions are the same, the oracle risk should be roughly constant as we increase the number of free penalty parameters. The validation loss difference
[TABLE]
should grow linearly in for this simulation setup.
Simulation 2: The true function is the sum of sinusoids with increasing frequency for . Since the Sobolev norms of increase with , we expect the penalty parameters that attain the oracle risk to be monotonically decreasing, i.e. . As the number of penalty parameters increases, we expect the oracle risk to shrink. If the oracle risk shrinks fast enough, performance of the selected model should improve.
For both simulations, we use . Each simulation was replicated forty times with 200 training and 200 validation samples. We consider free penalty parameters by structuring the penalty parameters in a nested fashion: for each , we constrained to be equal for . Penalty parameters were tuned using nlm in R with initializations at . We did not use grid-search since it is computationally intractable for large numbers of penalty parameters. Multiple initializations were required since the validation loss is not convex in the penalty parameters.
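The multi-start strategy described above can be sketched in Python. This is not the authors' code: it swaps R's nlm for SciPy's Nelder-Mead optimizer, uses a grouped ridge model in place of the generalized additive model, and runs on simulated data. The starting points, the clipping range for the log-penalties, and the names `fit_group_ridge`, `validation_loss`, and `multistart_tune` are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def fit_group_ridge(X, y, groups, lams):
    """Minimize 0.5*||y - X b||^2 + 0.5 * sum_j lams[j] * ||b[groups[j]]||^2."""
    penalty = np.zeros(X.shape[1])
    for idx, lam in zip(groups, lams):
        penalty[idx] = lam
    return np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)

def validation_loss(log_lams, X_tr, y_tr, X_val, y_val, groups):
    # Work on the log scale so penalties stay positive; clip to avoid overflow.
    lams = np.exp(np.clip(log_lams, -10.0, 10.0))
    beta = fit_group_ridge(X_tr, y_tr, groups, lams)
    return float(np.mean((y_val - X_val @ beta) ** 2))

def multistart_tune(starts, *data):
    """Run a local optimizer from each start and keep the best local solution."""
    best = None
    for start in starts:
        res = minimize(validation_loss, start, args=data, method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    return np.exp(np.clip(best.x, -10.0, 10.0)), best.fun

rng = np.random.default_rng(4)
n, p = 200, 12
X = rng.normal(size=(n, p))
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]  # hypothetical groups
beta_true = np.concatenate([np.ones(4), 0.2 * np.ones(4), np.zeros(4)])
y = X @ beta_true + rng.normal(size=n)
data = (X[:100], y[:100], X[100:], y[100:], groups)

# Several initializations, since the validation loss need not be convex.
starts = [np.full(len(groups), s) for s in (-4.0, 0.0, 4.0)]
lams_hat, loss_hat = multistart_tune(starts, *data)
print("tuned penalties:", lams_hat, "validation loss:", loss_hat)
```

Each local run can only reach a local minimizer of the validation loss, so keeping the best of several runs is a heuristic; it does not guarantee the global minimizer that the theory assumes.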
As expected, the validation loss difference increases with the number of penalty parameters in Simulation 1 (Figure 1(a)). To see if our oracle inequalities match the empirical results, we regressed the logarithm of the validation loss difference against the logarithm of the number of penalty parameters. We fit the model using simulation results with at least two penalty parameters as the data is highly skewed for the single penalty parameter case. We estimated a slope of 1.00 (standard error 0.15), which suggests that the validation loss difference grows linearly in the number of penalty parameters. Interestingly, including the single parameter case gives us a slope of 1.45 (standard error 0.14). This suggests that our oracle inequality might not be tight for the single penalty parameter case.
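The log-log regression used above to check for linear growth can be sketched as follows. The data here are synthetic and generated to grow linearly in the number of penalty parameters; they are not the paper's simulation results, and the values in `J_values`, the noise level, and the constant 0.01 are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical numbers of penalty parameters, 40 replicates each.
J_values = np.repeat([2, 4, 8, 16], 40)
# Synthetic loss differences that grow linearly in J, with multiplicative noise.
loss_diff = 0.01 * J_values * np.exp(rng.normal(scale=0.5, size=J_values.size))

# Ordinary least squares of log(loss difference) on log(J).
x, z = np.log(J_values), np.log(loss_diff)
x_centered = x - x.mean()
slope = np.sum(x_centered * (z - z.mean())) / np.sum(x_centered ** 2)
intercept = z.mean() - slope * x.mean()
residuals = z - (intercept + slope * x)
# Standard error of the slope from the residual variance.
std_error = np.sqrt(np.sum(residuals ** 2) / (x.size - 2) / np.sum(x_centered ** 2))
print("slope:", round(slope, 2), "standard error:", round(std_error, 2))
```

A slope near one on the log-log scale corresponds to growth proportional to the number of penalty parameters, while a slope above one would indicate faster than linear growth.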
For Simulation 2, the validation loss of the selected model decreases as the number of penalty parameters increases. As suggested in Figure 1(b), the validation loss of the selected model decreases because the oracle risk is decreasing at a faster rate than the rate at which the additional error (3.9) grows.
These simulation results suggest that adding more hyper-parameters can improve model estimates. Having a separate penalty parameter allows GAMs to fit components with differing smoothness. However if we know a priori that the components have the same smoothness, then it is best to use a single penalty parameter.
6 Discussion
In this manuscript, we have characterized the generalization error of split-sample procedures that tune multiple hyper-parameters. If the estimated models are Lipschitz in the hyper-parameters, the generalization error of the selected model is upper bounded by a combination of the oracle risk and a near-parametric term in the number of hyper-parameters. These results show that adding hyper-parameters can decrease the generalization error of the selected model if the oracle risk decreases by a sufficient amount. In the semi- or non-parametric setting, the error incurred from tuning hyper-parameters is dominated by the oracle risk asymptotically; adding hyper-parameters has a negligible effect on the generalization error of the selected model. In the parametric setting, the error incurred from tuning hyper-parameters is on the same order as the oracle error; one should be careful about adding hyper-parameters, though they are not more “costly” than model parameters.
We also showed that many penalized regression examples satisfy the Lipschitz condition so our theoretical results apply. This implies that fitting models with multiple penalties and penalty parameters can be desirable, rather than the usual case with one or two penalty parameters.
One drawback of our theoretical results is that we have assumed that the selected hyper-parameter is a global minimizer of the validation loss. Unfortunately, this is not achievable in practice since the validation loss is not convex with respect to the hyper-parameters. This problem is exacerbated when there are many hyper-parameters since it is computationally infeasible to perform an exhaustive grid search. We hope to address this question in future research.
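To make the computational issue concrete: a grid with m values per hyper-parameter has m^J points when there are J hyper-parameters. The sketch below shows one common heuristic alternative, random search over split-sample validation loss, for a regression with one ridge penalty per coefficient block. It is an illustration under stated assumptions, not a procedure analyzed in this manuscript: the objective scaling, the function names, and the search range are all hypothetical, and random search does not guarantee a global minimizer.

```python
import numpy as np

def fit_multi_ridge(X_groups, y, lambdas):
    """Minimize ||y - sum_j X_j b_j||^2 + sum_j lambdas[j] * ||b_j||^2 (assumed scaling)."""
    X = np.hstack(X_groups)
    # One penalty value per coefficient, repeated within each block
    penalty = np.concatenate(
        [np.full(Xg.shape[1], lam) for Xg, lam in zip(X_groups, lambdas)]
    )
    return np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)

def validation_loss(X_groups_val, y_val, beta):
    """Mean squared error on the validation set."""
    resid = y_val - np.hstack(X_groups_val) @ beta
    return float(np.mean(resid ** 2))

def random_search(train, val, num_draws, rng, log10_range=(-3.0, 2.0)):
    """Draw penalty vectors log-uniformly and keep the one with lowest validation loss."""
    X_groups_tr, y_tr = train
    X_groups_val, y_val = val
    best_loss, best_lambdas = np.inf, None
    for _ in range(num_draws):
        lambdas = 10.0 ** rng.uniform(
            log10_range[0], log10_range[1], size=len(X_groups_tr)
        )
        beta = fit_multi_ridge(X_groups_tr, y_tr, lambdas)
        loss = validation_loss(X_groups_val, y_val, beta)
        if loss < best_loss:
            best_loss, best_lambdas = loss, lambdas
    return best_loss, best_lambdas
```

The cost of random search is set by `num_draws` rather than by m^J, but the selected penalties are only the best among the sampled candidates, so the global-minimizer assumption in the theory is still not met.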
Appendix A Supplementary Materials
We will use the following notation: for functions and and a dataset with samples, we denote the inner product of and at covariates as .
A.1 A single training/validation split
Theorem 1 is a special case of Theorem 3, which applies to general model-estimation procedures. The proof is based on the so-called “basic inequality” below.
Lemma 4.
For any , we have
[TABLE]
Proof.
The desired result can be attained by rearranging the definition of
[TABLE]
∎
We are therefore interested in bounding the empirical process term in (A.43). A common approach is to use a measure of complexity of the function class. For a single training/validation split, where we treat the training set as fixed, we only need to consider the complexity of the fitted models from the model-selection procedure
[TABLE]
This model class can be considerably less complex compared to the original function class , such as the special case in Theorem 1 where we suppose is Lipschitz. For this proof, we will use metric entropy as a measure of model class complexity. We recall its definition below.
Definition 3.
Let be a function class. Let the covering number be the size of the smallest -cover of with respect to the norm . The metric entropy of is defined as the log of the covering number:
[TABLE]
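As a small numerical illustration of this definition, the sketch below upper-bounds the covering number of a finite set of fitted-value vectors with a greedy cover and takes its logarithm. The function names are hypothetical, and the plain Euclidean norm is used as a stand-in for the norm in the definition; the greedy cover is valid but not necessarily the smallest one, so the result is only an upper bound on the metric entropy.

```python
import math
import numpy as np

def greedy_cover_size(points, u):
    """Size of a greedily built u-cover whose centers are taken from `points`."""
    points = np.asarray(points, dtype=float)
    uncovered = np.ones(len(points), dtype=bool)
    num_centers = 0
    while uncovered.any():
        # Use the first still-uncovered point as a new center
        center = points[np.argmax(uncovered)]
        dist = np.linalg.norm(points - center, axis=1)
        # Points within distance u of the center are now covered
        uncovered &= dist > u
        num_centers += 1
    return num_centers

def entropy_upper_bound(points, u):
    """Log of the greedy cover size: an upper bound on the metric entropy at radius u."""
    return math.log(greedy_cover_size(points, u))
```

For instance, each row of `points` could hold the fitted values of one model on the validation covariates, with one row per candidate hyper-parameter value.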
We will bound the empirical process term using the following Lemma, which is a simplification of Corollary 8.3 in van de Geer (2000).
Lemma 5.
Suppose are fixed and are independent random variables with mean zero and uniformly sub-gaussian with parameters and . Suppose the model class satisfies and
[TABLE]
There is a constant dependent only on and such that for all satisfying
[TABLE]
we have
[TABLE]
We are now ready to prove the oracle inequality. It uses a standard peeling argument.
Theorem 3.
Consider a set of hyper-parameters . Let training data be fixed, as well as the covariates of the validation set . Let the oracle risk be denoted
[TABLE]
Suppose independent random variables for validation set have expectation zero and are uniformly sub-Gaussian with parameter and . Suppose there is a function and constant such that
[TABLE]
Also, suppose is non-increasing in for all .
Then there is a constant only depending on and such that for all satisfying
[TABLE]
we have
[TABLE]
Proof.
Consider any . We will use the simplified notation and . In addition, the following probabilities are all conditional on and but we leave them out for readability.
[TABLE]
where we applied the basic inequality (A.43) in the last line. Each summand in (A.53) can be bounded by splitting the event into the cases where either or is larger. Splitting up the probability and applying Cauchy-Schwarz gives us the following bound for (A.51)
[TABLE]
We can bound both (A.55) and (A.56) using Lemma 5. For our choice of in (A.49), there is some constant dependent only on such that (A.55) is bounded above by
[TABLE]
In addition, our choice of from (A.49) and our assumption that is non-increasing implies that the condition in Lemma 5 is satisfied for all simultaneously. Hence for all , we have
[TABLE]
Putting this all together, we have that there is a constant such that (A.51) is bounded above by
[TABLE]
∎
We can apply Theorem 3 to get Theorem 1. Before proceeding, we determine the entropy of when the functions are Lipschitz in the hyper-parameters.
Lemma 6.
Let where . Suppose is Lipschitz with function over . Then the entropy of with respect to is
[TABLE]
Proof.
Using a slight variation of the proof for Lemma 2.5 in van de Geer (2000), we can show
[TABLE]
Under the Lipschitz assumption, a -cover for is a -cover for . The covering number for wrt is bounded by the covering number for as follows
[TABLE]
∎
A.1.1 Proof for Theorem 1
Proof.
By Lemma 6, we have
[TABLE]
If we restrict , then for an absolute constant , we have
[TABLE]
Applying Theorem 3, we get our desired result. ∎
A.2 Cross-validation
In order to obtain an oracle inequality for the averaged version of cross-validation, we need to extend Theorem 3.5 in Lecué and Mitchell (2012). Let the class of fitted functions for given training data be denoted
[TABLE]
In Lecué and Mitchell (2012), they assume that there is a function that uniformly bounds the size of the class for any training data . However the complexity of depends on training data – for instance, if there is a lot of noise in the training data, the size of can be very high. In our extension, we allow the function to depend on the training data.
Throughout this section, we use Talagrand’s gamma function (Talagrand, 2005) to characterize the size of a function class. We present it below as it will be used later on.
Definition 4.
For metric space and , define
[TABLE]
where the infimum is taken over all sequences . (Here, denotes the cardinality of the set .)
We begin with some notation. Suppose we have a measurable space where we observe random variables with values in . Let be a class of measurable functions from ; the model-estimation procedure selects functions from the class . In contrast to the main manuscript, we will consider a very general setting. In particular, the noise is not necessarily independent of . In addition, we consider a general loss function (rather than solely the least squares loss). Define the risk function as the expected loss and suppose the risk function is convex. Let denote the averaged version of cross-validation and denote the minimizer of the risk function over .
In this more general setting, we require a more general version of Assumption 2:
Assumption 3.
There exist constants and such that for any and any dataset ,
[TABLE]
Our theorem relies on the basic inequality established in Lemma 3.1 in Lecué and Mitchell (2012). We reproduce it here for convenience. Henceforth, denotes absolute constants, which are not necessarily the same even if they share the same subscript.
Lemma 7.
For any constant , we have the following inequality
[TABLE]
where is the empirical probability measure on .
We need to bound the supremum of the second term on the right hand side, which is a shifted empirical process term. Lemma 3.4 in Lecué and Mitchell (2012) already bounds the shifted empirical process term. However to extend their result to our purposes, we restate it to clarify the conditional dependencies. This allows us to introduce two new functions and that will be used later on.
Lemma 8.
Let and . Suppose there exists and an increasing function such that ,
[TABLE]
Let . Suppose there exists a function that maps training data to , a function indexed by , and a constant such that for any dataset and any ,
[TABLE]
where .
Then there exist absolute constants such that for all and all ,
[TABLE]
Now that we have established a concentration inequality for the function class , we need to aggregate the results to establish a concentration inequality for the function class . Again, we use Lemma 3.2 in Lecué and Mitchell (2012) but restate it using our new functions and .
Lemma 9.
Let . Let be a set of measurable functions. For all and any dataset , suppose for all .
Suppose for any and dataset there exists some absolute constant such that for all and for all ,
[TABLE]
For any , suppose is strictly increasing and its inverse is strictly convex. Let be the convex conjugate of , i.e. for all . Assume there is a such that decreases. For all and , define
[TABLE]
Then there exists a constant that only depends on such that for every ,
[TABLE]
Moreover, assume that is an increasing function in such that . Then there exists a constant that depends only on and such that
[TABLE]
Finally, we are ready to bound the expectation of the shifted empirical process term in (A.71). We accomplish this via a simple chaining argument; we omit its proof as this is a standard application of the chaining argument.
Lemma 10.
Consider any . Suppose there exists a constant such that for any , , and , (A.74) holds. Then for any , we have
[TABLE]
Putting Lemmas 7 and 10 together, we have the following result.
Theorem 4.
Consider a set of hyper-parameters . Consider a loss function with convex risk function . Let
[TABLE]
Suppose Assumption 3 holds. Suppose there is an and functions and such that for all ,
[TABLE]
where . Moreover, suppose that for all , is a strictly increasing function and is strictly convex. Let the convex conjugate of be denoted . Suppose increases in , , and there exists such that decreases.
Consider any . Then there is a constant such that for every and , the following inequality holds
[TABLE]
where for all .
Of course, this theorem is only useful if we can show that is bounded with high probability. For instance, in an example in the main manuscript, we show that has sub-exponential tails; so the latter term in (A.76) is well-controlled.
We now apply Theorem 4 to prove Theorem 2. Recall that Theorem 2 concerns the squared error loss and only considers model-estimation methods where the estimated functions are Lipschitz in the hyper-parameters. First we need the following lemma that describes the relationship between Lipschitz functions
Lemma 11.
Suppose the same conditions as Theorem 4. Suppose Assumptions 1 and 2 hold. Also suppose that . Define for . Then there is an absolute constant such that
[TABLE]
then we also have
[TABLE]
for a constant that only depends on and .
Proof.
Let us first consider a general norm such that for any random variables , we have . Then for all such that , we have
[TABLE]
For , the norm is its own dual norm so (A.83) reduces to
[TABLE]
for an absolute constant .
For , the dual of the norm is . Thus applying Assumption 2 and the fact that , (A.83) reduces to
[TABLE]
∎
Talagrand’s gamma function of a class can be bounded by Dudley’s integral
[TABLE]
(Talagrand, 2005). Combining the above bound with Lemma 11 gives the following lemma.
Lemma 12.
Suppose Assumptions 1 and 2 hold. Suppose . Define as before. For , let . Let . Let be defined as before.
Then there exist absolute constants and a constant such that
[TABLE]
Proof.
By definition of , we have Using Lemma 11 and (A.84), we have
[TABLE]
Using very similar logic, we now bound the function. First we bound the diameter of with respect to the norm :
[TABLE]
Thus
[TABLE]
∎
To apply Theorem 4, we need to define and so that (A.75) is satisfied. Based on the lemma above, we see that it suffices to let
[TABLE]
and
[TABLE]
Finally using the results above, we can prove Theorem 2.
Proof for Theorem 2.
We now apply Theorem 4 to our Lipschitz case. From (A.91), we find that Assumption 3 is satisfied. We have defined and so that (A.75) is satisfied for all . Moreover, is strictly increasing and concave in . This implies that is strictly convex. Via algebra, we find that the convex conjugate of is
[TABLE]
Now let us determine as . We have
[TABLE]
So the summation in (A.76) reduces to
[TABLE]
Taking in (A.76) and plugging in (A.101) to Theorem 4, we get our desired result. ∎
A.3 Penalized regression for additive models
We now show that penalized regression problems for additive models satisfy the Lipschitz condition.
A.3.1 Proof for Lemma 1
Proof.
We will use the notation . By the gradient optimality conditions, we have
[TABLE]
After implicitly differentiating with respect to , we have
[TABLE]
From the product rule and chain rule, we can then write the system of equations in (A.103) as
[TABLE]
We can bound the norm of the second term in (A.104) by rearranging (A.102) and using the Cauchy-Schwarz inequality:
[TABLE]
Since is Lipschitz by assumption, then
[TABLE]
Also, by the definition of , we have
[TABLE]
Hence
[TABLE]
Plugging in the results from above and using the assumption that the Hessian of the objective function has a minimum eigenvalue of , we have for all
[TABLE]
Since the norm of the gradient is bounded, must be Lipschitz:
[TABLE]
Finally we combine the above results to get
[TABLE]
∎
A.3.2 Proof for Lemma 2
Before proving Lemma 2, we need to introduce some notation. Let be the line segment connecting and . Let be the 1-dimensional Lebesgue measure in the direction of (so if is a continuous line segment, ; if is composed of multiple line segments , then ).
Before proving the Lipschitz property over all of , we show that the fitted function is Lipschitz over . For convenience, define .
Lemma 13.
Suppose that satisfies the Lipschitz condition in Lemma 1. Let be a fixed set of training data. Suppose the penalized loss function has a unique minimizer for every . Let be an orthonormal matrix with columns forming a basis for the differentiable space of at . Suppose there exists a constant such that the Hessian of the penalized training criterion at the minimizer taken with respect to the directions in satisfies
[TABLE]
where is the identity matrix. Suppose Condition 1 is satisfied by some . Define
[TABLE]
Then any satisfies (4.24).
Proof.
From Condition 1, every point is the center of a ball with nonzero radius where the differentiable space within is constant.
Now consider any . By (A.118), there must exist a countable set of points where , , and the union of their differentiable neighborhoods cover entirely:
[TABLE]
Consider the intersections of boundaries of the differentiable neighborhoods with the line segment:
[TABLE]
Every point can be expressed as for some . We can order the points in by increasing to get the sequence .
By Condition 1, the differentiable space of the training criterion is constant over since each of these sub-segments is contained in some for . Moreover, the differentiable space over the interior of line segment can be decomposed as the product of differentiable spaces, which we denote as
[TABLE]
By Condition 1, (A.120) is also a local optimality space. Let be an orthonormal basis of for . For each , we can express for all as
[TABLE]
[TABLE]
We can show that the fitted parameters satisfy the Lipschitz condition (A.111) over by using a similar proof as in Lemma 1. The only difference is that the proof starts by taking directional derivatives along the columns of to establish the KKT conditions. Then for all and , we have
[TABLE]
We can sum these inequalities by the triangle inequality:
[TABLE]
Finally, using the fact that is -Lipschitz, we have by the triangle inequality and Cauchy-Schwarz that
[TABLE]
∎
In order to extend the result in Lemma 13 to all of , we need to show that is a set with measure zero.
Lemma 14.
Suppose Condition 2. Then where is the Lebesgue measure in and was defined in (A.118).
Proof.
Suppose for contradiction that . If this is the case, then there exists a ball contained in with nonzero radius centered at where and
[TABLE]
Suppose that . We claim that for a sufficiently small radius , we also have
[TABLE]
To see why this claim is true, let us define a monotonically decreasing sequence where for all and . By the monotone convergence theorem,
[TABLE]
By the definition of limits, there is some sufficiently large such that for , we have
[TABLE]
Given our ball is non-empty, there exist points where
[TABLE]
For any , the line
[TABLE]
has
[TABLE]
As the lines do not intersect for , then
[TABLE]
Thus
[TABLE]
However this is a contradiction of our assumption that . ∎
Finally, combining Lemmas 13 and 14, we can show that the Lipschitz condition is satisfied over all of .
Proof for Lemma 2.
Since we already showed Lemma 13, it suffices to show that the Lipschitz condition is satisfied for any . Lemma 14 states that , which means that there exists a sequence such that . As is continuous and we have assumed that there exists a unique minimizer of for all , then is continuous in over all . As is also continuous in , then for any , we have
[TABLE]
where is defined in (A.122). ∎
A.3.3 Proof for Lemma 3
Proof.
Let . For all , let
[TABLE]
For notational convenience, let . Consider the optimization problem
[TABLE]
By the gradient optimality conditions, we have
[TABLE]
Implicit differentiation with respect to gives us
[TABLE]
From the product rule and chain rule, we can write the system of equations from (A.137) as
[TABLE]
where is the loss in (A.136).
We now bound the second term in (A.138). From (A.136) and Cauchy-Schwarz, we have for all
[TABLE]
From the definition of , we know that . By definition of and , we also have
[TABLE]
Hence
[TABLE]
By (4.40), we know . So for all ,
[TABLE]
By the mean value inequality and Cauchy-Schwarz, we have
[TABLE]
By construction, . So we obtain our desired result in (4.41). ∎
A.4 Examples: detailed derivations
Example 1 (Multiple ridge penalties) Here we present the details for deriving (4.24) for Example 1. The additive components are linear functions that are -Lipschitz where . Then by Lemma 1, the fitted functions satisfy Assumption 1 over with
[TABLE]
where is defined in Example 1 of the main manuscript.
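The Lipschitz behavior in Example 1 can be checked numerically. The sketch below is only an illustration under assumptions that are not taken from the manuscript: the objective is scaled as ||y - Xb||^2 + sum_j lambda_j ||b_j||^2, distances between fitted functions are measured with the Euclidean norm at the training covariates, and `crude_lipschitz_bound` is a deliberately loose bound derived for this sketch (it is not the constant in the display above). All function names are hypothetical.

```python
import numpy as np

def multi_ridge_fitted(X_groups, y, lambdas):
    """Fitted values at the training covariates for block-wise ridge penalties."""
    X = np.hstack(X_groups)
    penalty = np.concatenate(
        [np.full(Xg.shape[1], lam) for Xg, lam in zip(X_groups, lambdas)]
    )
    beta = np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)
    return X @ beta

def empirical_lipschitz_ratio(X_groups, y, lam_min, lam_max, num_pairs, rng):
    """Largest observed ||fit(a) - fit(b)|| / ||a - b|| over random pairs in the box."""
    J = len(X_groups)
    worst = 0.0
    for _ in range(num_pairs):
        lam_a = rng.uniform(lam_min, lam_max, size=J)
        lam_b = rng.uniform(lam_min, lam_max, size=J)
        gap = np.linalg.norm(lam_a - lam_b)
        if gap == 0.0:
            continue
        diff = multi_ridge_fitted(X_groups, y, lam_a) - multi_ridge_fitted(X_groups, y, lam_b)
        worst = max(worst, float(np.linalg.norm(diff) / gap))
    return worst

def crude_lipschitz_bound(X_groups, y, lam_min):
    """Loose bound ||X||_op * ||X^T y|| / lam_min^2, valid for the assumed objective."""
    X = np.hstack(X_groups)
    return float(np.linalg.norm(X, 2) * np.linalg.norm(X.T @ y) / lam_min ** 2)
```

The crude bound follows from differentiating the normal equations with respect to the penalties and using that the penalized Gram matrix has smallest eigenvalue at least `lam_min`. Like the constant in the example, it grows as the lower end of the hyper-parameter range shrinks.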
Example 2 (Multiple Sobolev penalties) Here we present the details for deriving (4.24) for Example 2. Since the solution to (4.27) must be the sum of natural cubic splines (Buja et al., 1989), we can parameterize the space using a Reproducing Kernel Hilbert Space with inner product
[TABLE]
and the reproducing kernel
[TABLE]
(Heckman et al., 2012). Then one can instead solve for (4.27) over the functions of the form
[TABLE]
where the functions are split into a linear component and an orthogonal non-linear component
[TABLE]
For notational simplicity, we will also denote . We will also write
[TABLE]
Using this finite-dimensional representation, we find that
[TABLE]
where the matrix has elements Since any with non-zero will have a positive Sobolev penalty, then the matrix must be positive definite. Using the formulation above, we re-express (4.27) as the finite-dimensional problem
[TABLE]
where . In order to make the fitted functions identifiable, we add the usual constraint that for all . We also assume that is nonsingular to ensure that there is a unique .
The KKT conditions then give us
[TABLE]
where , is the identity matrix, and .
To apply Theorem 1, we need to characterize how varies with . Since we have the closed form solution to (A.153), we use it to directly bound the Lipschitz factor . From Green and Silverman (1993), we know that the value of the cubic on the interval can be defined using its values and second derivatives at the ends of the interval. Let . Then the value of the cubic
[TABLE]
Let be the vector of second derivatives of for observations in the training data. Since the fitted functions must be natural cubic splines, and have a linear relationship:
[TABLE]
where the matrix is a banded diagonally dominant matrix and is a banded negative-semi-definite matrix that depend on the covariates in the training data. For the definitions of and , refer to Green and Silverman (1993). Let be the smallest distance between observations of the th covariates in the training data . Then using the Gershgorin circle theorem (Gershgorin, 1931), one can show that all the eigenvalues of are larger than and all the eigenvalues of have magnitudes no greater than . Thus using (A.154) and (A.155), we have that
[TABLE]
for some absolute constant . To bound the second term on the right hand side, we know from (A.153) that
[TABLE]
if . Otherwise . Thus
[TABLE]
The eigenvalues of are bounded above by the largest row sum, which is no more than (assuming all training covariates are between 0 and 1). Putting the results above together, we have
[TABLE]
Also, we have from (A.152) that
[TABLE]
Finally we can conclude that
[TABLE]
By triangle inequality, we get the Lipschitz factor for the fitted model by summing up (A.164) for . We find that the Lipschitz factor in (4.24) is
[TABLE]
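The Gershgorin step used in Example 2 can be illustrated numerically. In the sketch below, `spline_band_matrix` builds a symmetric tridiagonal matrix with diagonal entries (h_i + h_{i+1})/3 and off-diagonal entries h_{i+1}/6, where the h_i are the gaps between sorted covariate values. These entries are an assumption of this sketch, written from the usual natural cubic spline construction, and should be checked against Green and Silverman (1993) before being relied upon. The function names are hypothetical.

```python
import numpy as np

def gershgorin_interval(M):
    """Interval [lo, hi] containing all eigenvalues of a real symmetric matrix M."""
    M = np.asarray(M, dtype=float)
    centers = np.diag(M)
    # Disc radius: sum of absolute off-diagonal entries in each row
    radii = np.sum(np.abs(M), axis=1) - np.abs(centers)
    return float(np.min(centers - radii)), float(np.max(centers + radii))

def spline_band_matrix(covariates):
    """Assumed tridiagonal, diagonally dominant matrix built from covariate gaps."""
    h = np.diff(np.sort(np.asarray(covariates, dtype=float)))
    n = len(h) - 1  # one row per interior covariate value
    R = np.zeros((n, n))
    for i in range(n):
        R[i, i] = (h[i] + h[i + 1]) / 3.0
        if i + 1 < n:
            R[i, i + 1] = R[i + 1, i] = h[i + 1] / 6.0
    return R
```

For distinct covariate values, each row's disc lies strictly to the right of zero, so the lower end of the interval gives a positive lower bound on the eigenvalues that depends on the smallest gap between observations.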
Example 3 (Multiple elastic nets, training-validation split) Here we check that all the conditions for Lemma 2 are satisfied.
First we check Condition 1. Since the absolute value function is twice-continuously differentiable everywhere except at zero, the directional derivatives of at only exist along directions spanned by the columns of . Thus the penalized training loss is twice differentiable with respect to the directions in
[TABLE]
Moreover, the elastic net solution paths are piecewise linear (Zou and Hastie, 2003). This implies that the nonzero indices of the elastic net estimates stay locally constant for almost every ; so (A.166) is also a local optimality space for . In addition, this implies that Condition 2 is satisfied.
We also check that the Hessian of the penalized training loss has a minimum eigenvalue bounded away from zero. Consider the following orthogonal basis of (A.166) at : where
[TABLE]
The Hessian matrix of with respect to directions is
[TABLE]
where and is the identity matrix with length equal to the number of nonzero elements in . Since the first summand is positive semi-definite and , (A.168) has a minimum eigenvalue of .
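The eigenvalue argument above (a positive semi-definite Gram matrix plus a positive multiple of the identity) can be checked numerically. This is a minimal sketch: `ridge_weight` is a hypothetical stand-in for the positive coefficient multiplying the identity in (A.168), and the squared-error term is assumed to be scaled so that its Hessian on the active set is the Gram matrix of the active columns.

```python
import numpy as np

def active_set_hessian(X, beta_hat, ridge_weight):
    """Hessian restricted to the nonzero coordinates of beta_hat (assumed scaling)."""
    active = np.flatnonzero(beta_hat)
    X_active = X[:, active]
    # Gram matrix is positive semi-definite; the ridge part shifts every eigenvalue up
    return X_active.T @ X_active + ridge_weight * np.eye(active.size)

def min_eigenvalue(H):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(H)[0])
```

When there are more active columns than observations, the Gram matrix is singular and the smallest eigenvalue equals `ridge_weight` up to rounding, so the bound cannot be improved without further assumptions on the design.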
Example 4 (Multiple elastic nets, cross-validation) Here we present details for establishing an oracle inequality when multiple elastic net penalties are tuned via the averaged version of -fold cross-validation. First we check the conditions in Theorem 2 are satisfied. In the problem setup, is a log-concave vector and for some constant . Using a similar procedure as Lecué and Mitchell (2012), we can then show that (3.14) and (3.15) in Assumption 2 are satisfied with .
Next we find the Lipschitz factor. We can upper bound the Lipschitz factor of the thresholded model with the Lipschitz factor of the un-thresholded model. So Assumption 1 is satisfied over with
[TABLE]
Finally, to apply Theorem 2, we must find a bound for (3.16). Let . Using the fact that is a linear function of , which is a sub-exponential random variable, we have that
[TABLE]
for constants . Plugging in this bound to Theorem 2 gives us our desired result.
- Arlot et al. [2010] Sylvain Arlot, Alain Celisse, et al. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.
- Györfi et al. [2006] László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
- Van Der Laan and Dudoit [2003] Mark J Van Der Laan and Sandrine Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. 2003.
- van der Laan et al. [2004] Mark J van der Laan, Sandrine Dudoit, and Sunduz Keles. Asymptotic optimality of likelihood-based cross-validation. Statistical Applications in Genetics and Molecular Biology, 3(1):1–23, 2004.
- Lecué and Mitchell [2012] Guillaume Lecué and Charles Mitchell. Oracle inequalities for cross-validation type procedures. Electronic Journal of Statistics, 6:1803–1837, 2012.
- Bengio [2000] Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.
- Zou and Hastie [2003] Hui Zou and Trevor Hastie. Regression shrinkage and selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67:301–320, 2003.
- Simon et al. [2013] Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013.
