An analysis of the cost of hyper-parameter selection via split-sample validation, with applications to penalized regression

Jean Feng; Noah Simon

arXiv:1903.12297·stat.ML·April 1, 2019

TL;DR

This paper investigates how the generalization error grows with the number of hyper-parameters in model selection, providing finite-sample bounds and analyzing penalized regression with multiple penalties.

Contribution

It establishes finite-sample oracle inequalities for hyper-parameter tuning via split-sample validation and cross-validation, especially for penalized regression with multiple penalties.

Findings

1. Error from hyper-parameter tuning shrinks at nearly parametric rate for smooth models.

2. Adding hyper-parameters is akin to adding model parameters in parametric cases.

3. Lipschitz continuity of penalized models supports multiple penalty parameters.

Abstract

In the regression setting, given a set of hyper-parameters, a model-estimation procedure constructs a model from training data. The optimal hyper-parameters that minimize generalization error of the model are usually unknown. In practice they are often estimated using split-sample validation. Up to now, there is an open question regarding how the generalization error of the selected model grows with the number of hyper-parameters to be estimated. To answer this question, we establish finite-sample oracle inequalities for selection based on a single training/test split and based on cross-validation. We show that if the model-estimation procedures are smoothly parameterized by the hyper-parameters, the error incurred from tuning hyper-parameters shrinks at nearly a parametric rate. Hence for semi- and non-parametric model-estimation procedures with a fixed number of hyper-parameters, this…

Equations

$$\hat{g}(\boldsymbol{\lambda}|T)=\operatorname*{arg\,min}_{g\in\mathcal{G}}\sum_{(x_{i},y_{i})\in T}\left(y_{i}-g(x_{i})\right)^{2}+\sum_{j=1}^{J}\lambda_{j}P_{j}(g)$$

$$\mathbb{E}\left[\left(y-\hat{g}(\hat{\boldsymbol{\lambda}}|T)(\boldsymbol{x})\right)^{2}\right]\leq(1+a)\underbrace{\inf_{\boldsymbol{\lambda}\in\Lambda}\mathbb{E}\left[\left(y-\hat{g}(\boldsymbol{\lambda}|T)(\boldsymbol{x})\right)^{2}\right]}_{\text{Oracle risk}}+\delta(J,n)$$

$$\left|\hat{g}^{(n_{T})}(\boldsymbol{\lambda}^{(1)}|D^{(n_{T})})(\boldsymbol{x})-\hat{g}^{(n_{T})}(\boldsymbol{\lambda}^{(2)}|D^{(n_{T})})(\boldsymbol{x})\right|\leq C_{\Lambda}(\boldsymbol{x}|D^{(n_{T})})\left\|\boldsymbol{\lambda}^{(1)}-\boldsymbol{\lambda}^{(2)}\right\|_{2}$$

$$\hat{\boldsymbol{\lambda}}\in\operatorname*{arg\,min}_{\boldsymbol{\lambda}\in\Lambda}\frac{1}{2}\left\|y-\hat{g}^{(n_{T})}(\boldsymbol{\lambda}|T)\right\|_{V}^{2}$$

$$\max_{i:(x_{i},y_{i})\in V}B^{2}\left(\mathbb{E}e^{|\epsilon_{i}|^{2}/B^{2}}-1\right)\leq b^{2}$$

$$\tilde{R}(X_{V}|T)=\min_{\boldsymbol{\lambda}\in\Lambda}\left\|g^{*}-\hat{g}^{(n_{T})}(\boldsymbol{\lambda}|T)\right\|_{V}^{2}$$

$$\delta^{2}\geq c\left(\frac{J\log(\|C_{\Lambda}(\cdot|T)\|_{V}\Delta_{\Lambda}n+1)}{n_{V}}\vee\sqrt{\frac{J\log(\|C_{\Lambda}(\cdot|T)\|_{V}\Delta_{\Lambda}n+1)}{n_{V}}\tilde{R}(X_{V}|T)}\right)$$

$$\Pr\left(\left\|g^{*}-\hat{g}^{(n_{T})}(\hat{\boldsymbol{\lambda}}|T)\right\|_{V}^{2}-\tilde{R}(X_{V}|T)\geq\delta^{2}\,\middle|\,T,X_{V}\right)$$

\begin{align}
\left\|g^{*}-\hat{g}^{(n_{T})}(\hat{\boldsymbol{\lambda}}|T)\right\|_{V}^{2}\leq{}&\min_{\boldsymbol{\lambda}\in\Lambda}\left\|g^{*}-\hat{g}^{(n_{T})}(\boldsymbol{\lambda}|T)\right\|_{V}^{2}\\
&+O_{p}\left(\frac{J\log(n\|C_{\Lambda}\|_{V}\Delta_{\Lambda})}{n_{V}}\right)\\
&+O_{p}\left(\sqrt{\frac{J\log(n\|C_{\Lambda}\|_{V}\Delta_{\Lambda})}{n_{V}}\min_{\boldsymbol{\lambda}\in\Lambda}\left\|g^{*}-\hat{g}^{(n_{T})}(\boldsymbol{\lambda}|T)\right\|_{V}^{2}}\right)
\end{align}

$$\mathcal{G}(T)=\left\{\hat{g}^{(n_{T})}(\boldsymbol{\lambda}|T):\boldsymbol{\lambda}\in\Lambda\right\}$$

$$\hat{\boldsymbol{\lambda}}$$

$$\bar{g}(D^{(n)})=\frac{1}{K}\sum_{k=1}^{K}\hat{g}^{(n_{T})}(\hat{\boldsymbol{\lambda}}|D_{-k}^{(n_{T})})$$

$$\left\|\left(y-\hat{g}^{(n_{T})}(\boldsymbol{\lambda}|D^{(n_{T})})\right)^{2}-\left(y-g^{*}\right)^{2}\right\|_{L_{\psi_{1}}}$$

$$\left\|\left(y-\hat{g}^{(n_{T})}(\boldsymbol{\lambda}|D^{(n_{T})})\right)^{2}-\left(y-g^{*}\right)^{2}\right\|_{L_{2}}$$

$$\tilde{h}(n_{T})\geq 1+\sum_{k=1}^{\infty}k\Pr\left(\|C_{\Lambda}(\cdot|D^{(n_{T})})\|_{L_{\psi_{2}}}\geq 2^{k}\sigma_{0}\right)$$

$$\mathbb{E}_{D^{(n)}}\left(\|\bar{g}(D^{(n)})-g^{*}\|_{L_{2}}^{2}\right)\leq(1+a)\inf_{\boldsymbol{\lambda}\in\Lambda}\left[\mathbb{E}_{D^{(n_{T})}}\left(\|\hat{g}(\boldsymbol{\lambda}|D^{(n_{T})})-g^{*}\|_{L_{2}}^{2}\right)\right]+c_{1}\left(\frac{1+a}{a}\right)^{2}\frac{J\log n_{V}}{n_{V}}K_{0}\left[\log(\Delta_{\Lambda}c_{K_{0},b}n\sigma_{0}+1)+1\right]\tilde{h}(n_{T})$$

$$g(\boldsymbol{x}^{(1)},...,\boldsymbol{x}^{(J)})=\sum_{j=1}^{J}g_{j}(\boldsymbol{x}^{(j)})$$

$$g(\boldsymbol{\theta})(\boldsymbol{x})=\sum_{j=1}^{J}g_{j}(\boldsymbol{\theta}^{(j)})(\boldsymbol{x}^{(j)})$$

$$L_{T}(\boldsymbol{\theta},\boldsymbol{\lambda})\coloneqq\frac{1}{2}\|y-g(\boldsymbol{\theta})\|_{T}^{2}+\sum_{j=1}^{J}\lambda_{j}P_{j}(\boldsymbol{\theta}^{(j)})$$

$$\left\{\hat{\boldsymbol{\theta}}^{(j)}(\boldsymbol{\lambda}|T)\right\}_{j=1}^{J}=\operatorname*{arg\,min}_{\boldsymbol{\theta}\in\mathbb{R}^{p}}L_{T}(\boldsymbol{\theta},\boldsymbol{\lambda})$$

$$\left|g_{j}(\boldsymbol{\theta}^{(1)})(\boldsymbol{x}^{(j)})-g_{j}(\boldsymbol{\theta}^{(2)})(\boldsymbol{x}^{(j)})\right|\leq\ell_{j}(\boldsymbol{x}^{(j)})\|\boldsymbol{\theta}^{(1)}-\boldsymbol{\theta}^{(2)}\|_{2}\quad\forall\boldsymbol{x}^{(j)}\in\mathcal{X}^{(j)}$$

$$\nabla_{\boldsymbol{\theta}}^{2}L_{T}(\boldsymbol{\theta},\boldsymbol{\lambda})\Big|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda}|T)}\succeq m(T)I\quad\forall\boldsymbol{\lambda}\in\Lambda$$

$$C_{\Lambda}(\boldsymbol{x}|T)=\frac{1}{m(T)\lambda_{\min}}\sqrt{\left(\|\epsilon\|_{T}^{2}+2C_{\Lambda}^{*}\right)\left(\sum_{j=1}^{J}\|\ell_{j}\|_{T}^{2}\ell_{j}^{2}(\boldsymbol{x}^{(j)})\right)}$$

$$L_{T}(\boldsymbol{\theta},\boldsymbol{\lambda})\coloneqq\frac{1}{2}\left\|y-\sum_{j=1}^{J}\boldsymbol{x}^{(j)}\boldsymbol{\theta}^{(j)}\right\|_{T}^{2}+\sum_{j=1}^{J}\frac{\lambda_{j}}{2}\|\boldsymbol{\theta}^{(j)}\|_{2}^{2}$$

$$\frac{Jt_{\min}}{n_{V}}\log\left(C_{T}^{*}n\sum_{j=1}^{J}\left(\frac{1}{n_{T}}\sum_{(x_{i},y_{i})\in T}\|x_{i}^{(j)}\|_{2}^{2}\right)\left(\frac{1}{n_{V}}\sum_{(x_{i},y_{i})\in V}\|x_{i}^{(j)}\|_{2}^{2}\right)\right)$$

$$\operatorname*{arg\,min}_{\alpha_{0}\in\mathbb{R},\,g_{j}}\frac{1}{2}\sum_{i\in D^{(n_{T})}}\left(y_{i}-\alpha_{0}-\sum_{j=1}^{J}g_{j}(x_{ij})\right)^{2}+\sum_{j=1}^{J}\lambda_{j}\int_{\mathcal{X}}\left(g_{j}^{\prime\prime}(x_{j})\right)^{2}dx_{j}$$

$$\operatorname*{arg\,min}_{\alpha_{0},\alpha_{1},\boldsymbol{\theta}}\frac{1}{2}\left\|y-\alpha_{0}\boldsymbol{1}-\boldsymbol{x}\alpha_{1}-\sum_{j=1}^{J}K_{j}\boldsymbol{\theta}^{(j)}\right\|_{T}^{2}+\frac{1}{2}\sum_{j=1}^{J}\lambda_{j}\boldsymbol{\theta}^{(j)\top}K_{j}\boldsymbol{\theta}^{(j)}$$

$$\frac{Jt_{\min}}{n_{V}}\log\left(nJ\|y\|_{T}\left(J\left\|(X_{T}^{\top}X_{T})^{-1}X_{T}^{\top}\right\|_{2}+\sum_{j=1}^{J}h_{j}^{-2}(T)\right)\right)$$

$$\Omega^{f}(\boldsymbol{\theta})=\left\{\boldsymbol{\beta}\,\middle|\,\lim_{\epsilon\to 0}\frac{f(\boldsymbol{\theta}+\epsilon\boldsymbol{\beta})-f(\boldsymbol{\theta})}{\epsilon}\text{ exists}\right\}$$

$$\operatorname*{arg\,min}_{\boldsymbol{\theta}\in\mathbb{R}^{p}}f(\boldsymbol{\theta},\boldsymbol{\lambda})=\operatorname*{arg\,min}_{\boldsymbol{\theta}\in S}f(\boldsymbol{\theta},\boldsymbol{\lambda})\quad\forall\boldsymbol{\lambda}\in W$$

Full text

An analysis of the cost of hyper-parameter selection via split-sample validation, with applications to penalized regression

Jean Feng, Noah Simon

Department of Biostatistics, University of Washington

Abstract: In the regression setting, given a set of hyper-parameters, a model-estimation procedure constructs a model from training data. The optimal hyper-parameters that minimize generalization error of the model are usually unknown. In practice they are often estimated using split-sample validation. Up to now, there is an open question regarding how the generalization error of the selected model grows with the number of hyper-parameters to be estimated. To answer this question, we establish finite-sample oracle inequalities for selection based on a single training/test split and based on cross-validation. We show that if the model-estimation procedures are smoothly parameterized by the hyper-parameters, the error incurred from tuning hyper-parameters shrinks at nearly a parametric rate. Hence for semi- and non-parametric model-estimation procedures with a fixed number of hyper-parameters, this additional error is negligible. For parametric model-estimation procedures, adding a hyper-parameter is roughly equivalent to adding a parameter to the model itself. In addition, we specialize these ideas for penalized regression problems with multiple penalty parameters. We establish that the fitted models are Lipschitz in the penalty parameters and thus our oracle inequalities apply. This result encourages development of regularization methods with many penalty parameters.

Key words and phrases: Cross-validation, Regression, Regularization.

2 Introduction

Per the usual regression framework, suppose we observe response $y\in\mathbb{R}$ and predictors $\boldsymbol{x}\in\mathbb{R}^{p}$. Suppose $y$ is generated by a true model $g^{*}$ plus random error $\epsilon$ with mean zero, i.e. $y=g^{*}(\boldsymbol{x})+\epsilon$. Our goal is to estimate $g^{*}$. Many model-estimation procedures can be formulated as selecting a model from some function class $\mathcal{G}$ given training data $T$ and a $J$-dimensional hyper-parameter vector $\boldsymbol{\lambda}$. For example, in penalized regression problems, the fitted model can be expressed as the minimizer of the penalized training criterion

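$$\hat{g}(\boldsymbol{\lambda}|T)=\operatorname*{arg\,min}_{g\in\mathcal{G}}\sum_{(x_{i},y_{i})\in T}\left(y_{i}-g(x_{i})\right)^{2}+\sum_{j=1}^{J}\lambda_{j}P_{j}(g),\tag{2.1}$$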

where $P_{j}$ are penalty functions and $\lambda_{j}$ are penalty parameters that serve as hyper-parameters of the model-estimation procedure.

If $\Lambda$ is a set of possible hyper-parameters, the goal is to find a penalty parameter $\boldsymbol{\lambda}\in\Lambda$ that minimizes the expected generalization error $\mathbb{E}\left[\left(y-\hat{g}(\boldsymbol{\lambda}|T)(\boldsymbol{x})\right)^{2}\right]$. Typically one uses a sample-splitting procedure where models are trained on a random partition of the observed data and evaluated on the remaining data. One then chooses the hyper-parameter $\hat{\boldsymbol{\lambda}}$ that minimizes the error on this validation set. For a more complete review of cross-validation, refer to Arlot et al. (2010).

The performance of split-sample validation procedures is typically characterized by an oracle inequality that bounds the expected generalization error of the model selected by the validation procedure. For finite $\Lambda$, oracle inequalities have been established for a single training/validation split (Györfi et al., 2006) and a general cross-validation framework (Van Der Laan and Dudoit, 2003; van der Laan et al., 2004). To handle $\Lambda$ over a continuous range, one can use entropy-based approaches (Lecué and Mitchell, 2012).

The goal of this paper is to characterize the performance of models when the hyper-parameters are tuned by some split-sample validation procedure. We are particularly interested in an open question raised in Bengio (2000): what is the “amount of overfitting… when too many hyper-parameters are optimized”? In addition, how many hyper-parameters is “too many”? In this paper we show that a large number of hyper-parameters can in fact be tuned without overfitting. If an oracle estimator converges at rate $R(n)$, then the number of hyper-parameters $J$ can grow at roughly a rate of $J=O_{p}(nR(n))$, up to log terms, without affecting the convergence rate. In practice, for penalized regression, this means that one can propose and tune over much more complex models than are currently often used.

To show these results, we prove that finite-sample oracle inequalities of the form

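$$\mathbb{E}\left[\left(y-\hat{g}(\hat{\boldsymbol{\lambda}}|T)(\boldsymbol{x})\right)^{2}\right]\leq(1+a)\underbrace{\inf_{\boldsymbol{\lambda}\in\Lambda}\mathbb{E}\left[\left(y-\hat{g}(\boldsymbol{\lambda}|T)(\boldsymbol{x})\right)^{2}\right]}_{\text{Oracle risk}}+\delta(J,n)\tag{2.2}$$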

are satisfied with high probability for some constant $a\geq 0$ and remainder $\delta(J,n)$ that depends on the number of tuned hyper-parameters $J$ and the number of samples $n$. Under the assumption that the model-estimation procedure is Lipschitz in the hyper-parameters, we find that $\delta$ scales linearly in $J$. For parametric model-estimation procedures, the additional error from tuning hyper-parameters is roughly $O_{p}(J/n)$, which is similar to the typical parametric model-estimation rate $O_{p}(p/n)$ where the model parameters are not regularized. For semi- and non-parametric model-estimation procedures, this error is generally dominated by the oracle risk, so we can actually grow the number of hyper-parameters without affecting the asymptotic convergence rate.

In addition, we specialize our results to penalized regression models of the form (2.1). The models in our examples are Lipschitz so that our oracle inequalities apply. This suggests that multiple penalty parameters may improve the model estimation and that the recent interest in combining penalty functions (e.g. elastic net and sparse group lasso (Zou and Hastie, 2003; Simon et al., 2013)) may have artificially restricted itself to two-way combinations.

During our literature search, we found few theoretical results relating the number of hyper-parameters to the generalization error of the selected model. Much of the previous work only considered tuning a one-dimensional hyper-parameter over a finite $\Lambda$ , proving asymptotic optimality (van der Laan et al., 2004) and finite-sample oracle inequalities (Van Der Laan and Dudoit, 2003; Györfi et al., 2006). Others have addressed split-sample validation for specific penalized regression problems with a single penalty parameter, such as linear model selection (Li, 1987; Shao, 1997; Golub et al., 1979; Chetverikov and Liao, 2016; Chatterjee and Jafarov, 2015). Only the results in Lecué and Mitchell (2012) are relevant to answering our question of interest. A potential reason for this dearth of literature is that, historically, tuning multiple hyper-parameters was computationally difficult. However there have been many recent proposals that address this computational hurdle (Bengio, 2000; Foo et al., 2008; Snoek et al., 2012).

Section 3 presents oracle inequalities for sample-splitting procedures to understand how the number of hyper-parameters affects the model error. Section 4 applies these results to penalized regression models. Section 5 provides a simulation study to support our theoretical results. Oracle inequalities for general model-estimation procedures and proofs are given in the Supplementary Materials.

3 Oracle Inequalities

Here we establish oracle inequalities for models where the hyper-parameters are tuned by a single training/validation split and cross-validation. We are interested in studying model-estimation procedures that vary smoothly in their hyper-parameters; such procedures tend to be easier to use and therefore tend to be more popular.

Let $D^{(n)}$ denote a dataset with $n$ samples. Given training data $D^{(m)}$, let $\hat{g}^{(m)}(\boldsymbol{\lambda}|D^{(m)})$ be some model-estimation procedure that maps hyper-parameter $\boldsymbol{\lambda}$ to a function in $\mathcal{G}$. We make the following Lipschitz-like assumption on the model-estimation procedure. In particular, we suppose that for any $\boldsymbol{x}$, the predicted value $\hat{g}^{(m)}(\boldsymbol{\lambda}|D^{(m)})(\boldsymbol{x})$ is Lipschitz in $\boldsymbol{\lambda}$:

Assumption 1.

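Suppose there is a set $\mathcal{X}^{(L)}\subseteq\mathcal{X}$ such that for any $n_{T}\in\mathbb{N}$ and dataset $D^{(n_{T})}$, there is a function $C_{\Lambda}(\boldsymbol{x}|D^{(n_{T})}):\mathcal{X}^{(L)}\mapsto\mathbb{R}^{+}$ such that for any $\boldsymbol{x}\in\mathcal{X}^{(L)}$, we have for all $\boldsymbol{\lambda}^{(1)},\boldsymbol{\lambda}^{(2)}\in\Lambda$

$$\left|\hat{g}^{(n_{T})}(\boldsymbol{\lambda}^{(1)}|D^{(n_{T})})(\boldsymbol{x})-\hat{g}^{(n_{T})}(\boldsymbol{\lambda}^{(2)}|D^{(n_{T})})(\boldsymbol{x})\right|\leq C_{\Lambda}(\boldsymbol{x}|D^{(n_{T})})\left\|\boldsymbol{\lambda}^{(1)}-\boldsymbol{\lambda}^{(2)}\right\|_{2}.$$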

We provide examples of penalized regression models that satisfy this assumption in Section 4.
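As an informal illustration of Assumption 1, the sketch below estimates how fast a fitted prediction moves as a single penalty parameter changes. Ordinary ridge regression, the helper names `ridge_predict` and `empirical_lipschitz`, and the simulated data are all assumptions chosen for illustration, not part of the paper. The largest observed ratio is only a lower bound on the smallest constant $C_{\Lambda}(\boldsymbol{x}|T)$ that works for the sampled penalty values.

```python
import numpy as np


def ridge_predict(lam, X_tr, y_tr, x_new):
    """Prediction at x_new from ridge regression with penalty parameter lam."""
    n, p = X_tr.shape
    beta = np.linalg.solve(X_tr.T @ X_tr / n + lam * np.eye(p), X_tr.T @ y_tr / n)
    return float(x_new @ beta)


def empirical_lipschitz(lams, X_tr, y_tr, x_new):
    """Largest |g(lam1)(x) - g(lam2)(x)| / |lam1 - lam2| over all pairs in lams."""
    preds = [ridge_predict(lam, X_tr, y_tr, x_new) for lam in lams]
    return max(
        abs(preds[i] - preds[j]) / abs(lams[i] - lams[j])
        for i in range(len(lams))
        for j in range(i + 1, len(lams))
    )


# Hypothetical simulated training data and evaluation point.
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(50, 3))
y_tr = X_tr @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=50)
x_new = np.array([1.0, 2.0, -1.0])
lams = np.linspace(0.1, 1.0, 10)
ratio = empirical_lipschitz(lams, X_tr, y_tr, x_new)
print(ratio)
```

For this ridge example the ratio cannot exceed $\|\boldsymbol{x}\|_{2}\,\|X_{T}^{\top}y/n_{T}\|_{2}/\lambda_{\min}^{2}$, which follows from differentiating the closed-form solution in the penalty parameter.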

3.1 A Single Training/Validation Split

In the training/validation split procedure, the dataset $D^{(n)}$ is randomly partitioned into a training set $T=(X_{T},Y_{T})$ and validation set $V=(X_{V},Y_{V})$ with $n_{T}$ and $n_{V}$ observations, respectively. The selected hyper-parameter $\hat{\boldsymbol{\lambda}}$ is a minimizer of the validation loss

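$$\hat{\boldsymbol{\lambda}}\in\operatorname*{arg\,min}_{\boldsymbol{\lambda}\in\Lambda}\frac{1}{2}\left\|y-\hat{g}^{(n_{T})}(\boldsymbol{\lambda}|T)\right\|_{V}^{2}$$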

where $\|h\|^{2}_{V}\coloneqq\frac{1}{n_{V}}\sum_{(x_{i},y_{i})\in V}h^{2}(x_{i},y_{i})$ for a function $h$.
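The selection rule for the single training/validation split can be sketched in a few lines. This is a minimal illustration under stated assumptions: ridge regression stands in for the model-estimation procedure, $\Lambda$ is replaced by a finite grid (the text works with a continuous $\Lambda$), and the names `fit_ridge`, `validation_loss` and `select_lambda` are hypothetical.

```python
import numpy as np


def fit_ridge(lam, X_tr, y_tr):
    """Ridge fit minimizing 0.5 * mean((y - X b)^2) + 0.5 * lam * ||b||^2."""
    n, p = X_tr.shape
    beta = np.linalg.solve(X_tr.T @ X_tr / n + lam * np.eye(p), X_tr.T @ y_tr / n)
    return lambda X: X @ beta


def validation_loss(predict, X_val, y_val):
    """0.5 * ||y - g||_V^2, where ||h||_V^2 is the mean of h^2 over the validation set."""
    return 0.5 * np.mean((y_val - predict(X_val)) ** 2)


def select_lambda(grid, X_tr, y_tr, X_val, y_val, fit=fit_ridge):
    """Return (lambda_hat, loss) minimizing the validation loss over the grid."""
    losses = [validation_loss(fit(lam, X_tr, y_tr), X_val, y_val) for lam in grid]
    best = int(np.argmin(losses))
    return grid[best], losses[best]


# Hypothetical simulated data.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(size=n)

# Random partition into a training set T and a validation set V.
perm = rng.permutation(n)
tr, val = perm[:150], perm[150:]
grid = np.logspace(-3, 1, 20)
lam_hat, loss_hat = select_lambda(grid, X[tr], y[tr], X[val], y[val])
print(lam_hat, loss_hat)
```

Passing `fit` as an argument keeps the selection step independent of the estimator, mirroring how the oracle inequalities treat the model-estimation procedure as a black box that is Lipschitz in $\boldsymbol{\lambda}$.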

We now present a finite-sample oracle inequality for the single training/validation split assuming Assumption 1 holds. Our oracle inequality is sharp, i.e. $a=0$ in (2.2), unlike most other work (Györfi et al., 2006; Lecué and Mitchell, 2012; Van Der Laan and Dudoit, 2003). Note that the result below is a special case of Theorem 3 in Supplementary Materials A.1, which applies to general model-estimation procedures.

Theorem 1.

Let $\Lambda=[\lambda_{\min},\lambda_{\max}]^{J}$ where $\Delta_{\Lambda}=\lambda_{\max}-\lambda_{\min}\geq 0$. Suppose random variables $\epsilon_{i}$ from the validation set $V$ are independent with expectation zero and are uniformly sub-Gaussian with parameters $b$ and $B$:

$$\max_{i:(x_{i},y_{i})\in V}B^{2}\left(\mathbb{E}e^{|\epsilon_{i}|^{2}/B^{2}}-1\right)\leq b^{2}.$$

Let the oracle risk be denoted

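$$\tilde{R}(X_{V}|T)=\min_{\boldsymbol{\lambda}\in\Lambda}\left\|g^{*}-\hat{g}^{(n_{T})}(\boldsymbol{\lambda}|T)\right\|_{V}^{2}.$$

Suppose Assumption 1 is satisfied over the set $X_{V}$. Then there is a constant $c>0$ only depending on $b$ and $B$ such that for all $\delta$ satisfying

$$\delta^{2}\geq c\left(\frac{J\log(\|C_{\Lambda}(\cdot|T)\|_{V}\Delta_{\Lambda}n+1)}{n_{V}}\vee\sqrt{\frac{J\log(\|C_{\Lambda}(\cdot|T)\|_{V}\Delta_{\Lambda}n+1)}{n_{V}}\tilde{R}(X_{V}|T)}\right)\tag{3.6}$$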

we have

[TABLE]

Theorem 1 states that with high probability, the excess risk, i.e. the error incurred during the hyper-parameter selection process, is no more than $\delta^{2}$. As seen in (3.6), $\delta^{2}$ is the maximum of two terms: a near-parametric term and the geometric mean of the near-parametric term and the oracle risk. To see this more clearly, we express Theorem 1 using asymptotic notation.

Corollary 1.

Under the assumptions given in Theorem 1, we have

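\begin{align}
\left\|g^{*}-\hat{g}^{(n_{T})}(\hat{\boldsymbol{\lambda}}|T)\right\|_{V}^{2}\leq{}&\min_{\boldsymbol{\lambda}\in\Lambda}\left\|g^{*}-\hat{g}^{(n_{T})}(\boldsymbol{\lambda}|T)\right\|_{V}^{2}\tag{3.8}\\
&+O_{p}\left(\frac{J\log(n\|C_{\Lambda}\|_{V}\Delta_{\Lambda})}{n_{V}}\right)\tag{3.9}\\
&+O_{p}\left(\sqrt{\frac{J\log(n\|C_{\Lambda}\|_{V}\Delta_{\Lambda})}{n_{V}}\min_{\boldsymbol{\lambda}\in\Lambda}\left\|g^{*}-\hat{g}^{(n_{T})}(\boldsymbol{\lambda}|T)\right\|_{V}^{2}}\right).\tag{3.10}
\end{align}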

Corollary 1 shows that the risk of the selected model is bounded by the oracle risk, the near-parametric term (3.9), and the geometric mean of the two values (3.10). We refer to (3.9) as near-parametric because the error term in (un-regularized) parametric regression models is typically $O_{p}(J/n)$, where $J$ is the parameter dimension and $n$ is the number of training samples. Analogously, (3.9) is $O_{p}(J/n_{V})$ modulo a $\log n$ term in the numerator. The geometric mean (3.10) can be thought of as a consequence of tuning hyper-parameters over

$$\mathcal{G}(T)=\left\{\hat{g}^{(n_{T})}(\boldsymbol{\lambda}|T):\boldsymbol{\lambda}\in\Lambda\right\}.$$

As $\mathcal{G}(T)$ does not (or is very unlikely to) contain the true model $g^{*}$, tuning the hyper-parameters via training/validation split is tuning over a misspecified model class. The geometric mean takes into account this misspecification error.

In the semi- and non-parametric regression settings, the oracle error usually shrinks at a rate of $O_{p}(n_{T}^{-\omega})$ where $\omega\in(0,1)$. If the number of hyper-parameters is fixed and $n$ is large, the oracle risk will tend to dominate the upper bound. Hence for such problems, we can actually let the number of hyper-parameters grow: the asymptotic convergence rate of the upper bound will be unchanged as long as $J$ grows no faster than $O_{p}\left(\frac{n_{V}n_{T}^{-\omega}}{\log(n\|C_{\Lambda}\|_{V}\Delta_{\Lambda})}\right)$.
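To get a feel for this growth condition, the following sketch compares the near-parametric term with an oracle rate $n_{T}^{-\omega}$ for concrete sample sizes. It is only a back-of-the-envelope illustration: all constants are ignored, the product $\|C_{\Lambda}\|_{V}\Delta_{\Lambda}$ is replaced by a hypothetical fixed number `scale`, and the function names and sample sizes are assumptions.

```python
import math


def near_parametric_term(J, n, n_V, scale=1.0):
    """J * log(n * scale) / n_V, with `scale` standing in for ||C||_V * Delta."""
    return J * math.log(n * scale) / n_V


def max_hyperparameters(n, n_T, n_V, omega, scale=1.0):
    """Largest J whose near-parametric term stays below the rate n_T^(-omega)."""
    return math.floor(n_V * n_T ** (-omega) / math.log(n * scale))


# Hypothetical sample sizes and oracle rate exponent.
n, n_T, n_V, omega = 1_000_000, 800_000, 200_000, 0.5
J_max = max_hyperparameters(n, n_T, n_V, omega)
oracle_rate = n_T ** (-omega)
print(J_max, near_parametric_term(J_max, n, n_V), oracle_rate)
```

Because the bound is asymptotic, the value of `J_max` should be read as an order of magnitude rather than a usable threshold.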

3.2 Cross-Validation

Now we give an oracle inequality for $K$-fold cross-validation. Previously, the oracle inequality was with respect to the $L_{2}$-norm over the validation covariates. Now we give our result with respect to the functional $L_{2}$-norm. We suppose our dataset is composed of independent identically distributed observations $(X,y)$ where $X$ is independent of $\epsilon$. The functional $L_{2}$-norm is defined as $\left\|h\right\|^{2}_{L_{2}}=\int\left|h(x)\right|^{2}d\mu(x)$.

For $K$-fold cross-validation, we randomly partition the dataset $D^{(n)}$ into $K$ sets, which we assume to have equal size for simplicity. Partition $k$ will be denoted $D_{k}^{(n_{V})}$ and its complement will be denoted $D_{-k}^{(n_{T})}=D^{(n)}\setminus D_{k}^{(n_{V})}$. We train our model using $D_{-k}^{(n_{T})}$ for $k=1,...,K$ and select the hyper-parameter that minimizes the average validation loss

[TABLE]

In traditional cross-validation, the final model is retrained on all the data with $\hat{\boldsymbol{\lambda}}$. However, bounding the generalization error of the retrained model requires additional regularity assumptions (Lecué and Mitchell, 2012). We consider the “averaged version of $K$-fold cross-validation” instead:

$$\bar{g}(D^{(n)})=\frac{1}{K}\sum_{k=1}^{K}\hat{g}^{(n_{T})}(\hat{\boldsymbol{\lambda}}|D_{-k}^{(n_{T})}).\tag{3.13}$$
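The averaged version of $K$-fold cross-validation can be sketched as follows. This is an illustrative implementation under stated assumptions: ridge regression is the model-estimation procedure, $\Lambda$ is a finite grid, the validation loss is the mean squared error on each held-out fold, and `fit_ridge` and `averaged_kfold_cv` are hypothetical names.

```python
import numpy as np


def fit_ridge(lam, X, y):
    """Coefficients minimizing 0.5 * mean((y - X b)^2) + 0.5 * lam * ||b||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)


def averaged_kfold_cv(grid, X, y, K=5, seed=0):
    """Select lambda by average validation loss, then average the K fold fits."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    train_sets = [np.setdiff1d(np.arange(n), val) for val in folds]

    avg_loss = np.zeros(len(grid))
    for tr, val in zip(train_sets, folds):
        for i, lam in enumerate(grid):
            resid = y[val] - X[val] @ fit_ridge(lam, X[tr], y[tr])
            avg_loss[i] += np.mean(resid ** 2) / K
    lam_hat = grid[int(np.argmin(avg_loss))]

    # Averaged version: mean of the K models fitted on D_{-k} with lam_hat.
    # For linear models, averaging coefficients equals averaging predictions.
    fold_betas = np.array([fit_ridge(lam_hat, X[tr], y[tr]) for tr in train_sets])
    return lam_hat, fold_betas.mean(axis=0), fold_betas


# Hypothetical simulated data.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(size=100)
grid = np.logspace(-3, 1, 15)
lam_hat, beta_bar, fold_betas = averaged_kfold_cv(grid, X, y, K=5)
print(lam_hat, beta_bar)
```

Unlike the traditional procedure, no model is refitted on the full dataset: the final predictor is the mean of the $K$ fold-specific fits at the selected penalty.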

To bound the generalization error of (3.13), we require an assumption in Lecué and Mitchell (2012) that controls the tail behavior of the fitted models. A classical approach for bounding the tail behavior of random variable $X$ is to bound its Orlicz norm $\|X\|_{L_{\psi_{1}}}=\inf\{C>0:\mathbb{E}\exp(|X|/C)-1\leq 1\}$ (Van Der Vaart and Wellner, 1996).

Assumption 2.

There exist constants $K_{0},K_{1}\geq 0$ and $\kappa\geq 1$ such that for any $n_{T}\in\mathbb{N}$ , dataset $D^{(n_{T})}$ , and $\boldsymbol{\lambda}\in\Lambda$ , we have

[TABLE]

With the above assumption, the following oracle inequality bounds the risk of the averaged version of $K$-fold cross-validation. It is a special case of Theorem 4 in the Supplementary Materials, which extends Theorem 3.5 in Lecué and Mitchell (2012). The notation $\mathbb{E}_{D^{(m)}}$ indicates the expectation over random $m$-sample datasets $D^{(m)}$ drawn from the probability distribution $\mu$.

Theorem 2.

Let $\Lambda=[\lambda_{\min},\lambda_{\max}]^{J}$ where $\Delta_{\Lambda}=(\lambda_{\max}-\lambda_{\min})\vee 1$. Suppose random variables $\epsilon_{i}$ are independent with expectation zero, satisfy $\|\epsilon\|_{L_{\psi_{2}}}=b<\infty$, and are independent of $X$. Suppose Assumption 1 holds over the set $\mathcal{X}$ and Assumption 2 holds. Suppose there exists a function $\tilde{h}$ and some $\sigma_{0}>0$ such that

$$\tilde{h}(n_{T})\geq 1+\sum_{k=1}^{\infty}k\Pr\left(\|C_{\Lambda}(\cdot|D^{(n_{T})})\|_{L_{\psi_{2}}}\geq 2^{k}\sigma_{0}\right).$$

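Then there exists an absolute constant $c_{1}>0$ and a constant $c_{K_{0},b}>0$ such that for any $a>0$,

$$\mathbb{E}_{D^{(n)}}\left(\|\bar{g}(D^{(n)})-g^{*}\|_{L_{2}}^{2}\right)\leq(1+a)\inf_{\boldsymbol{\lambda}\in\Lambda}\left[\mathbb{E}_{D^{(n_{T})}}\left(\|\hat{g}(\boldsymbol{\lambda}|D^{(n_{T})})-g^{*}\|_{L_{2}}^{2}\right)\right]+c_{1}\left(\frac{1+a}{a}\right)^{2}\frac{J\log n_{V}}{n_{V}}K_{0}\left[\log(\Delta_{\Lambda}c_{K_{0},b}n\sigma_{0}+1)+1\right]\tilde{h}(n_{T}).$$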

As in Theorem 1, the remainder term in Theorem 2 includes a near-parametric term $O_{p}(J/n_{V})$ . So as before, adding hyper-parameters to parametric model estimation incurs a similar cost as adding parameters to the parametric model itself and adding hyper-parameters to semi- and non-parametric regression settings is relatively “cheap” and negligible asymptotically.

The differences between Theorems 1 and 2 highlight the tradeoffs made to establish an oracle inequality involving the functional $L_{2}$-error. The biggest tradeoff is that Theorem 2 adds Assumption 2. Though we can relax Assumption 2 to hold over datasets $D$ in some high-probability set, the difficulty lies in controlling the tail behavior of the fitted models over all $\Lambda$. For some model-estimation procedures, $K_{0}$ may grow with $n$ if $\lambda_{\min}$ shrinks too quickly with $n$. In this case, the remainder term may no longer shrink at a near-parametric rate. Unfortunately, requiring $\lambda_{\min}$ to shrink at an appropriate rate seems to defeat the purpose of cross-validation. So even though Theorem 2 helps us better understand cross-validation, it is limited by this assumption. In addition, the Lipschitz assumption must hold over all $\mathcal{X}$ in Theorem 2, rather than just the observed covariates. Finally, the oracle inequality in Theorem 2 is no longer sharp since the oracle risk is scaled by $1+a$ for $a>0$.

4 Penalized regression models

Now we apply our results to analyze penalized regression procedures of the form (2.1). Penalty functions encourage particular characteristics in the fitted models (e.g. smoothness or sparsity) and combining multiple penalty functions results in models that exhibit a combination of the desired characteristics. There is much interest in combining multiple penalty functions, but few methods incorporate more than two penalties due to (a) the concern that models may overfit the data when selection of many penalty parameters is required; and (b) computational issues in optimizing multiple penalty parameters. In this section, we evaluate the validity of concern (a) using the results of Section 3. We see that, contrary to popular wisdom, using split-sample validation to select multiple penalty parameters should not result in a drastic increase to the generalization error of the selected model.

In this section, we consider penalty parameter spaces of the form $\Lambda=[n^{-t_{\min}},n^{t_{\max}}]^{J}$ for $t_{\min},t_{\max}\geq 0$ . This regime works well for two reasons: one, our rates depend only quite weakly on $t_{\min}$ and $t_{\max}$ ; and two, oracle $\lambda$ -values are generally $O_{p}(n^{-\alpha})$ for some $\alpha\in(0,1)$ (van de Geer, 2000; van de Geer and Muro, 2015; Bühlmann and Van De Geer, 2011). So long as $t_{\min}>\alpha$ , $\Lambda$ will contain the optimal penalty parameter. We do not consider settings where $\lambda_{\min}$ shrinks faster than a polynomial rate since the fitted models can be ill-behaved.

In the following sections, we do an in-depth study of additive models of the form

$$g(\boldsymbol{x}^{(1)},...,\boldsymbol{x}^{(J)})=\sum_{j=1}^{J}g_{j}(\boldsymbol{x}^{(j)}).$$

We first consider parametric additive models (with potentially growing numbers of parameters) fitted with smooth and non-smooth penalties and then nonparametric additive models. We find that the Lipschitz function $C_{\Lambda}(\boldsymbol{x}|T)$ scales with $n^{O_{p}(t_{\min})}$ . Applying Theorems 1 and 2, we find that the near-parametric term in the remainder only grows linearly in $t_{\min}$ . We apply these results to various additive model estimation methods. For instance, in the generalized additive model example, we show that under minimal assumptions, the error from tuning penalty parameters is negligible compared to the error from solving the penalized regression problem with oracle penalty parameters.

4.1 Parametric additive models

Parametric additive models with model parameters $\boldsymbol{\theta}=\left(\boldsymbol{\theta}^{(1)},...,\boldsymbol{\theta}^{(J)}\right)$ have the form

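$$g(\boldsymbol{\theta})(\boldsymbol{x})=\sum_{j=1}^{J}g_{j}(\boldsymbol{\theta}^{(j)})(\boldsymbol{x}^{(j)}).$$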

We denote the training criterion for training data $T$ as

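$$L_{T}(\boldsymbol{\theta},\boldsymbol{\lambda})\coloneqq\frac{1}{2}\|y-g(\boldsymbol{\theta})\|_{T}^{2}+\sum_{j=1}^{J}\lambda_{j}P_{j}(\boldsymbol{\theta}^{(j)}).$$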

Suppose $\boldsymbol{\theta}^{*}$ is the unique minimizer of the expected loss $\|y-g(\boldsymbol{\theta})\|^{2}_{L_{2}}$ .

4.1.1 Parametric regression with smooth penalties

We begin with the simple case where the penalty functions are smooth. The following lemma states that the fitted models are Lipschitz in the penalty parameter vector. Given matrices $A$ and $B$ , $A\succeq B$ means that $A-B$ is a positive semi-definite matrix.

Lemma 1.

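Let $\Lambda\coloneqq\left[\lambda_{\min},\lambda_{\max}\right]^{J}$ where $\lambda_{\max}\geq\lambda_{\min}>0$. For a fixed training dataset $T\equiv D^{(n_{T})}$, suppose for all $\boldsymbol{\lambda}\in\Lambda$, $L_{T}\left(\boldsymbol{\theta},\boldsymbol{\lambda}\right)$ has a unique minimizer

$$\left\{\hat{\boldsymbol{\theta}}^{(j)}(\boldsymbol{\lambda}|T)\right\}_{j=1}^{J}=\operatorname*{arg\,min}_{\boldsymbol{\theta}\in\mathbb{R}^{p}}L_{T}(\boldsymbol{\theta},\boldsymbol{\lambda}).$$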

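Suppose for all $j=1,...,J$, the parametric class $g_{j}$ is $\ell_{j}$-Lipschitz in its parameters

$$\left|g_{j}(\boldsymbol{\theta}^{(1)})(\boldsymbol{x}^{(j)})-g_{j}(\boldsymbol{\theta}^{(2)})(\boldsymbol{x}^{(j)})\right|\leq\ell_{j}(\boldsymbol{x}^{(j)})\|\boldsymbol{\theta}^{(1)}-\boldsymbol{\theta}^{(2)}\|_{2}\quad\forall\boldsymbol{x}^{(j)}\in\mathcal{X}^{(j)}.$$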

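Further suppose for all $j=1,...,J$, $P_{j}(\boldsymbol{\theta}^{(j)})$ and $g_{j}(\boldsymbol{\theta}^{(j)})(\boldsymbol{x})$ are twice-differentiable with respect to $\boldsymbol{\theta}^{(j)}$ for any fixed $\boldsymbol{x}$. Suppose there exists an $m(T)>0$ such that the Hessian of the penalized training criterion at the minimizer satisfies

$$\nabla_{\boldsymbol{\theta}}^{2}L_{T}(\boldsymbol{\theta},\boldsymbol{\lambda})\Big|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda}|T)}\succeq m(T)I\quad\forall\boldsymbol{\lambda}\in\Lambda,\tag{4.23}$$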

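where $I$ is a $p\times p$ identity matrix. Then for any $\boldsymbol{\lambda}^{(1)},\boldsymbol{\lambda}^{(2)}\in\Lambda$, Assumption 1 is satisfied over the set $\mathcal{X}^{(1)}\times...\times\mathcal{X}^{(J)}$ with function

$$C_{\Lambda}(\boldsymbol{x}|T)=\frac{1}{m(T)\lambda_{\min}}\sqrt{\left(\|\epsilon\|_{T}^{2}+2C_{\Lambda}^{*}\right)\left(\sum_{j=1}^{J}\|\ell_{j}\|_{T}^{2}\ell_{j}^{2}(\boldsymbol{x}^{(j)})\right)}\tag{4.24}$$

where $C^{*}_{\Lambda}=\lambda_{\max}\sum_{j=1}^{J}P_{j}(\boldsymbol{\theta}^{(j),*})$.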

Notice that Lemma 1 requires the training criterion to be strongly convex at its minimizer. This is satisfied in the following example involving multiple ridge penalties. If (4.23) is not satisfied by a penalized regression problem, one can consider a variant of the problem where the penalty functions $P_{j}(\boldsymbol{\theta}^{(j)})$ are replaced with penalty functions $P_{j}(\boldsymbol{\theta}^{(j)})+\frac{w}{2}\|\boldsymbol{\theta}^{(j)}\|_{2}^{2}$ for a fixed $w>0$ .
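A short derivation shows why the modified penalties restore strong convexity. It relies on two assumptions that are not stated in the text above: the squared-error term and the original penalties $P_{j}$ are convex and twice differentiable in $\boldsymbol{\theta}$, and $p_{j}$ denotes the dimension of $\boldsymbol{\theta}^{(j)}$. Under these assumptions,

```latex
\nabla_{\theta}^{2}\Big[\tfrac{1}{2}\|y-g(\theta)\|_{T}^{2}
  +\sum_{j=1}^{J}\lambda_{j}\Big(P_{j}(\theta^{(j)})+\tfrac{w}{2}\|\theta^{(j)}\|_{2}^{2}\Big)\Big]
= \underbrace{\nabla_{\theta}^{2}\Big[\tfrac{1}{2}\|y-g(\theta)\|_{T}^{2}
  +\sum_{j=1}^{J}\lambda_{j}P_{j}(\theta^{(j)})\Big]}_{\succeq\, 0}
  + w\,\mathrm{diag}\big(\lambda_{1}I_{p_{1}},\dots,\lambda_{J}I_{p_{J}}\big)
\succeq w\,\lambda_{\min}\, I .
```

So, in this convex case, (4.23) holds with $m(T)=w\lambda_{\min}$ for every $\boldsymbol{\lambda}\in\Lambda$.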

Example 1 (Multiple ridge penalties).

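Let us consider fitting a linear model via ridge regression. If we can group covariates based on the similarity of their effects on the response, e.g. $\boldsymbol{x}=(\boldsymbol{x}^{(1)},...,\boldsymbol{x}^{(J)})$ where $\boldsymbol{x}^{(j)}$ is a vector of length $p_{j}$, we can incorporate this prior information by penalizing each group of covariates differently:

$$L_{T}(\boldsymbol{\theta},\boldsymbol{\lambda})\coloneqq\frac{1}{2}\left\|y-\sum_{j=1}^{J}\boldsymbol{x}^{(j)}\boldsymbol{\theta}^{(j)}\right\|_{T}^{2}+\sum_{j=1}^{J}\frac{\lambda_{j}}{2}\|\boldsymbol{\theta}^{(j)}\|_{2}^{2}.$$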

We tune the penalty parameters $\boldsymbol{\lambda}$ over the set $\Lambda$ via a training/validation split with training and validation sets $T$ and $V$ , respectively. For all the examples in this manuscript, let $\Lambda=\left[n^{-t_{\min}},1\right]^{J}$ .

Via some algebra, we can derive (4.24) in Lemma 1; the details are deferred to the Supplementary Materials. Plugging this result into Corollary 1, we find that the parametric term (3.9) in the remainder is on the order of

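$$\frac{Jt_{\min}}{n_{V}}\log\left(C_{T}^{*}n\sum_{j=1}^{J}\left(\frac{1}{n_{T}}\sum_{(x_{i},y_{i})\in T}\|x_{i}^{(j)}\|_{2}^{2}\right)\left(\frac{1}{n_{V}}\sum_{(x_{i},y_{i})\in V}\|x_{i}^{(j)}\|_{2}^{2}\right)\right)$$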

where $C^{*}_{T}=\|\epsilon\|_{T}^{2}+\sum_{j=1}^{J}\|\boldsymbol{\theta}^{*,(j)}\|_{2}^{2}.$ So we have shown in this example that if the lower bound of $\Lambda$ shrinks at the polynomial rate $n^{-t_{\min}}$ , the near-parametric term in the remainder of the oracle inequality grows only linearly in its power $t_{\min}$ .
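The multiple-ridge example can be made concrete with a short sketch. It assumes that $\|\cdot\|_{T}^{2}$ is the mean square over the training set (by analogy with $\|\cdot\|_{V}^{2}$), replaces the continuous box $\Lambda=[n^{-t_{\min}},1]^{J}$ with a log-spaced grid, and uses simulated data. The names `fit_group_ridge` and `val_loss`, the group layout, and all numeric settings are hypothetical.

```python
import itertools

import numpy as np


def fit_group_ridge(lams, groups, X, y):
    """Minimize 0.5*mean((y - X b)^2) + sum_j 0.5*lams[j]*||b[groups[j]]||^2."""
    n, p = X.shape
    penalty = np.zeros(p)
    for lam, idx in zip(lams, groups):
        penalty[idx] = lam
    # First-order condition: (X'X / n + diag(penalty)) b = X'y / n.
    return np.linalg.solve(X.T @ X / n + np.diag(penalty), X.T @ y / n)


# Hypothetical simulated data with J = 2 groups of three covariates each.
rng = np.random.default_rng(1)
n_T, n_V = 120, 60
n = n_T + n_V
groups = [np.arange(0, 3), np.arange(3, 6)]
theta_star = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
X = rng.normal(size=(n, 6))
y = X @ theta_star + rng.normal(size=n)
X_T, y_T, X_V, y_V = X[:n_T], y[:n_T], X[n_T:], y[n_T:]

# Log-spaced grid on [n^{-t_min}, 1] for each of the J penalty parameters.
t_min = 1.0
axis = np.logspace(-t_min * np.log10(n), 0.0, 8)


def val_loss(lams):
    """Validation loss 0.5 * ||y - X theta_hat(lams)||_V^2."""
    theta = fit_group_ridge(lams, groups, X_T, y_T)
    return 0.5 * np.mean((y_V - X_V @ theta) ** 2)


lam_hat = min(itertools.product(axis, repeat=len(groups)), key=val_loss)
theta_hat = fit_group_ridge(lam_hat, groups, X_T, y_T)
print(lam_hat, theta_hat)
```

The exhaustive grid has $8^{J}$ points, so it is only practical for small $J$; the gradient-based and Bayesian tuning methods cited in the introduction target the larger-$J$ regime.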

In the next example, we consider generalized additive models (GAMs) (Hastie and Tibshirani, 1990). Though GAMs are nonparametric models, it is well-known that they are equivalent to solving a finite-dimensional problem (Green and Silverman, 1993; O’Sullivan et al., 1986; Buja et al., 1989). By reformulating GAMs as parametric models instead, we can establish oracle inequalities for tuning the penalty parameters via training/validation split. Here we present an outline of the procedure; the details can be found in the Supplementary Materials.

Example 2 (Multiple Sobolev penalties).

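To fit a generalized additive model over the domain $\mathcal{X}^{J}$ where $\mathcal{X}\subseteq\mathbb{R}$, a typical setup is to solve

$$\operatorname*{arg\,min}_{\alpha_{0}\in\mathbb{R},\,g_{j}}\frac{1}{2}\sum_{i\in D^{(n_{T})}}\left(y_{i}-\alpha_{0}-\sum_{j=1}^{J}g_{j}(x_{ij})\right)^{2}+\sum_{j=1}^{J}\lambda_{j}\int_{\mathcal{X}}\left(g_{j}^{\prime\prime}(x_{j})\right)^{2}dx_{j}\tag{4.27}$$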

where the penalty function is the 2nd-order Sobolev norm. Let $\mathcal{X}=[0,1]$ for this example. Using properties of the Sobolev penalty, (4.27) can be re-expressed as a finite-dimensional problem with matrices $K_{j}$

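$$\operatorname*{arg\,min}_{\alpha_{0},\alpha_{1},\boldsymbol{\theta}}\frac{1}{2}\left\|y-\alpha_{0}\boldsymbol{1}-\boldsymbol{x}\alpha_{1}-\sum_{j=1}^{J}K_{j}\boldsymbol{\theta}^{(j)}\right\|_{T}^{2}+\frac{1}{2}\sum_{j=1}^{J}\lambda_{j}\boldsymbol{\theta}^{(j)\top}K_{j}\boldsymbol{\theta}^{(j)}.\tag{4.28}$$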

Let $X_{T}\in\mathbb{R}^{n_{T}\times J}$ be the covariates $\boldsymbol{x}$ in the training data stacked together. If $X_{T}^{\top}X_{T}$ is invertible, we can derive the closed-form solution for (4.28). From there, we can directly calculate (4.24) in Lemma 1. Plugging this result into Corollary 1, we find that the parametric term in the remainder is on the order of

[TABLE]

where $\|\cdot\|_{2}$ is the spectral norm and $h_{j}(T)$ is the smallest distance between observations of the $j$th covariate in the training data $T$ .

In particular, for $J=o(n^{1/2})$ , the smoothing spline estimate (4.27) is shown to attain the minimax optimal rate of $O_{p}(Jn^{-4/5})$ if the penalty parameters shrink at the rate of $\sim n^{-4/5}$ (Sadhanala and Tibshirani, 2017; Horowitz et al., 2006). From Corollary 1, we see that the oracle error (3.8) asymptotically dominates the additional error terms incurred from tuning the penalty parameters. Moreover, as long as we choose $\lambda_{\min}\sim n^{-\alpha}$ for any $\alpha>4/5$ , the model selected via training/validation split will also attain the minimax rate.

4.1.2 Parametric regression with non-smooth penalties

If the penalty functions are non-smooth, similar results do not necessarily hold. Nonetheless we find that for many popular non-smooth penalty functions, such as the lasso (Tibshirani, 1996) and group lasso (Yuan and Lin, 2006), the fitted functions are still smoothly parameterized by $\boldsymbol{\lambda}$ almost everywhere. To characterize such problems, we begin with the following definitions from Feng and Simon (2017):

Definition 1.

The differentiable space of function $f:\mathbb{R}^{p}\mapsto\mathbb{R}$ at $\boldsymbol{\theta}$ is

[TABLE]

Definition 2.

Let $f(\cdot,\cdot):\mathbb{R}^{p}\times\mathbb{R}^{J}\mapsto\mathbb{R}$ be a function with a unique minimizer. $S\subseteq\mathbb{R}^{p}$ is a local optimality space of $f$ over $W\subseteq\mathbb{R}^{J}$ if

[TABLE]

Using the definitions above, we can characterize the penalty parameters $\Lambda_{smooth}\subseteq\Lambda$ where the fitted functions are well-behaved.

Condition 1.

For every $\boldsymbol{\lambda}\in\Lambda_{smooth}$ , there exists a ball $B(\boldsymbol{\lambda})$ with nonzero radius centered at $\boldsymbol{\lambda}$ such that

•

For all $\boldsymbol{\lambda}^{\prime}\in B(\boldsymbol{\lambda})$ , the training criterion $L_{T}(\cdot,\boldsymbol{\lambda}^{\prime})$ is twice differentiable with respect to $\boldsymbol{\theta}$ at $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda}^{\prime}|T)$ along directions in the product space

[TABLE]

•

$\Omega^{L_{T}(\cdot,\boldsymbol{\lambda})}\left(\hat{\boldsymbol{\theta}}\left(\boldsymbol{\lambda}|T\right)\right)$ is a local optimality space for $L_{T}\left(\cdot,\boldsymbol{\lambda}\right)$ over $B(\boldsymbol{\lambda})$ .

In addition, we need nearly all penalty parameters to be in $\Lambda_{smooth}$ .

Condition 2.

$\Lambda\setminus\Lambda_{smooth}$ has Lebesgue measure zero, i.e. $\mu(\Lambda_{smooth}^{c})=0$ .

For instance, in the lasso, $\Lambda_{smooth}$ consists of the sections of the lasso path between the knots. As the knots in the lasso path are countable, the set outside $\Lambda_{smooth}$ has measure zero.
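The lasso case can be made explicit in the simplest setting. The sketch below assumes an orthonormal design ( $X^{\top}X=I$ ) and the objective $\frac{1}{2}\|y-X\boldsymbol{\theta}\|_{2}^{2}+\lambda\|\boldsymbol{\theta}\|_{1}$ , for which the solution is coordinate-wise soft-thresholding of $z=X^{\top}y$ ; both the design assumption and the function names are introduced here for illustration. Between consecutive knots the support is constant and each coefficient is linear in $\lambda$ , so the only non-smooth points are the finitely many knots.

```python
def soft_threshold(z, lam):
    """Scalar lasso solution: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthonormal(z, lam):
    """Lasso solution when X^T X = I and z = X^T y (assumed setting)."""
    return [soft_threshold(zi, lam) for zi in z]


def support(theta):
    """Indices of the nonzero coefficients."""
    return {i for i, t in enumerate(theta) if t != 0.0}


def knots(z):
    """Penalty values at which a coefficient enters or leaves the support."""
    return sorted({abs(zi) for zi in z})
```

For example, with `z = [3.0, -1.5, 0.5]` the knots are `0.5`, `1.5` and `3.0`, and the support is `{0, 1}` for every penalty strictly between `0.5` and `1.5`.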

Assuming the above conditions hold, the fitted models for non-smooth penalty functions satisfy the same Lipschitz relation as that in Lemma 1.

Lemma 2.

Let $\Lambda\coloneqq\left[\lambda_{\min},\lambda_{\max}\right]^{J}$ where $\lambda_{\max}\geq\lambda_{\min}>0$ . Suppose that for all $j=1,...,J$ , $g_{j}$ satisfies (4.22) over $\mathcal{X}^{(j)}$ . Suppose for training data $T\equiv D^{(n_{T})}$ , the penalized loss function $L_{T}\left(\boldsymbol{\theta},\boldsymbol{\lambda}\right)$ has a unique minimizer $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda}|T)$ for every $\boldsymbol{\lambda}\in\Lambda$ . Let $\boldsymbol{U}_{\lambda}$ be an orthonormal matrix with columns forming a basis for the differentiable space of $L_{T}(\cdot,\boldsymbol{\lambda})$ at $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda}|T)$ . Suppose there exists a constant $m(T)>0$ such that the Hessian of the penalized training criterion at the minimizer taken with respect to the directions in $\boldsymbol{U}_{\lambda}$ satisfies

[TABLE]

where $\boldsymbol{I}$ is the identity matrix. Suppose Conditions 1 and 2 are satisfied. Then any $\boldsymbol{\lambda}^{(1)},\boldsymbol{\lambda}^{(2)}\in\Lambda$ satisfy Assumption 1 over $\mathcal{X}^{(1)}\times...\times\mathcal{X}^{(J)}$ with $C_{\Lambda}$ defined in (4.24).

As an example, we consider multiple elastic net penalties where the penalty parameters are tuned by training/validation split and cross-validation.

Example 3 (Multiple elastic nets, training/validation split).

Suppose we would like to fit a linear model via the elastic net. If the covariates are grouped a priori, we can penalize each group differently using the following objective

[TABLE]

where $w>0$ is a fixed constant. Here we briefly sketch the process for deriving the oracle inequality when the penalty parameters are tuned via a training/validation split over $\Lambda=[n^{-t_{\min}},1]^{J}$ . Details are given in the Supplementary Materials.

First we check that all the conditions are satisfied. For this problem, the differentiable space is the subspace spanned by the non-zero elements in $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda})$ . Since the elastic net solution paths are piecewise linear (Zou and Hastie, 2003), the differentiable space is also a local optimality space. Then using a similar procedure as in Example 1, we find that the parametric term in the remainder of Corollary 1 is on the order of

[TABLE]

where $C^{*}_{T}=\|\epsilon\|_{T}^{2}+\sum_{j=1}^{J}2\|\boldsymbol{\theta}^{*,(j)}\|_{1}+w\|\boldsymbol{\theta}^{*,(j)}\|_{2}^{2}$ .

We can compare this additional error term to the risk of using an oracle penalty parameter. For the case of a single penalty parameter ( $J=1$ ), the convergence rate of using an oracle penalty parameter for the elastic net is on the order of $O_{p}(\log(p)/n)$ (Bunea et al., 2008; Hebiri et al., 2011). If we split the covariates into groups and tune the penalty parameters via training/validation split, the incurred error (4.35) is on a similar order.

Example 4 (Multiple elastic nets, cross-validation).

Now we establish an oracle inequality for the averaged version of $K$ -fold cross-validation using a similar setup as Lecué and Mitchell (2012). Suppose the noise $\epsilon$ is sub-gaussian and for simplicity, suppose $X$ is drawn uniformly from $[-1,1]^{p}$ . In order to satisfy the assumptions in Theorem 2, our fitting procedure for $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda})$ entails a thresholding operation similar to that in Lecué and Mitchell (2012). In particular, we fit parameters $\hat{\boldsymbol{\theta}}_{thres}(\boldsymbol{\lambda})$ where the $i$ -th element is

[TABLE]

where $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda})$ is the solution to (4.34) and $K_{0}^{\prime}>0$ is some fixed constant. We then find the Lipschitz factor in Lemma 3 and bound its Orlicz norm via exponential concentration inequalities. Let $\bar{\boldsymbol{\theta}}(D^{(n)})$ be the fitted parameters using the averaged version of $K$ -fold cross-validation. By Theorem 2, there is some constant $\tilde{c}>0$ , such that for any $a>0$

[TABLE]

The above example is similar to the lasso example in Lecué and Mitchell (2012); the major difference is that we consider the case where the penalty parameters are tuned over a continuous range. We are able to do this since Lemma 2 specifies a Lipschitz relation between the fitted functions and the penalty parameters. This result is relevant when $J$ is large and $\boldsymbol{\lambda}$ must be tuned via a continuous optimization procedure.

4.2 Nonparametric additive models

We now consider nonparametric additive models of the form

[TABLE]

where $\{P_{j}\}$ are penalty functionals and $\{\mathcal{G}_{j}\}$ are linear spaces of univariate functions. Let $\left\{g_{j}^{*}\right\}_{j=1}^{J}$ be the minimizer of the generalization error

[TABLE]

In the nonparametric setting, we obtain a Lipschitz relation similar to those above.

Lemma 3.

Let $\lambda_{\max}>\lambda_{\min}>0$ and $\Lambda\coloneqq[\lambda_{\min},\lambda_{\max}]^{J}$ . Suppose the penalty functions $P_{j}$ are twice Gateaux differentiable and convex over $\mathcal{G}_{j}$ . Suppose there is an $m(T)>0$ such that the second Gateaux derivative of the training criterion at $\{\hat{g}^{(n_{T})}_{j}(\boldsymbol{\lambda}|T)\}$ for all $\boldsymbol{\lambda}\in\Lambda$ satisfies

[TABLE]

where $D^{2}_{\{g_{j}\}}$ is the second Gateaux derivative taken in directions $\{g_{j}\}$ . Let $C_{\Lambda}^{*}=\lambda_{\max}\sum_{j=1}^{J}P_{j}(g^{*}_{j}).$ For any $\boldsymbol{\lambda}^{(1)},\boldsymbol{\lambda}^{(2)}\in\Lambda$ , we have

[TABLE]

A simple example that satisfies (4.40) is a penalized regression model where we fit values at each of the observed covariates, i.e. $\hat{\boldsymbol{\theta}}\in\mathbb{R}^{n}$ , and penalize these fitted values by a ridge penalty. Note that such a penalty is allowed because the response $y$ in the validation set is not used by the training procedure.

Note that since Lemma 3 verifies that Assumption 1 is satisfied over the observed covariates, it is suitable to be used in Theorem 1. However (4.41) is not a strong enough statement to be used for Theorem 2.

5 Simulations

We now present a simulation study of the generalized additive model in Example 2 to understand how the performance changes as the number of penalty parameters $J$ increases. Corollary 1 suggests that there are two opposing forces that affect the error of the fitted model. On one hand, (3.9) is linear in $J$ so increasing $J$ can increase the error. On the other hand, (3.8) decreases for larger model spaces, so increasing $J$ may decrease the error. We isolate these two behaviors via two simulation setups.

The data is generated as the sum of univariate functions $Y=\sum_{j=1}^{J}g_{j}^{*}(X_{j})+\sigma\epsilon$ , where $\epsilon$ are iid standard Gaussian random variables and $\sigma>0$ is chosen such that the signal to noise ratio is two. $X$ is drawn from a uniform distribution over $\mathcal{X}=[-2,2]^{J}$ . We fit models by minimizing (4.27). To vary the number of free penalty parameters, we constrain certain $\lambda_{j}$ to be equal while allowing others to be completely free. (For instance, for a single penalty parameter, we constrain $\lambda_{j}$ for $j=1,...,J$ to be the same value.) The penalty parameters are tuned using a training/validation split.

Simulation 1: The true function is the sum of identical sinusoids $g_{j}^{*}(x_{j})=\sin(x_{j})$ for $j=1,...,J$ . Since the univariate functions are the same, the oracle risk should be roughly constant as we increase the number of free penalty parameters. The validation loss difference

[TABLE]

should grow linearly in $J$ for this simulation setup.

Simulation 2: The true function is the sum of sinusoids with increasing frequency, $g_{j}^{*}(x_{j})=\sin(1.2^{j-4}x_{j})$ for $j=1,...,J$ . Since the Sobolev norms of $g_{j}^{*}$ increase with $j$ , we expect the penalty parameters that attain the oracle risk to be monotonically decreasing, i.e. ${\lambda}_{1}>...>{\lambda}_{J}$ . As the number of penalty parameters increases, we expect the oracle risk to shrink. If the oracle risk shrinks fast enough, performance of the selected model should improve.
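The data-generating mechanism for the two simulations can be sketched as follows. The convention used to fix $\sigma$ is an assumption made here: the signal-to-noise ratio is taken to be $\mathrm{Var}(\sum_{j}g_{j}^{*}(X_{j}))/\sigma^{2}$ , and the signal variance is computed in closed form using the independence of the uniform covariates. The function names and the seed are likewise illustrative.

```python
import math
import random


def frequencies(simulation, J=8):
    """Frequency a_j of g_j*(x) = sin(a_j * x): 1 in Simulation 1, 1.2^(j-4) in Simulation 2."""
    if simulation == 1:
        return [1.0] * J
    return [1.2 ** (j - 4) for j in range(1, J + 1)]


def component_variance(a):
    """Variance of sin(a * X) for X ~ Uniform[-2, 2]; the mean is zero by symmetry."""
    return 0.5 - math.sin(4 * a) / (8 * a)


def noise_sd(freqs, snr=2.0):
    """Noise level under the assumed convention SNR = Var(signal) / sigma^2."""
    signal_var = sum(component_variance(a) for a in freqs)
    return math.sqrt(signal_var / snr)


def generate(n, simulation, J=8, snr=2.0, seed=0):
    """Draw n observations (x, y) with x uniform on [-2, 2]^J and Gaussian noise."""
    rng = random.Random(seed)
    freqs = frequencies(simulation, J)
    sigma = noise_sd(freqs, snr)
    data = []
    for _ in range(n):
        x = [rng.uniform(-2, 2) for _ in range(J)]
        signal = sum(math.sin(a * xj) for a, xj in zip(freqs, x))
        data.append((x, signal + sigma * rng.gauss(0, 1)))
    return data
```

If the signal-to-noise ratio is instead defined on the standard-deviation scale, only `noise_sd` needs to change.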

For both simulations, we use $J=8$ . Each simulation was replicated forty times with 200 training and 200 validation samples. We consider $k=1,2,4,8$ free penalty parameters by structuring the penalty parameters in a nested fashion: for each $k$ , we constrained $\{\lambda_{8\ell/k+j}\}_{j=1,...,8/k}$ to be equal for $\ell=0,...,k-1$ . Penalty parameters were tuned using nlm in R with initializations at $\{\vec{1},0.1\times\vec{1},0.01\times\vec{1}\}$ . We did not use grid-search since it is computationally intractable for large numbers of penalty parameters. Multiple initializations were required since the validation loss is not convex in the penalty parameters.
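The nested constraint structure amounts to tying consecutive blocks of $J/k$ penalties to a common free value. A minimal sketch of that mapping and of the three starting points is given below; the function names are hypothetical, and the optimizer itself (nlm in the original study) is not reproduced.

```python
def expand_penalties(free, J=8):
    """Map k free penalty parameters to J penalties, tying consecutive blocks of size J / k."""
    k = len(free)
    if k == 0 or J % k != 0:
        raise ValueError("the number of free parameters must divide J")
    block = J // k
    return [free[ell] for ell in range(k) for _ in range(block)]


def initializations(k):
    """Starting points 1, 0.1 and 0.01 times the all-ones vector of length k."""
    return [[scale] * k for scale in (1.0, 0.1, 0.01)]
```

A local optimizer would be run over the $k$ free values from each starting point, with `expand_penalties` applied before every model fit and the run with the smallest validation loss retained.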

As expected, the validation loss difference increases with the number of penalty parameters in Simulation 1 (Figure 1(a)). To see if our oracle inequalities match the empirical results, we regressed the logarithm of the validation loss difference against the logarithm of the number of penalty parameters. We fit the model using simulation results with at least two penalty parameters as the data is highly skewed for the single penalty parameter case. We estimated a slope of 1.00 (standard error 0.15), which suggests that the validation loss difference grows linearly in the number of penalty parameters. Interestingly, including the single parameter case gives us a slope of 1.45 (standard error 0.14). This suggests that our oracle inequality might not be tight for the single penalty parameter case.
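The log-log regression works because a power law $\text{difference}=c\,k^{\beta}$ becomes the line $\log(\text{difference})=\log c+\beta\log k$ , so a fitted slope near one indicates linear growth in $k$ . The sketch below computes that least-squares slope; the function name is hypothetical, and any inputs passed to it in this illustration are synthetic values, not the simulation results reported above.

```python
import math


def loglog_slope(num_params, loss_diff):
    """Least-squares slope of log(loss_diff) regressed on log(num_params)."""
    xs = [math.log(k) for k in num_params]
    ys = [math.log(d) for d in loss_diff]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx
```

On exactly linear synthetic input such as `loglog_slope([2, 4, 8], [0.02, 0.04, 0.08])` the slope equals one up to floating-point error.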

For Simulation 2, the validation loss of the selected model decreases as the number of penalty parameters increases. As suggested in Figure 1(b), the validation loss of the selected model decreases because the oracle risk is decreasing at a faster rate than the rate at which the additional error (3.9) grows.

These simulation results suggest that adding more hyper-parameters can improve model estimates. Having a separate penalty parameter allows GAMs to fit components with differing smoothness. However if we know a priori that the components have the same smoothness, then it is best to use a single penalty parameter.

6 Discussion

In this manuscript, we have characterized the generalization error of split-sample procedures that tune multiple hyper-parameters. If the estimated models are Lipschitz in the hyper-parameters, the generalization error of the selected model is upper bounded by a combination of the oracle risk and a near-parametric term in the number of hyper-parameters. These results show that adding hyper-parameters can decrease the generalization error of the selected model if the oracle risk decreases by a sufficient amount. In the semi- or non-parametric setting, the error incurred from tuning hyper-parameters is dominated by the oracle risk asymptotically; adding hyper-parameters has a negligible effect on the generalization error of the selected model. In the parametric setting, the error incurred from tuning hyper-parameters is on the same order as the oracle error; one should be careful about adding hyper-parameters, though they are not more “costly” than model parameters.

We also showed that many penalized regression examples satisfy the Lipschitz condition so our theoretical results apply. This implies that fitting models with multiple penalties and penalty parameters can be desirable, rather than the usual case with one or two penalty parameters.

One drawback of our theoretical results is that we have assumed that the selected hyper-parameter is a global minimizer of the validation loss. Unfortunately this is not achievable in practice since the validation loss is not convex with respect to the hyper-parameters. This problem is exacerbated when there are many hyper-parameters since it is computationally infeasible to perform an exhaustive grid-search. We hope to address this question in future research.

Appendix A Supplementary Materials

We will use the following notation: for functions $f$ and $g$ and a dataset $D$ with $m$ samples, we denote the inner product of $f$ and $g$ at covariates $D$ as $\langle f,g\rangle_{D}=\frac{1}{m}\sum_{(x_{i},y_{i})\in D}f(x_{i},y_{i})g(x_{i},y_{i})$ .
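As a small illustration of this notation, the sketch below evaluates the empirical inner product on a dataset stored as a list of $(x,y)$ pairs. The induced norm $\|f\|_{D}=\sqrt{\langle f,f\rangle_{D}}$ is an assumption about how the norm is tied to the inner product, and the function names are hypothetical.

```python
import math


def inner_product(f, g, data):
    """Empirical inner product <f, g>_D = (1/m) * sum of f(x, y) * g(x, y) over D."""
    return sum(f(x, y) * g(x, y) for x, y in data) / len(data)


def empirical_norm(f, data):
    """Induced norm ||f||_D = sqrt(<f, f>_D) (assumed convention)."""
    return math.sqrt(inner_product(f, f, data))
```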

A.1 A single training/validation split

Theorem 1 is a special case of Theorem 3, which applies to general model-estimation procedures. The proof is based on the so-called “basic inequality” below.

Lemma 4.

For any $\tilde{\boldsymbol{\lambda}}\in\tilde{\Lambda}$ , we have

[TABLE]

Proof.

The desired result can be attained by rearranging the definition of $\hat{\boldsymbol{\lambda}}$

[TABLE]

∎

We are therefore interested in bounding the empirical process term in (A.43). A common approach is to use a measure of complexity of the function class. For a single training/validation split, where we treat the training set as fixed, we only need to consider the complexity of the fitted models from the model-selection procedure

[TABLE]

This model class can be considerably less complex compared to the original function class $\mathcal{G}$ , such as the special case in Theorem 1 where we suppose $\mathcal{G}(T)$ is Lipschitz. For this proof, we will use metric entropy as a measure of model class complexity. We recall its definition below.

Definition 3.

Let $\mathcal{F}$ be a function class. Let the covering number $N(u,\mathcal{F},\|\cdot\|)$ be the size of the smallest $u$ -cover of $\mathcal{F}$ with respect to the norm $\|\cdot\|$ . The metric entropy of $\mathcal{F}$ is defined as the log of the covering number:

[TABLE]

We will bound the empirical process term using the following Lemma, which is a simplification of Corollary 8.3 in van de Geer (2000).

Lemma 5.

Suppose $D^{(m)}=\{x_{1},...,x_{m}\}$ are fixed and $\epsilon_{1},...,\epsilon_{m}$ are independent random variables with mean zero and uniformly sub-gaussian with parameters $b$ and $B$ . Suppose the model class $\mathcal{F}$ satisfies $\sup_{f\in\mathcal{F}}\|f\|_{D^{(m)}}\leq R$ and

[TABLE]

There is a constant $a>0$ dependent only on $b$ and $B$ such that for all $\delta>0$ satisfying

[TABLE]

we have

[TABLE]

We are now ready to prove the oracle inequality. It uses a standard peeling argument.

Theorem 3.

Consider a set of hyper-parameters $\Lambda$ . Let training data $T$ be fixed, as well as the covariates of the validation set $X_{V}$ . Let the oracle risk be denoted

[TABLE]

Suppose independent random variables $\epsilon_{i}$ for validation set $V$ have expectation zero and are uniformly sub-Gaussian with parameters $b$ and $B$ . Suppose there is a function $\mathcal{J}(\cdot|T):\mathbb{R}\mapsto\mathbb{R}$ and constant $r>0$ such that

[TABLE]

Also, suppose $\mathcal{J}\left(u|T\right)/u^{2}$ is non-increasing in $u$ for all $u>r$ .

Then there is a constant $c>0$ only depending on $b$ and $B$ such that for all $\delta$ satisfying

[TABLE]

we have

[TABLE]

Proof.

Consider any $\tilde{\boldsymbol{\lambda}}\in\tilde{\Lambda}$ . We will use the simplified notation $\hat{g}(\hat{\boldsymbol{\lambda}})\coloneqq\hat{g}^{(n_{T})}(\hat{\boldsymbol{\lambda}}|T)$ and $\hat{g}(\tilde{\boldsymbol{\lambda}})\coloneqq\hat{g}^{(n_{T})}(\tilde{\boldsymbol{\lambda}}|T)$ . In addition, the following probabilities are all conditional on $X_{V}$ and $T$ but we leave them out for readability.

[TABLE]

where we applied the basic inequality (A.43) in the last line. Each summand in (A.53) can be bounded by splitting the event into the cases where either $2^{2s+2}\delta^{2}$ or $2\left|\left\langle\hat{g}(\tilde{\boldsymbol{\lambda}})-\hat{g}(\hat{\boldsymbol{\lambda}}),\hat{g}(\tilde{\boldsymbol{\lambda}})-g^{*}\right\rangle_{V}\right|$ is larger. Splitting up the probability and applying Cauchy Schwarz gives us the following bound for (A.51)

[TABLE]

We can bound both (A.55) and (A.56) using Lemma 5. For our choice of $\delta$ in (A.49), there is some constant $a>0$ dependent only on $b$ and $B$ such that (A.55) is bounded above by

[TABLE]

In addition, our choice of $\delta$ from (A.49) and our assumption that $\mathcal{J}\left(u|T\right)/u^{2}$ is non-increasing imply that the condition in Lemma 5 is satisfied for all $s=0,1,...,\infty$ simultaneously. Hence for all $s=0,1,...,\infty$ , we have

[TABLE]

Putting this all together, we have that there is a constant $c$ such that (A.51) is bounded above by

[TABLE]

∎

We can apply Theorem 3 to get Theorem 1. Before proceeding, we determine the entropy of $\mathcal{G}(T)$ when the functions are Lipschitz in the hyper-parameters.

Lemma 6.

Let $\Lambda=[\lambda_{\min},\lambda_{\max}]^{J}$ where $\lambda_{\min}\leq\lambda_{\max}$ . Suppose $\mathcal{G}(T)$ is Lipschitz with function $C(\cdot|T)$ over $\boldsymbol{\lambda}$ . Then the entropy of $\mathcal{G}(T)$ with respect to $\|\cdot\|$ is

[TABLE]

Proof.

Using a slight variation of the proof for Lemma 2.5 in van de Geer (2000), we can show

[TABLE]

Under the Lipschitz assumption, a $\delta$ -cover for $\Lambda$ is a $\|C(\cdot|T)\|\delta$ -cover for $\mathcal{G}(T)$ . The covering number for $\mathcal{G}(T)$ with respect to $\|\cdot\|$ is bounded by the covering number for $\Lambda$ as follows

[TABLE]

∎
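The covering argument can be checked numerically on a toy case. The sketch below builds a grid $\delta$ -cover of $[\lambda_{\min},\lambda_{\max}]^{J}$ in the Euclidean norm and pushes it through a hypothetical map `toy_fit` that is 1-Lipschitz in $\boldsymbol{\lambda}$ ; the grid construction, the choice of norm on $\Lambda$ , and the toy map are assumptions made here and need not match the constants in the lemma.

```python
import itertools
import math


def grid_cover(lam_min, lam_max, J, delta):
    """Centers of a delta-cover of [lam_min, lam_max]^J in the Euclidean norm."""
    radius = delta / math.sqrt(J)  # per-coordinate radius
    per_axis = max(1, math.ceil((lam_max - lam_min) / (2 * radius)))
    half = (lam_max - lam_min) / (2 * per_axis)  # half-width of each cell, at most radius
    axis = [lam_min + (2 * i + 1) * half for i in range(per_axis)]
    return list(itertools.product(axis, repeat=J))


def distance_to_cover(point, centers):
    """Euclidean distance from a point to its nearest center."""
    return min(math.dist(point, c) for c in centers)


def toy_fit(lam):
    """Hypothetical fitted model, 1-Lipschitz in lam on the nonnegative orthant."""
    return tuple(1.0 / (1.0 + l) for l in lam)
```

The number of centers is at most $\lceil(\lambda_{\max}-\lambda_{\min})\sqrt{J}/(2\delta)\rceil^{J}$ , so `math.log(len(centers))` grows linearly in $J$ and logarithmically in $1/\delta$ , which is the near-parametric behavior that drives the remainder term.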

A.1.1 Proof for Theorem 1

Proof.

By Lemma 6, we have

[TABLE]

If we restrict $R>n^{-1}$ , then for an absolute constant $c$ , we have

[TABLE]

Applying Theorem 3, we get our desired result. ∎

A.2 Cross-validation

In order to obtain an oracle inequality for the averaged version of cross-validation, we need to extend Theorem 3.5 in Lecué and Mitchell (2012). Let the class of fitted functions for given training data $T$ be denoted

[TABLE]

In Lecué and Mitchell (2012), they assume that there is a function $\mathcal{J}$ that uniformly bounds the size of the class $\mathcal{G}(T)$ for any training data $T$ . However the complexity of $\mathcal{G}(T)$ depends on training data – for instance, if there is a lot of noise in the training data, the size of $\mathcal{G}(T)$ can be very high. In our extension, we allow the function $\mathcal{J}$ to depend on the training data.

Throughout this section, we use Talagrand’s gamma function (Talagrand, 2005) to characterize the size of a function class. We present it below as it will be used later on.

Definition 4.

For metric space $(T,d)$ and $\alpha\geq 0$ , define

[TABLE]

where the infimum is taken over all sequences $\{T_{s}:s\in\mathbb{N},T_{s}\subseteq T,|T_{s}|\leq 2^{2^{s}}\}$ . (Here, $|A|$ denotes the cardinality of the set $A$ .)

We begin with some notation. Suppose we have a measurable space $(\mathcal{Z},\mathcal{T})$ where we observe $Z=(X,y)$ random variables with values in $\mathcal{Z}$ . Let $\mathcal{G}$ be a class of measurable functions from $\mathcal{Z}\mapsto\mathbb{R}$ ; the model-estimation procedure selects functions from the class $\mathcal{G}$ . In contrast to the main manuscript, we will consider a very general setting. In particular, the noise $\epsilon=y-E[y|X=x]$ is not necessarily independent of $X$ . In addition, we consider a general loss function $Q:\mathcal{Z}\times\mathcal{G}\mapsto\mathbb{R}$ (rather than solely the least squares loss). Define the risk function $R(g)$ as the expected loss $\mathbb{E}Q(Z,g)$ and suppose the risk function is convex. Let $\bar{g}^{(n)}(D^{(n)})$ denote the averaged version of cross-validation and $g^{*}$ denote the minimizer of the risk function over $\mathcal{G}$ .

In this more general setting, we require a more general version of Assumption 2:

Assumption 3.

There exist constants $K_{0},K_{1}\geq 0$ and $\kappa\geq 1$ such that for any $m\in\mathbb{N}$ and any dataset $D^{(m)}$ ,

[TABLE]

Our theorem relies on the basic inequality established in Lemma 3.1 in Lecué and Mitchell (2012). We reproduce it here for convenience. Henceforth, $c_{i}>0$ denotes absolute constants, which are not necessarily the same even when they share a subscript.

Lemma 7.

For any constant $a>0$ , we have the following inequality

[TABLE]

where $P_{n_{V}}=1/n_{V}\sum_{i=n_{T}+1}^{n}\delta_{Z_{i}}$ is the empirical probability measure on $\{Z_{n_{T}+1},...,Z_{n}\}$ .

We need to bound the supremum of the second term on the right hand side, which is a shifted empirical process term. Lemma 3.4 in Lecué and Mitchell (2012) already bounds the shifted empirical process term. However to extend their result to our purposes, we restate it to clarify the conditional dependencies. This allows us to introduce two new functions $h$ and $J_{\delta}$ that will be used later on.

Lemma 8.

Let $\mathcal{Q}(D^{(m)})\equiv\left\{Q(\lambda|D^{(m)}):\lambda\in\Lambda\right\}$ and $\mathcal{Q}\equiv\cup_{m\in\mathbb{N}}\cup_{D^{(m)}}\mathcal{Q}(D^{(m)})$ . Suppose there exists $C_{1}>0$ and an increasing function $G(\cdot)$ such that $\forall Q\in\mathcal{Q}$ ,

[TABLE]

Let $n_{T},n_{V}\in\mathbb{N}$ . Suppose there exists a function $h$ that maps training data $D^{(n_{T})}$ to $\mathbb{R}^{+}$ , a function $J_{\delta}:\mathbb{R}^{+}\mapsto\mathbb{R}^{+}$ indexed by $\delta>0$ , and a constant $w_{\min}>0$ such that for any dataset $D^{(n_{T})}$ and any $w\geq w_{\min}$ ,

[TABLE]

where $\mathcal{Q}_{w}^{L_{2}}(D^{(n_{T})})\equiv\left\{Q\in\mathcal{Q}(D^{(n_{T})}):\|Q(Z)\|_{L_{2}}\leq G(w)\right\}$ .

Then there exist absolute constants $L,c>0$ such that for all $w\geq w_{\min}$ and all $u\geq 1$ ,

[TABLE]

Now that we have established a concentration inequality for the function class $\{Q\in\mathcal{Q}(D^{(n_{T})}):PQ\leq w\}$ , we need to aggregate the results to establish a concentration inequality for the function class $\mathcal{Q}(D^{(n_{T})})$ . Again, we use Lemma 3.2 in Lecué and Mitchell (2012) but restate it using our new functions $h$ and $J_{\delta}$ .

Lemma 9.

Let $a>0$ . Let $\mathcal{Q}(D^{(m)})\equiv\left\{Q(\lambda|D^{(m)}):\lambda\in\Lambda\right\}$ be a set of measurable functions. For all $m\in\mathbb{N}$ and any dataset $D^{(m)}$ , suppose $\mathbb{E}Q(Z)\geq 0$ for all $Q\in\mathcal{Q}\left(D^{(m)}\right)$ .

Suppose for any $n_{T},n_{V}\in\mathbb{N}$ and dataset $D^{(n_{T})}$ there exist absolute constants $L,c>0$ such that for all $w\geq w_{\min}$ and for all $u\geq 1$ ,

[TABLE]

For any $\delta>0$ , suppose $J_{\delta}$ is strictly increasing and its inverse is strictly convex. Let $\psi_{\delta}$ be the convex conjugate of $J_{\delta}^{-1}$ , i.e. $\psi_{\delta}(u)=\sup_{v>0}\{uv-J_{\delta}^{-1}(v)\}$ for all $u>0$ . Assume there is an $r\geq 1$ such that $x>0\mapsto\psi_{\delta}(x)/x^{r}$ decreases. For all $q>1$ and $u\geq 1$ , define

[TABLE]

Then there exists a constant $L_{1}$ that only depends on $L$ such that for every $u\geq 1$ ,

[TABLE]

Moreover, assume that $\psi_{\delta}(x)$ is an increasing function in $x$ such that $\psi_{\delta}(\infty)=\infty$ . Then there exists a constant $c_{1}$ that depends only on $L$ and $c$ such that

[TABLE]
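The convex conjugate appearing in this lemma, $\sup_{v>0}\{uv-J_{\delta}^{-1}(v)\}$ , can be approximated numerically when no closed form is at hand. The sketch below does so on a truncated grid; the grid, the truncation point `v_max`, and the toy choice $J^{-1}(v)=v^{2}$ (for which the conjugate is $u^{2}/4$ ) are assumptions for illustration and are not the $J_{\delta}$ used later in the proof.

```python
def convex_conjugate(j_inv, u, v_max=10.0, num=100000):
    """Approximate sup over v in (0, v_max] of u * v - j_inv(v) on an evenly spaced grid."""
    best = float("-inf")
    for i in range(1, num + 1):
        v = v_max * i / num
        best = max(best, u * v - j_inv(v))
    return best
```

The approximation is only meaningful when the maximizer lies inside $(0,v_{\max}]$ ; for $J^{-1}(v)=v^{2}$ the maximizer is $v=u/2$ , so `v_max=10.0` suffices for $u\leq 20$ .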

Finally, we are ready to bound the expectation of the shifted empirical process term in (A.71). We accomplish this via a simple chaining argument; we omit its proof as this is a standard application of the chaining argument.

Lemma 10.

Consider any $a>0$ . Suppose there exists a constant $c_{1}$ such that for any $n_{T},n_{V}\in\mathbb{N}$ , $\delta>0$ , and $q>1$ , (A.74) holds. Then for any $\sigma>0$ , we have

[TABLE]

Putting Lemmas 7 and 10 together, we have the following result.

Theorem 4.

Consider a set of hyper-parameters $\Lambda$ . Consider a loss function $Q:(\mathcal{Z},\mathcal{G})\mapsto\mathbb{R}$ with convex risk function $R:\mathcal{G}\mapsto\mathbb{R}$ . Let

[TABLE]

Suppose Assumption 3 holds. Suppose there is a $w_{\min}>0$ and functions $h:{\mathcal{Z}}^{(n_{T})}\mapsto\mathbb{R}$ and $\mathcal{J}_{\delta}:\mathbb{R}\mapsto\mathbb{R}$ such that for all $w\geq w_{\min}$ ,

[TABLE]

where $\mathcal{Q}_{w}=\{Q\in\mathcal{Q}:\|Q\|_{L_{2}}\leq w^{1/2\kappa}\}$ . Moreover, suppose that for all $\delta>0$ , $\mathcal{J}_{\delta}$ is a strictly increasing function and $\mathcal{J}_{\delta}^{-1}(\epsilon)$ is strictly convex. Let the convex conjugate of $\mathcal{J}^{-1}_{\delta}$ be denoted $\psi_{\delta}$ . Suppose $\psi_{\delta}(x)$ increases in $x$ , $\psi_{\delta}(\infty)=\infty$ , and there exists $r\geq 1$ such that $\psi_{\delta}(x)/x^{r}$ decreases.

Consider any $\sigma>0$ . Then there is a constant $c>0$ such that for every $a>0$ and $q>1$ , the following inequality holds

[TABLE]

where $\tilde{\psi}_{q,\delta}(u)=\psi_{\delta}\left(\frac{2q^{r+1}(1+a)u}{a\sqrt{n_{V}}}\right)\vee w_{\min}$ for all $u>0$ .

Of course, this theorem is only useful if we can show that $h(D^{(n_{T})})$ is bounded with high probability. For instance, in an example in the main manuscript, we show that $h(D^{(n_{T})})$ has sub-exponential tails; so the latter term in (A.76) is well-controlled.

We now apply Theorem 4 to prove Theorem 2. Recall that Theorem 2 concerns the squared error loss $Q((x,y),g)=(y-g(x))^{2}$ and only considers model-estimation methods where the estimated functions are Lipschitz in the hyper-parameters. First we need the following lemma that describes the relationship between Lipschitz functions.

Lemma 11.

Suppose the same conditions as Theorem 4. Suppose Assumptions 1 and 2 hold. Also suppose that $\|\epsilon\|_{L_{\psi_{2}}}=b<\infty$ . Define $\mathcal{Q}_{w}^{L_{2}}=\{g^{*}-\hat{g}(\boldsymbol{\lambda}|D^{(n_{T})}):P(g^{*}-\hat{g}(\boldsymbol{\lambda}|D^{(n_{T})}))^{2}<w\}$ for $w>0$ . Then there is an absolute constant $c_{0}>0$ such that

[TABLE]

then we also have

[TABLE]

for a constant $c_{K_{0},b}>0$ that only depends on $K_{0}$ and $b$ .

Proof.

Let us first consider a general norm $\|\cdot\|$ such that for any random variables $X,Y$ , we have $\|XY\|\leq\|X\|_{*}\|Y\|_{*}$ . Then for all $\boldsymbol{\lambda}\in\Lambda$ such that $P(g^{*}-\hat{g}(\boldsymbol{\lambda}|D^{(n_{T})}))^{2}\leq w$ , we have

[TABLE]

For $\|\cdot\|=\|\cdot\|_{L_{2}}$ , the $L_{2}$ norm is its own dual norm so (A.83) reduces to

[TABLE]

for an absolute constant $c_{0}>0$ .

For $\|\cdot\|=\|\cdot\|_{L_{\psi_{1}}}$ , the dual of the $L_{\psi_{1}}$ norm is $L_{\psi_{2}}$ . Thus applying Assumption 2 and the fact that $\|\epsilon\|_{L_{\psi_{2}}}=b<\infty$ , (A.83) reduces to

[TABLE]

∎

Talagrand’s gamma function of a class $T$ can be bounded by Dudley’s integral

[TABLE]

(Talagrand, 2005). Combining the above bound with Lemma 11 gives the following lemma.

Lemma 12.

Suppose Assumptions 1 and 2 hold. Suppose $\|\epsilon\|_{L_{\psi_{2}}}=b<\infty$ . For $\Lambda$ , let $\Delta_{\Lambda}=(\lambda_{\max}-\lambda_{\min})\vee 1$ . Let $w>0$ and let $\mathcal{Q}_{w}^{L_{2}}(D^{(n_{T})})$ be defined as before.

Then there exist absolute constants $c_{0},c_{1}>0$ and a constant $c_{K_{0},b}>0$ such that

[TABLE]

Proof.

By definition of $\mathcal{Q}_{w}^{L_{2}}$ , we have $\operatorname{Diam}\left(\mathcal{Q}_{w}^{L_{2}}(D^{(n_{T})}),\|\cdot\|_{L_{2}}\right)=2\sqrt{w}.$ Using Lemma 11 and (A.84), we have

[TABLE]

Using very similar logic, we now bound the $\gamma_{1}$ function. First we bound the diameter of $\mathcal{Q}_{w}^{L_{2}}$ with respect to the norm $\|\cdot\|_{L_{\psi_{1}}}$ :

[TABLE]

Thus

[TABLE]

∎

To apply Theorem 4, we need to define $h$ and $J_{\delta}$ so that (A.75) is satisfied. Based on the lemma above, we see that it suffices to let

[TABLE]

and

[TABLE]

Finally using the results above, we can prove Theorem 2.

Proof for Theorem 2.

We now apply Theorem 4 to our Lipschitz case. From (A.91), we find that Assumption 3 is satisfied. We have defined $h$ and $J_{\delta}$ so that (A.75) is satisfied for all $w\geq 1/n$ . Moreover, $\mathcal{J}_{\delta}(w)$ is strictly increasing and concave in $w$ . This implies that $\mathcal{J}_{\delta}^{-1}$ is strictly convex. Via algebra, we find that the convex conjugate of $\mathcal{J}_{\delta}^{-1}$ is

[TABLE]

Now let us determine $\tilde{\psi}_{q,\delta}(1/q)$ as $q\rightarrow 1$ . We have

[TABLE]

So the summation in (A.76) reduces to

[TABLE]

Taking $q\rightarrow 1$ in (A.76) and plugging in (A.101) to Theorem 4, we get our desired result. ∎

A.3 Penalized regression for additive models

We now show that penalized regression problems for additive models satisfy the Lipschitz condition.

A.3.1 Proof for Lemma 1

Proof.

We will use the notation $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda})\coloneqq\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda}|T)$ . By the gradient optimality conditions, we have

[TABLE]

After implicitly differentiating with respect to $\boldsymbol{\lambda}$ , we have

[TABLE]

From the product rule and chain rule, we can then write the system of equations in (A.103) as

[TABLE]

We can bound the norm of the second term in (A.104) by rearranging (A.102) and using the Cauchy-Schwarz inequality:

[TABLE]

Since $g_{j}$ is Lipschitz by assumption, then

[TABLE]

Also, by the definition of $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda})$ , we have

[TABLE]

Hence

[TABLE]

Plugging in the results from above and using the assumption that the Hessian of the objective function has a minimum eigenvalue of $m(T)$ , we have for all

[TABLE]

Since the norm of the gradient is bounded, $\hat{\boldsymbol{\theta}}^{(j)}(\boldsymbol{\lambda})$ must be Lipschitz:

[TABLE]

Finally we combine the above results to get

[TABLE]

∎
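The implicit-differentiation step in this proof can be sanity-checked numerically in the simplest smooth case. The sketch below is not from the paper. It assumes a ridge objective $\frac{1}{2}\|y-X\boldsymbol{\theta}\|_{2}^{2}+\frac{1}{2}\sum_{j}\lambda_{j}\|\boldsymbol{\theta}^{(j)}\|_{2}^{2}$ (the scaling, the helper names `ridge_fit` and `ridge_jacobian`, and the simulated data are illustrative assumptions), differentiates the gradient optimality condition with respect to $\boldsymbol{\lambda}$, and compares the result with central finite differences.

```python
import numpy as np

def ridge_fit(X, y, lam, blocks):
    """Minimize 0.5*||y - X theta||^2 + 0.5*sum_j lam[j]*||theta[blocks[j]]||^2.

    Returns the minimizer and the Hessian of the penalized objective.
    """
    d = np.zeros(X.shape[1])
    for j, idx in enumerate(blocks):
        d[idx] = lam[j]
    H = X.T @ X + np.diag(d)
    return np.linalg.solve(H, X.T @ y), H

def ridge_jacobian(X, y, lam, blocks):
    """Jacobian d theta_hat / d lam from implicit differentiation.

    Differentiating the gradient condition gives H * d theta / d lam_j = -grad P_j(theta_hat),
    where grad P_j is theta_hat restricted to block j (zero elsewhere).
    """
    theta, H = ridge_fit(X, y, lam, blocks)
    jac = np.zeros((X.shape[1], len(blocks)))
    for j, idx in enumerate(blocks):
        rhs = np.zeros(X.shape[1])
        rhs[idx] = theta[idx]
        jac[:, j] = -np.linalg.solve(H, rhs)
    return jac

# Simulated data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = rng.normal(size=40)
blocks = [np.arange(0, 3), np.arange(3, 6)]
lam = np.array([0.5, 2.0])

jac = ridge_jacobian(X, y, lam, blocks)

# Central finite differences in each penalty parameter.
eps = 1e-5
fd = np.zeros_like(jac)
for j in range(len(lam)):
    step = np.zeros_like(lam)
    step[j] = eps
    plus = ridge_fit(X, y, lam + step, blocks)[0]
    minus = ridge_fit(X, y, lam - step, blocks)[0]
    fd[:, j] = (plus - minus) / (2 * eps)

max_abs_diff = np.max(np.abs(jac - fd))
print("max |implicit - finite difference| =", max_abs_diff)
```

Because the Hessian here has minimum eigenvalue at least $\min_{j}\lambda_{j}$, the solve is well conditioned, which mirrors the role of $m(T)$ in the bound on the gradient norm.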

A.3.2 Proof for Lemma 2

Before proving Lemma 2, we need to introduce some notation. Let $\mathcal{L}(\boldsymbol{\lambda}^{(1)},\boldsymbol{\lambda}^{(2)})$ be the line segment connecting $\boldsymbol{\lambda}^{(1)}$ and $\boldsymbol{\lambda}^{(2)}$ . Let $\mu_{1}(z)$ be the 1-dimensional Lebesgue measure in the direction of $z$ (so if $z$ is a continuous line segment, $\mu_{1}(z)=\|z\|_{2}$ ; if $z$ is composed of multiple line segments $z_{i}$ , then $\mu_{1}(z)=\sum_{i}\mu_{1}(z_{i})$ ).

Before proving the Lipschitz property over all of $\Lambda$ , we show that the fitted function is Lipschitz over $\Lambda_{smooth}$ . For convenience, define $\Lambda_{smooth}^{c}\coloneqq\Lambda\setminus\Lambda_{smooth}$ .

Lemma 13.

Suppose that $g_{j}(\boldsymbol{\theta})(x)$ satisfies the Lipschitz condition in Lemma 1. Let $T\equiv D^{(n_{T})}$ be a fixed set of training data. Suppose the penalized loss function $L_{T}\left(\boldsymbol{\theta},\boldsymbol{\lambda}\right)$ has a unique minimizer $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda}|T)$ for every $\boldsymbol{\lambda}\in\Lambda$ . Let $\boldsymbol{U}_{\lambda}$ be an orthonormal matrix with columns forming a basis for the differentiable space of $L_{T}(\cdot,\boldsymbol{\lambda})$ at $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda}|T)$ . Suppose there exists a constant $m(T)>0$ such that the Hessian of the penalized training criterion at the minimizer taken with respect to the directions in $\boldsymbol{U}_{\lambda}$ satisfies

[TABLE]

where $\boldsymbol{I}$ is the identity matrix. Suppose Condition 1 is satisfied by some $\Lambda_{smooth}\subseteq\Lambda$ . Define

[TABLE]

Then any $(\boldsymbol{\lambda}^{(1)},\boldsymbol{\lambda}^{(2)})\in\Lambda_{ext}^{c}$ satisfies (4.24).

Proof.

From Condition 1, every point $\boldsymbol{\lambda}\in\Lambda_{smooth}$ is the center of a ball $B(\boldsymbol{\lambda})$ with nonzero radius where the differentiable space within $B(\boldsymbol{\lambda})$ is constant.

Now consider any $(\boldsymbol{\lambda^{(1)}},\boldsymbol{\lambda^{(2)}})\in\Lambda_{ext}^{c}$ . By (A.118), there must exist a countable set of points $\cup_{i=1}^{\infty}\boldsymbol{\ell}^{(i)}\subset\mathcal{L}(\boldsymbol{\lambda^{(1)}},\boldsymbol{\lambda^{(2)}})$ where $\cup_{i=1}^{\infty}\boldsymbol{\ell}^{(i)}\subset\Lambda_{smooth}$ , $\boldsymbol{\lambda}^{(1)},\boldsymbol{\lambda}^{(2)}\in\cup_{i=1}^{\infty}\boldsymbol{\ell}^{(i)}$ , and the union of their differentiable neighborhoods covers $\mathcal{L}(\boldsymbol{\lambda^{(1)}},\boldsymbol{\lambda^{(2)}})$ entirely:

[TABLE]

Consider the intersections of boundaries of the differentiable neighborhoods with the line segment:

[TABLE]

Every point $p\in P$ can be expressed as $\alpha_{p}\boldsymbol{\lambda^{(1)}}+(1-\alpha_{p})\boldsymbol{\lambda^{(2)}}$ for some $\alpha_{p}\in[0,1]$ . We can order the points in $P$ by increasing $\alpha_{p}$ to get the sequence $\boldsymbol{p}^{(1)},\boldsymbol{p}^{(2)},...$ .

By Condition 1, the differentiable space of the training criterion is constant over $\mathcal{L}\left(\boldsymbol{p}^{(i)},\boldsymbol{p}^{(i+1)}\right)$ since each of these sub-segments is contained in some $B(\boldsymbol{\ell}^{(i)})$ for $i\in\mathbb{N}$ . Moreover, the differentiable space over the interior of line segment $\mathcal{L}\left(\boldsymbol{p^{(i)},p^{(i+1)}}\right)$ can be decomposed as the product of differentiable spaces, which we denote as

[TABLE]

By Condition 1, (A.120) is also a local optimality space. Let $U^{(i,j)}$ be an orthonormal basis of $\Omega_{i}^{(j)}$ for $j=1,...,J$ . For each $i$ , we can express $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda}|T)$ for all $\boldsymbol{\lambda}\in\mbox{Int}\left\{\mathcal{L}\left(\boldsymbol{p^{(i)},p^{(i+1)}}\right)\right\}$ as

[TABLE]

We can show that the fitted parameters satisfy the Lipschitz condition (A.111) over $\Lambda=\mathcal{L}\left(\boldsymbol{p^{(i)},p^{(i+1)}}\right)$ by using a proof similar to that of Lemma 1. The only difference is that the proof starts by taking directional derivatives along the columns of $U^{(i)}=(U^{(i,1)}...U^{(i,J)})$ to establish the KKT conditions. Then for all $j$ and $i$ , we have

[TABLE]

We can sum these inequalities by the triangle inequality:

[TABLE]

Finally, using the fact that $g_{j}$ is $\ell_{j}$ -Lipschitz, we have by the triangle inequality and the Cauchy-Schwarz inequality that

[TABLE]

∎

In order to extend the result in Lemma 13 to all of $\Lambda$ , we need to show that $\Lambda_{ext}$ is a set with measure zero.

Lemma 14.

Suppose Condition 2 holds. Then $\mu_{2J}(\Lambda_{ext})=0$ where $\mu_{2J}$ is the Lebesgue measure in $\mathbb{R}^{2J}$ and $\Lambda_{ext}$ was defined in (A.118).

Proof.

Suppose for contradiction that $\mu_{2J}(\Lambda_{ext})>0$ . If this is the case, then there exists a ball $B_{r}\left(\left(\boldsymbol{\lambda}^{(1)},\boldsymbol{\lambda}^{(2)}\right)\right)$ contained in $\Lambda_{ext}$ with nonzero radius $r>0$ centered at $\left(\boldsymbol{\lambda}^{(1)},\boldsymbol{\lambda}^{(2)}\right)$ where $\boldsymbol{\lambda}^{(1)}\neq\boldsymbol{\lambda}^{(2)}$ and

[TABLE]

Suppose that $\mu_{1}\left(\mathcal{L}\left(\boldsymbol{\lambda}^{(1)},\boldsymbol{\lambda}^{(2)}\right)\cap\Lambda_{smooth}^{c}\right)=\delta>0$ . We claim that for a sufficiently small radius $r^{\prime}$ , we also have

[TABLE]

To see why this claim is true, let us define a monotonically decreasing sequence $\left\{r_{i}\right\}$ where $r_{i}>0$ for all $i\in\mathbb{N}$ and $\lim_{i\rightarrow\infty}r_{i}=0$ . By the monotone convergence theorem,

[TABLE]

By the definition of limits, there is some sufficiently large $i^{\prime}$ such that for $r^{\prime}\coloneqq r_{i^{\prime}}>0$ , we have

[TABLE]

Given our ball is non-empty, there exist points $\left(\boldsymbol{\lambda}^{(3)},\boldsymbol{\lambda}^{(4)}\right),\left(\boldsymbol{\lambda}^{(5)},\boldsymbol{\lambda}^{(6)}\right)\in B_{r^{\prime}}\left(\left(\boldsymbol{\lambda}^{(1)},\boldsymbol{\lambda}^{(2)}\right)\right)$ where

[TABLE]

For any $\alpha\in(0,1)$ , the line

[TABLE]

has

[TABLE]

As the lines $\mathcal{L}_{\alpha}$ do not intersect for $\alpha\in(0,1)$ , then

[TABLE]

Thus

[TABLE]

However, this contradicts our assumption that $\mu\left(\Lambda_{smooth}^{c}\right)=0$ . ∎

Finally, combining Lemmas 13 and 14, we can show that the Lipschitz condition is satisfied over all of $\Lambda$ .

Proof for Lemma 2.

Since we already showed Lemma 13, it suffices to show that the Lipschitz condition is satisfied for any $(\boldsymbol{\lambda^{(1)}},\boldsymbol{\lambda^{(2)}})\in\Lambda_{ext}$ . Lemma 14 states that $\mu_{2J}(\Lambda_{ext})=0$ , which means that there exists a sequence $\left\{\left(\boldsymbol{\lambda}^{(1,i)},\boldsymbol{\lambda}^{(2,i)}\right)\right\}_{i=1}^{\infty}\subseteq\Lambda_{ext}^{c}$ such that $\lim_{i\rightarrow\infty}\left(\boldsymbol{\lambda}^{(1,i)},\boldsymbol{\lambda}^{(2,i)}\right)=\left(\boldsymbol{\lambda}^{(1)},\boldsymbol{\lambda}^{(2)}\right)$ . Since $L_{T}$ is continuous and we have assumed that there exists a unique minimizer $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda})$ for all $\boldsymbol{\lambda}\in\Lambda$ , $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda})$ is continuous in $\boldsymbol{\lambda}$ over all of $\Lambda$ . Since $g(\boldsymbol{\theta})(x)$ is also continuous in $\boldsymbol{\theta}$ , for any $\boldsymbol{\lambda}^{(1)},\boldsymbol{\lambda}^{(2)}\in\Lambda$ we have

[TABLE]

where $C_{\Lambda}(\boldsymbol{x}|T)$ is defined in (A.122). ∎
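The idea behind Lemma 2, that a fitted model can fail to be differentiable on a measure-zero set of penalty parameters and still be Lipschitz everywhere, can be seen in a toy case. The sketch below is an illustration under simplifying assumptions, not the paper's setting: it uses the one-dimensional lasso objective $\frac{1}{2}(z-\theta)^{2}+\lambda|\theta|$ for each coordinate, whose minimizer is the soft-thresholding function, and checks that the solution path has slope at most 1 in $\lambda$ even across the kinks where the support changes.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5*(z - theta)^2 + lam*|theta|, applied coordinate-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Illustrative inputs: three coordinates that leave the support at lam = 0.3, 0.7, 2.0.
z = np.array([2.0, -0.7, 0.3])
lams = np.linspace(0.0, 2.5, 501)

# Solution path, one row per penalty value.
path = np.array([soft_threshold(z, lam) for lam in lams])

# The support (set of nonzero coordinates) is piecewise constant in lam.
support_sizes = (path != 0).sum(axis=1)

# Difference quotients between neighboring grid points, including those that straddle a kink.
slopes = np.abs(np.diff(path, axis=0)) / np.diff(lams)[:, None]
max_slope = slopes.max()

print("support sizes seen along the path:", sorted(set(support_sizes.tolist())))
print("largest difference quotient:", max_slope)
```

On each interval where the support is constant the path is linear, so the per-interval Lipschitz bounds can be chained with the triangle inequality, which is the same device used in the proof of Lemma 13.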

A.3.3 Proof for Lemma 3

Proof.

Let $H_{0}=\left\{j\in\{1,...,J\}:\left\|\hat{g}_{j}(\boldsymbol{\lambda}^{(2)}|T)-\hat{g}_{j}(\boldsymbol{\lambda}^{(1)}|T)\right\|_{D^{(n)}}\neq 0\right\}$ . For all $j\in H_{0}$ , let

[TABLE]

For notational convenience, let $\hat{g}_{1,j}=\hat{g}_{j}(\boldsymbol{\lambda}^{(1)}|T)$ . Consider the optimization problem

[TABLE]

By the gradient optimality conditions, we have

[TABLE]

Implicit differentiation with respect to $\boldsymbol{\lambda}$ gives us

[TABLE]

From the product rule and chain rule, we can write the system of equations from (A.137) as

[TABLE]

where $L_{T}(\boldsymbol{m},\boldsymbol{\lambda})$ is the loss in (A.136).

We now bound the second term in (A.138). From (A.136) and the Cauchy-Schwarz inequality, we have for all $k=1,...,J$

[TABLE]

From the definition of $h_{k}$ , we know that $\|h_{k}\|_{T}\leq\sqrt{\frac{n_{D}}{n_{T}}}$ . By definition of $\hat{m}(\boldsymbol{\lambda})$ and $\hat{g}_{1}$ , we also have

[TABLE]

Hence

[TABLE]

By (4.40), we know $\nabla_{m}^{2}L_{T}(\boldsymbol{m},\boldsymbol{\lambda})\succeq m(T)I$ . So for all $k$ ,

[TABLE]

By the mean value inequality and the Cauchy-Schwarz inequality, we have

[TABLE]

By construction, $\left|\hat{m}_{k}(\boldsymbol{\lambda}^{(2)})-\hat{m}_{k}(\boldsymbol{\lambda}^{(1)})\right|=\left\|\hat{g}_{k}(\boldsymbol{\lambda}^{(2)}|T)-\hat{g}_{k}(\boldsymbol{\lambda}^{(1)}|T)\right\|_{D^{(n)}}$ . So we obtain our desired result in (4.41). ∎

A.4 Examples: detailed derivations

Example 1 (Multiple ridge penalties) Here we present the details for deriving (4.24) for Example 1. The additive components $g_{j}(\boldsymbol{\theta}^{(j)})(\boldsymbol{x}^{(j)})$ are linear functions that are $\ell_{j}$ -Lipschitz where $\ell_{j}(\boldsymbol{x}^{(j)})=\|\boldsymbol{x}^{(j)}\|_{2}$ . Then by Lemma 1, the fitted function $g(\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda}))(\boldsymbol{x})$ satisfies Assumption 1 over $\mathbb{R}^{p}$ with

[TABLE]

where $C^{*}_{T}$ is defined in Example 1 of the main manuscript.
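A small simulation can make the Lipschitz property of the ridge fit concrete. The sketch below is illustrative and does not use the paper's constant $C^{*}_{T}$. It assumes the objective $\frac{1}{2}\|y-X\boldsymbol{\theta}\|_{2}^{2}+\frac{1}{2}\sum_{j}\lambda_{j}\|\boldsymbol{\theta}^{(j)}\|_{2}^{2}$ with $\lambda_{j}\geq\lambda_{\min}$, for which the same three steps as in Lemma 1 (Hessian eigenvalue at least $\lambda_{\min}$, $\|\hat{\boldsymbol{\theta}}^{(j)}\|_{2}\leq\|y\|_{2}/\sqrt{\lambda_{\min}}$ by comparing with $\boldsymbol{\theta}=0$, and Cauchy-Schwarz) give the crude factor $\|\boldsymbol{x}\|_{2}\sqrt{J}\|y\|_{2}\lambda_{\min}^{-3/2}$. The helper `ridge_theta` and all data are hypothetical.

```python
import numpy as np

def ridge_theta(X, y, lam, blocks):
    """Minimizer of 0.5*||y - X theta||^2 + 0.5*sum_j lam[j]*||theta[blocks[j]]||^2."""
    d = np.zeros(X.shape[1])
    for j, idx in enumerate(blocks):
        d[idx] = lam[j]
    return np.linalg.solve(X.T @ X + np.diag(d), X.T @ y)

# Simulated training data and a single evaluation point (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
blocks = [np.array([0, 1]), np.array([2, 3])]
x_new = rng.normal(size=4)
lam_min, lam_max = 0.1, 5.0

# Crude Lipschitz factor derived under the assumptions stated in the lead-in.
bound = (np.linalg.norm(x_new) * np.sqrt(len(blocks))
         * np.linalg.norm(y) / lam_min ** 1.5)

# Largest observed ratio |g(lam1)(x) - g(lam2)(x)| / ||lam1 - lam2||_2 over random pairs.
worst_ratio = 0.0
for _ in range(200):
    lam1 = rng.uniform(lam_min, lam_max, size=len(blocks))
    lam2 = rng.uniform(lam_min, lam_max, size=len(blocks))
    diff = abs(x_new @ (ridge_theta(X, y, lam1, blocks) - ridge_theta(X, y, lam2, blocks)))
    worst_ratio = max(worst_ratio, diff / np.linalg.norm(lam1 - lam2))

print("largest observed ratio:", worst_ratio)
print("crude Lipschitz factor:", bound)
```

The observed ratio is typically far below the crude factor, since the bound takes the worst case over the whole box $[\lambda_{\min},\lambda_{\max}]^{J}$.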

Example 2 (Multiple Sobolev penalties) Here we present the details for deriving (4.24) for Example 2. Since the solution to (4.27) must be the sum of natural cubic splines (Buja et al., 1989), we can parameterize the space using a Reproducing Kernel Hilbert Space with inner product

[TABLE]

and the reproducing kernel

[TABLE]

(Heckman et al., 2012). Then one can instead solve for (4.27) over the functions $g$ of the form

[TABLE]

where the functions $g_{j}$ are split into a linear component and an orthogonal non-linear component

[TABLE]

For notational simplicity, we will also denote $\vec{R}(x|D)_{ij}=R(x_{ij},x_{j})$ . We will also write

[TABLE]

Using this finite-dimensional representation, we find that

[TABLE]

where the matrix $K_{j}$ has elements $K_{j,(u,v)}=R(x_{uj},x_{vj}).$ Since any $g_{j}$ with non-zero $\boldsymbol{\theta}_{j}$ will have a positive Sobolev penalty, then the matrix $K_{j}$ must be positive definite. Using the formulation above, we re-express (4.27) as the finite-dimensional problem

[TABLE]

where $K=(K_{1}...K_{J})$ . In order to make the fitted functions $\hat{g}_{j}$ identifiable, we add the usual constraint that $\sum_{i=1}^{n_{T}}g_{j}(x_{ij})=0$ for all $j$ . We also assume that $X_{T}^{\top}X_{T}$ is nonsingular to ensure that there is a unique $\hat{\alpha}_{1}$ .

The KKT conditions then give us

[TABLE]

where $K^{(1/2)}=(K_{1}^{1/2}...K_{J}^{1/2})$ , $I$ is the $n_{T}\times n_{T}$ identity matrix, and $P_{X_{T}}^{\top}=I-X_{T}(X_{T}^{\top}X_{T})^{-1}X_{T}^{\top}$ .

To apply Theorem 1, we need to characterize how $\hat{g}(\boldsymbol{\lambda})(\cdot)$ varies with $\boldsymbol{\lambda}$ . Since we have the closed form solution to (A.153), we use it to directly bound the Lipschitz factor $C_{\Lambda}(\boldsymbol{x}|D^{(n_{T})})$ . From Green and Silverman (1993), we know that the value of the cubic $\hat{g}_{j}$ on the interval $[t_{L},t_{R}]$ can be defined using its values and second derivatives at the ends of the interval. Let $h=t_{R}-t_{L}$ . Then the value of the cubic

[TABLE]

Let $\hat{\boldsymbol{\gamma}}_{j}$ be the vector of second derivatives $\hat{g}^{\prime\prime}_{j,\perp}$ evaluated at the observations in the training data. Since the fitted functions $\hat{g}_{j,\perp}$ must be natural cubic splines, $\hat{\boldsymbol{\gamma}}_{j}$ and $\hat{\boldsymbol{\theta}}_{j}$ have a linear relationship:

[TABLE]

where $R_{j}$ is a banded diagonally dominant matrix and $Q_{j}$ is a banded negative-semi-definite matrix, both of which depend on the covariates $x_{j}$ in the training data. For the definitions of $R_{j}$ and $Q_{j}$ , refer to Green and Silverman (1993). Let $h_{j}(D^{(n_{T})})$ be the smallest distance between observations of the $j$ th covariate in the training data $T$ . Then using the Gershgorin circle theorem (Gershgorin, 1931), one can show that all the eigenvalues of $R_{j}$ are larger than $\frac{1}{3}h_{j}(D^{(n_{T})})$ and all the eigenvalues of $Q_{j}$ have magnitudes no greater than $4/h_{j}(D^{(n_{T})})$ . Thus using (A.154) and (A.155), we have that

[TABLE]

for some absolute constant $c>0$ . To bound the second term on the right hand side, we know from (A.153) that

[TABLE]

if $\ell=j$ . Otherwise $\nabla_{\lambda_{\ell}}K_{j}\hat{\boldsymbol{\theta}}_{j}(\boldsymbol{\lambda})=0$ . Thus

[TABLE]

The eigenvalues of $K_{j}$ are bounded above by the largest row sum, which is no more than $2n_{T}$ (assuming all training covariates are between 0 and 1). Putting the results above together, we have

[TABLE]
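The Gershgorin argument used above for $R_{j}$ can be checked numerically. The sketch below assumes the standard tridiagonal matrix from Green and Silverman for a natural cubic spline with knots $t_{1}<\dots<t_{n}$, with diagonal entries $(h_{i-1}+h_{i})/3$ and off-diagonal entries $h_{i}/6$, where $h_{i}=t_{i+1}-t_{i}$; whether this matches the paper's $R_{j}$ exactly is an assumption, and the helper `spline_R` and the random knots are illustrative.

```python
import numpy as np

def spline_R(knots):
    """Tridiagonal (n-2) x (n-2) matrix for a natural cubic spline on sorted knots.

    Row i corresponds to interior knot i+1, with neighboring gaps h[i] and h[i+1].
    """
    h = np.diff(knots)
    m = len(knots) - 2
    R = np.zeros((m, m))
    for i in range(m):
        R[i, i] = (h[i] + h[i + 1]) / 3.0
        if i + 1 < m:
            R[i, i + 1] = R[i + 1, i] = h[i + 1] / 6.0
    return R, h

# Illustrative knots in [0, 1].
rng = np.random.default_rng(2)
knots = np.sort(rng.uniform(0.0, 1.0, size=12))
R, h = spline_R(knots)

# Gershgorin: every eigenvalue lies in a disc centered at R[i, i]
# with radius equal to the sum of absolute off-diagonal entries in row i.
diag = np.diag(R)
radii = np.abs(R).sum(axis=1) - np.abs(diag)
gershgorin_lower = (diag - radii).min()

eig_min = np.linalg.eigvalsh(R).min()
h_min = h.min()

print("smallest eigenvalue of R:", eig_min)
print("Gershgorin lower bound:  ", gershgorin_lower)
print("h_min / 3:               ", h_min / 3)
```

For an interior row the disc's left endpoint is $(h_{i-1}+h_{i})/6\geq h_{\min}/3$, which is where the factor $\frac{1}{3}$ comes from under this definition.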

Also, we have from (A.152) that

[TABLE]

Finally we can conclude that

[TABLE]

By the triangle inequality, we get the Lipschitz factor for the fitted model $\hat{g}$ by summing up (A.164) for $j=1,...,J$ . We find that the Lipschitz factor in (4.24) is

[TABLE]

Example 3 (Multiple elastic nets, training-validation split) Here we check that all the conditions for Lemma 2 are satisfied.

First we check Condition 1. Since the absolute value function $|\cdot|$ is twice-continuously differentiable everywhere except at zero, the directional derivatives of $||\boldsymbol{\theta}^{(j)}||_{1}$ at $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda})$ only exist along directions spanned by the columns of $\boldsymbol{I}_{I^{(j)}(\boldsymbol{\lambda})}$ . Thus the penalized training loss $L_{T}(\cdot,\boldsymbol{\lambda})$ is twice differentiable with respect to the directions in

[TABLE]

Moreover, the elastic net solution paths are piecewise linear (Zou and Hastie, 2003). This implies that the nonzero indices of the elastic net estimates stay locally constant for almost every $\boldsymbol{\lambda}$ ; so (A.166) is also a local optimality space for $L_{T}(\cdot,\boldsymbol{\lambda})$ . In addition, this implies that Condition 2 is satisfied.

We also check that the Hessian of the penalized training loss has a minimum eigenvalue bounded away from zero. Consider the following orthogonal basis of (A.166) at $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda})$ : $U(\boldsymbol{\lambda})=\{U^{(j)}(\boldsymbol{\lambda})\}_{j=1}^{J}$ where

[TABLE]

The Hessian matrix of $L_{T}(\cdot,\boldsymbol{\lambda})$ with respect to directions $U(\boldsymbol{\lambda})$ is

[TABLE]

where $\boldsymbol{X}_{T}=(\boldsymbol{X}^{(1)}...\boldsymbol{X}^{(J)})$ and $\boldsymbol{I}$ is the identity matrix with dimension equal to the number of nonzero elements in $\hat{\boldsymbol{\theta}}(\boldsymbol{\lambda})$ . Since the first summand is positive semi-definite and $\lambda_{1}>\lambda_{\min}$ , (A.168) has a minimum eigenvalue of at least $\lambda_{\min}w$ .
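The eigenvalue argument can be illustrated with a small numerical check. The sketch below assumes the restricted Hessian has the form $\boldsymbol{X}_{A}^{\top}\boldsymbol{X}_{A}+w\,\mathrm{diag}(\lambda_{j})$ over an active set $A$, which is a guess at the structure of (A.168) rather than a transcription of it. The active set is chosen by hand, not computed by an elastic net solver, and the data are simulated. With more active columns than observations, the Gram matrix is singular, so the lower bound comes entirely from the ridge part of the penalty.

```python
import numpy as np

# Simulated design with fewer observations than active coefficients.
rng = np.random.default_rng(3)
n, p = 5, 12
X = rng.normal(size=(n, p))
blocks = [np.arange(0, 6), np.arange(6, 12)]

# Penalty parameters, each at least lam_min; w weights the ridge part (assumed form).
lam = np.array([0.3, 1.2])
lam_min, w = 0.1, 0.5

# Hypothetical active set (indices of nonzero coefficients), chosen by hand.
active = np.array([0, 1, 2, 3, 6, 7, 8, 9])

# Map each coordinate to the penalty parameter of its block.
lam_per_coord = np.zeros(p)
for j, idx in enumerate(blocks):
    lam_per_coord[idx] = lam[j]

X_A = X[:, active]
gram = X_A.T @ X_A                                # positive semi-definite summand
hessian = gram + w * np.diag(lam_per_coord[active])

gram_rank = np.linalg.matrix_rank(gram)
min_eig = np.linalg.eigvalsh(hessian).min()

print("rank of Gram matrix:", gram_rank, "of", len(active))
print("min eigenvalue of restricted Hessian:", min_eig)
print("lower bound lam_min * w:", lam_min * w)
```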

Example 4 (Multiple elastic nets, cross-validation) Here we present details for establishing an oracle inequality when multiple elastic net penalties are tuned via the averaged version of $K$ -fold cross-validation. First we check the conditions in Theorem 2 are satisfied. In the problem setup, $X$ is a log-concave vector and $\sup_{\|a\|_{\infty}=1}\left\|X^{\top}a\right\|_{L_{\psi_{2}}}<c_{R}<\infty$ for some constant $c_{R}$ . Using a similar procedure as Lecué and Mitchell (2012), we can then show that (3.14) and (3.15) in Assumption 2 are satisfied with $K_{0}\coloneqq(\|\boldsymbol{\theta}^{*}\|_{\infty}+K_{0}^{\prime})c_{R}$ .

Next we find the Lipschitz factor. We can upper bound the Lipschitz factor of the thresholded model with the Lipschitz factor of the un-thresholded model. So Assumption 1 is satisfied over $\mathbb{R}^{p}$ with

[TABLE]

Finally, to apply Theorem 2, we must find a bound for (3.16). Let $\sigma_{0}=O_{p}(n^{4t_{\min}}R^{4}Jp/w^{2})$ . Using the fact that $\left\|C_{\Lambda}(\cdot|D^{(n_{T})})\right\|_{L_{\psi_{2}}}^{2}$ is a linear function of $\|\epsilon\|_{D^{(n_{T})}}^{2}$ , which is a sub-exponential random variable, we have that

[TABLE]

for constants $c_{0},c_{1}>0$ . Plugging this bound into Theorem 2 gives us our desired result.

Bibliography


1. Arlot et al. [2010] Sylvain Arlot, Alain Celisse, et al. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.
2. Györfi et al. [2006] László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
3. Van Der Laan and Dudoit [2003] Mark J Van Der Laan and Sandrine Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. 2003.
4. van der Laan et al. [2004] Mark J van der Laan, Sandrine Dudoit, and Sunduz Keles. Asymptotic optimality of likelihood-based cross-validation. Statistical Applications in Genetics and Molecular Biology, 3(1):1–23, 2004.
5. Lecué and Mitchell [2012] Guillaume Lecué and Charles Mitchell. Oracle inequalities for cross-validation type procedures. Electronic Journal of Statistics, 6:1803–1837, 2012.
6. Bengio [2000] Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.
7. Zou and Hastie [2003] Hui Zou and Trevor Hastie. Regression shrinkage and selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67:301–320, 2003.
8. Simon et al. [2013] Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013.


