Two models of double descent for weak features

Mikhail Belkin; Daniel Hsu; Ji Xu

arXiv:1903.07571·cs.LG·December 22, 2020


TL;DR

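This paper gives a precise mathematical analysis of the double descent risk curve in two simple data models with the least squares/least norm predictor. The prediction risk peaks when the number of features $p$ is close to the sample size $n$ and decreases towards its minimum as $p$ grows beyond $n$, in contrast with "prescient" models that select features in an a priori optimal order.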

Contribution

It analyzes two simple data models (a Gaussian model and a Fourier series model) and gives a precise mathematical description of the double descent risk curve for the least squares/least norm predictor.

Findings

01 Risk peaks when the number of features $p$ is close to the sample size $n$

02 Risk decreases towards its minimum as $p$ increases beyond $n$

03 This behavior contrasts with "prescient" models that select features in an a priori optimal order

Abstract

The "double descent" risk curve was proposed to qualitatively describe the out-of-sample prediction accuracy of variably-parameterized machine learning models. This article provides a precise mathematical analysis for the shape of this curve in two simple data models with the least squares/least norm predictor. Specifically, it is shown that the risk peaks when the number of features $p$ is close to the sample size $n$, but also that the risk decreases towards its minimum as $p$ increases beyond $n$. This behavior is contrasted with that of "prescient" models that select features in an a priori optimal order.

Equations

$$y=\boldsymbol{x}^{*}\boldsymbol{\beta}+\sigma\epsilon=\sum_{j=1}^{D}x_{j}\beta_{j}+\sigma\epsilon.$$

$$\hat{\boldsymbol{\beta}}_{T}:=\boldsymbol{X}_{T}^{\dagger}\boldsymbol{y},\qquad\hat{\boldsymbol{\beta}}_{T^{c}}:=\boldsymbol{0}.$$

$$\mathbb{E}[(y-\boldsymbol{x}^{*}\hat{\boldsymbol{\beta}})^{2}]=\begin{cases}(\|\boldsymbol{\beta}_{T^{c}}\|^{2}+\sigma^{2})\cdot\left(1+\frac{p}{n-p-1}\right)&\text{if }p\leq n-2;\\+\infty&\text{if }n-1\leq p\leq n+1;\\\|\boldsymbol{\beta}_{T}\|^{2}\cdot\left(1-\frac{n}{p}\right)+(\|\boldsymbol{\beta}_{T^{c}}\|^{2}+\sigma^{2})\cdot\left(1+\frac{n}{p-n-1}\right)&\text{if }p\geq n+2.\end{cases}$$

$$\mathbb{E}[(y-\boldsymbol{x}^{*}\hat{\boldsymbol{\beta}})^{2}]=\begin{cases}\left(\left(1-\frac{p}{D}\right)\cdot\|\boldsymbol{\beta}\|^{2}+\sigma^{2}\right)\cdot\left(1+\frac{p}{n-p-1}\right)&\text{if }p\leq n-2;\\\|\boldsymbol{\beta}\|^{2}\cdot\left(1-\frac{n}{D}\cdot\left(2-\frac{D-n-1}{p-n-1}\right)\right)+\sigma^{2}\cdot\left(1+\frac{n}{p-n-1}\right)&\text{if }p\geq n+2.\end{cases}$$

$$\mathbb{E}[\|\boldsymbol{\beta}_{T}\|^{2}]=\frac{p}{D}\cdot\|\boldsymbol{\beta}\|^{2},\qquad\mathbb{E}[\|\boldsymbol{\beta}_{T^{c}}\|^{2}]=\left(1-\frac{p}{D}\right)\cdot\|\boldsymbol{\beta}\|^{2}.$$

$$\mathbb{E}[(y-\boldsymbol{x}^{*}\boldsymbol{\beta}^{\prime})^{2}]=\sigma^{2}+\|\boldsymbol{\beta}-\boldsymbol{\beta}^{\prime}\|^{2}=\sigma^{2}+\|\boldsymbol{\beta}_{T^{c}}-\boldsymbol{\beta}^{\prime}_{T^{c}}\|^{2}+\|\boldsymbol{\beta}_{T}-\boldsymbol{\beta}^{\prime}_{T}\|^{2}.$$

$$\mathbb{E}[(y-\boldsymbol{x}^{*}\hat{\boldsymbol{\beta}})^{2}]=\sigma^{2}+\|\boldsymbol{\beta}_{T^{c}}\|^{2}+\mathbb{E}[\|\boldsymbol{\beta}_{T}-\hat{\boldsymbol{\beta}}_{T}\|^{2}].$$

$$\mathbb{E}[(y-\boldsymbol{x}^{*}\hat{\boldsymbol{\beta}})^{2}]=\begin{cases}(\|\boldsymbol{\beta}_{T^{c}}\|^{2}+\sigma^{2})\cdot\left(1+\frac{p}{n-p-1}\right)&\text{if }p\leq n-2;\\+\infty&\text{if }p\in\{n-1,n\}.\end{cases}$$

$$\begin{aligned}\boldsymbol{\beta}_{T}-\hat{\boldsymbol{\beta}}_{T}&=\boldsymbol{\beta}_{T}-\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}(\boldsymbol{X}_{T}\boldsymbol{\beta}_{T}+\boldsymbol{\eta})\\&=(\boldsymbol{I}-\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{X}_{T})\boldsymbol{\beta}_{T}-\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}.\end{aligned}$$

$$\|\boldsymbol{\beta}_{T}-\hat{\boldsymbol{\beta}}_{T}\|^{2}=\|(\boldsymbol{I}-\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{X}_{T})\boldsymbol{\beta}_{T}\|^{2}+\|\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\|^{2}.$$

$$\|(\boldsymbol{I}-\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{X}_{T})\boldsymbol{\beta}_{T}\|^{2}=\|\boldsymbol{\beta}_{T}\|^{2}-\|\boldsymbol{\Pi}_{T}\boldsymbol{\beta}_{T}\|^{2}.$$

$$\mathbb{E}[\|\boldsymbol{\Pi}_{T}\boldsymbol{\beta}_{T}\|^{2}]=\|\boldsymbol{\beta}_{T}\|^{2}\cdot\frac{n}{p}.$$

$$\mathbb{E}[\|(\boldsymbol{I}-\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{X}_{T})\boldsymbol{\beta}_{T}\|^{2}]=\|\boldsymbol{\beta}_{T}\|^{2}\cdot\left(1-\frac{n}{p}\right).$$

$$\|\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\|^{2}=\operatorname{tr}\left((\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\boldsymbol{\eta}^{*}\right)=\operatorname{tr}\left((\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\boldsymbol{\eta}^{*}\right)$$

$$\mathbb{E}[\|\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\|^{2}]=\operatorname{tr}\left(\mathbb{E}[(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}]\,\mathbb{E}[\boldsymbol{\eta}\boldsymbol{\eta}^{*}]\right).$$

$$\mathbb{E}[\boldsymbol{\eta}\boldsymbol{\eta}^{*}]=(\|\boldsymbol{\beta}_{T^{c}}\|^{2}+\sigma^{2})\cdot\boldsymbol{I}.$$

$$\operatorname{tr}\left(\mathbb{E}[(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}]\right)=\begin{cases}\frac{n}{p-n-1}&\text{if }p\geq n+2;\\+\infty&\text{if }p\in\{n,n+1\}.\end{cases}$$

$$\mathbb{E}[\|\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\|^{2}]=\begin{cases}(\|\boldsymbol{\beta}_{T^{c}}\|^{2}+\sigma^{2})\cdot\frac{n}{p-n-1}&\text{if }p\geq n+2;\\+\infty&\text{if }p\in\{n,n+1\}.\end{cases}$$

$$\|\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}\|^{2}$$

$$1-2\exp\left(-\frac{p\epsilon^{4}(\sqrt{\alpha^{-1}}-1)^{2}}{24((2-\epsilon)\sqrt{\alpha^{-1}}+\epsilon)^{2}}\right)-2\exp\left(-\frac{p(1-\epsilon)^{2}(\sqrt{\alpha^{-1}}-1)^{2}}{2}\right)-2p\exp\left(-\frac{p(\alpha^{-1}-1)\epsilon^{2}}{24}\right).$$

$$\|\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}\|^{2}$$

$$1-2\exp\left(-\frac{n\epsilon^{2}}{12}\right)-2\exp\left(-\frac{n\epsilon^{4}(\sqrt{\alpha}-1)^{2}}{24((2-\epsilon)\sqrt{\alpha}+\epsilon)^{2}}\right)-2\exp\left(-\frac{n(1-\epsilon)^{2}(\sqrt{\alpha}-1)^{2}}{2}\right)-2n\exp\left(-\frac{n(\alpha-1)\epsilon^{2}}{24}\right).$$

$$\|\boldsymbol{\beta}_{T}-\hat{\boldsymbol{\beta}}_{T}\|^{2}$$

$$\left|\|\boldsymbol{\beta}_{T}\|^{2}-\frac{p}{D}\|\boldsymbol{\beta}\|^{2}\right|=\left|\|\boldsymbol{\beta}_{T^{c}}\|^{2}-\left(1-\frac{p}{D}\right)\|\boldsymbol{\beta}\|^{2}\right|$$

$$F_{i,j}=\frac{1}{\sqrt{D}}\,\omega^{(i-1)(j-1)},$$

$$\hat{\boldsymbol{\beta}}_{T}:=\boldsymbol{F}_{S,T}^{\dagger}\boldsymbol{\mu}_{S},\qquad\hat{\boldsymbol{\beta}}_{T^{c}}:=\boldsymbol{0}.$$

$$\boldsymbol{F}_{S,T}^{\dagger}=\begin{cases}\boldsymbol{F}_{S,T}^{*}(\boldsymbol{F}_{S,T}\boldsymbol{F}_{S,T}^{*})^{-1}&\text{if }|T|\geq|S|;\\(\boldsymbol{F}_{S,T}^{*}\boldsymbol{F}_{S,T})^{-1}\boldsymbol{F}_{S,T}^{*}&\text{if }|T|\leq|S|.\end{cases}$$

$$\mathbb{E}[\boldsymbol{\beta}\boldsymbol{\beta}^{*}]=\frac{1}{D}\cdot\boldsymbol{I}$$

$$\|\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}\|^{2}$$


Full text

Two models of double descent for weak features

Mikhail Belkin

Halıcıoğlu Data Science Institute, UC San Diego, La Jolla, CA

Daniel Hsu

Department of Computer Science, Columbia University, New York, NY

Data Science Institute, Columbia University, New York, NY

Ji Xu

Department of Computer Science, Columbia University, New York, NY



1 Introduction

The “double descent” risk curve was proposed by [Bel+19] as a general way to qualitatively describe the out-of-sample prediction performance of variably-parameterized machine learning models. This risk curve reconciles the classical bias-variance trade-off with the behavior of predictive models that interpolate training data, as observed for several model families (including neural networks) in a wide variety of applications (see Section 1.1 for references). In these studies, a predictive model with $p$ parameters is fit to a training sample of size $n$ , and the test risk (i.e., out-of-sample error) is examined as a function of $p$ . When $p$ is below the sample size $n$ (for regression or binary classification), the test risk is governed by the usual bias-variance decomposition. As $p$ is increased towards $n$ , the training risk (i.e., in-sample error) is driven to zero, but the test risk shoots up, sometimes toward infinity. The classical bias-variance analysis identifies a “sweet spot” value of $p\in[0,n]$ at which the bias and variance are balanced to achieve low test risk. However, in the “modern regime”, as $p$ grows beyond $n$ , the training risk remains zero, but the test risk decreases again, even when fitting noisy data, provided that the model is fit using a suitable inductive bias (e.g., least norm solution). In many (but not all) cases from [Bel+19], the limiting risk as $p\to\infty$ is lower than what is achieved at the “sweet spot” value of $p$ .

In this article, we show that key aspects of the “double descent” risk curve can be observed with the least squares/least norm predictor in two simple random features models. The first is a Gaussian model studied by [BF83] in the classical $p\leq n$ regime, while the second is a Fourier series model for functions on the circle. In both cases, we prove that the risk is infinite around $p=n$ , and decreases again as $p$ increases beyond $n$ . When the signal-to-noise ratio is high, the minimum risk is, in fact, achieved in the modern regime, when $p>n$ . Our results provide a precise mathematical analysis in a simple and tractable setting of the mechanism that was qualitatively described by [Bel+19]. In particular, it captures a key aspect of many practical over-parameterized models: that increasing the number of parameters to the maximum can lead to better performance. We also establish some non-asymptotic concentration phenomena in the Gaussian model.

We note that in both of the models, the features are selected randomly, which makes them useful for studying scenarios where features are plentiful but individually too “weak” to be selected in an informed manner. Such scenarios are commonplace in machine learning practice, and they should be contrasted with “scientific” scenarios where features are carefully designed or curated, as is often the case in scientific applications. For comparison, we give an example of “prescient” feature selection, where the $p$ features a priori known to be most useful are included in the model. In this case, the optimal test risk is achieved at some $p\leq n$ , which is consistent with the classical analysis of [BF83].

1.1 Related and concurrent works

The “double descent” risk curve was posited by [Bel+19] to connect the classical bias-variance trade-off to behaviors observed in over-parameterized regimes for a variety of machine learning models. The shape and features of the risk curve itself appear throughout the literature in a number of contexts (e.g., [vallet1989linear, opper1990ability, le1991eigenvalues, krogh1992generalization, bos1998dynamics, watkin1993statistical, advani2017high]); see also [Loo+20] for a “brief prehistory” that focuses on the curious peak in the curve. These prior works analyze the risk of linear classification and regression models and neural networks in high-dimensional asymptotic regimes. Our analysis in the Gaussian model gives an exact expression for the risk for any finite sample size and number of parameters.

More recently, [Nea+18] observe that similar phenomena in neural networks can be explained by a variance reduction effect of increasing network width. The transition from under- to over-parametrized regimes was recently analyzed by [Spi+18] by drawing a connection to the physical phenomenon of “jamming” in a class of glassy systems. Our analysis makes these ideas concrete and explicit in the context of simple regression models. For instance, our analysis captures the transition from under- to over-parameterized regimes at a point where an inverse Wishart random matrix has no finite expectation. It also allows us to compare the risks at any points in the curve and explain how the risk in the over-parameterized regime can be lower than any risk in the under-parameterized regime.

The initial version of this article [BHX19] appeared concurrently with the works of [Has+19], [Mut+20], and [Bar+20], all of which also study the behavior of the least squares/least norm predictor in over-parameterized linear regression. [Mut+20] focus on the well-specified scenario (essentially, $p=D$ ) and provide upper-bounds on the risk that go to zero as $p\to\infty$ . (A related variance analysis was carried out by [Nea+18].) [Has+19] provide a much broader range of analyses in the high-dimensional asymptotic regime, including a “misspecified” setup that is related to ours. Their analyses require weaker distributional assumptions than ours, owing to their reliance on asymptotic analysis. (A special case of the results in the follow-up work by [XH19] further broadens the range of analyses to allow highly non-isotropic designs, but again only in the high-dimensional asymptotic regime.) The analysis of [Has+19] also considers the effect of ridge regularization; in particular, they show that when the optimal level of regularization is used, the risk curve no longer shows the “double descent” shape. Finally, [Bar+20] study non-asymptotic upper and lower bounds on the risk in the over-parameterized regime, and provide a characterization in terms of certain “effective dimensions” based on the tail of the eigenvalue sequence of the covariance operator.

2 Gaussian model

We consider a regression problem where the response $y$ is equal to a linear function ${\boldsymbol{\beta}}=(\beta_{1},\dotsc,\beta_{D})\in{\mathbb{R}}^{D}$ of $D$ real-valued variables ${\boldsymbol{x}}=(x_{1},\dotsc,x_{D})$ plus noise $\sigma\epsilon$ :

$$y=\boldsymbol{x}^{*}\boldsymbol{\beta}+\sigma\epsilon=\sum_{j=1}^{D}x_{j}\beta_{j}+\sigma\epsilon.$$

Given $n$ iid copies $(({\boldsymbol{x}}^{(i)},y^{(i)}))_{i=1}^{n}$ of $({\boldsymbol{x}},y)$ , we fit a linear model to the data only using a subset ${T}\subseteq[D]:=\{1,\dotsc,D\}$ of $p:=|{T}|$ variables.

Let ${\boldsymbol{X}}:=[{\boldsymbol{x}}^{(1)}|\dotsb|{\boldsymbol{x}}^{(n)}]^{*}$ be the $n\times D$ design matrix, and let ${\boldsymbol{y}}:=(y^{(1)},\dotsc,y^{(n)})$ be the vector of responses. For a subset $A\subseteq[D]$ and a $D$ -dimensional vector ${\boldsymbol{v}}$ , we use ${\boldsymbol{v}}_{A}:=(v_{j}:j\in A)$ to denote its $|A|$ -dimensional subvector of entries from $A$ ; we also use ${\boldsymbol{X}}_{A}:=[{\boldsymbol{x}}_{A}^{(1)}|\dotsb|{\boldsymbol{x}}_{A}^{(n)}]^{*}$ to denote the $n\times|A|$ design matrix with variables from $A$ . For $A\subseteq[D]$ , we denote its complement by $A^{c}:=[D]\setminus A$ . Finally, $\|\cdot\|$ denotes the Euclidean norm.

We fit regression coefficients $\hat{\boldsymbol{\beta}}=(\hat{\beta}_{1},\dotsc,\hat{\beta}_{D})$ with

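$$\hat{\boldsymbol{\beta}}_{T}:=\boldsymbol{X}_{T}^{\dagger}\boldsymbol{y},\qquad\hat{\boldsymbol{\beta}}_{T^{c}}:=\boldsymbol{0}.$$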

Above, the symbol † denotes the Moore-Penrose pseudoinverse. In other words, we use the solution to the normal equations ${\boldsymbol{X}}_{T}^{*}{\boldsymbol{X}}_{T}{\boldsymbol{v}}={\boldsymbol{X}}_{T}^{*}{\boldsymbol{y}}$ of least norm for $\hat{\boldsymbol{\beta}}_{T}$ and force $\hat{\boldsymbol{\beta}}_{{T}^{c}}$ to all-zeros.

In this section, our analysis assumes a model in which $({\boldsymbol{x}},\epsilon)$ follows a standard multivariate Gaussian distribution. This Gaussian model was also studied by [BF83], although their analysis is restricted to the case where the number of variables used $p$ is always at most $n$ ; our analysis will also consider the $p\geq n$ regime.
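The fitting rule can be made concrete with a short NumPy sketch. The helper name `fit_least_norm`, the problem sizes, and the random seed are illustrative assumptions, not taken from the paper; `np.linalg.pinv` plays the role of the Moore-Penrose pseudoinverse.

```python
import numpy as np

def fit_least_norm(X, y, T):
    """Return beta_hat in R^D with beta_hat[T] = pinv(X[:, T]) @ y and zeros elsewhere."""
    D = X.shape[1]
    beta_hat = np.zeros(D)
    beta_hat[T] = np.linalg.pinv(X[:, T]) @ y   # least squares / least norm solution
    return beta_hat

# Illustrative sizes (assumed): n samples, D variables, p = 25 selected variables.
rng = np.random.default_rng(0)
n, D, sigma = 10, 40, 0.5
beta = rng.standard_normal(D) / np.sqrt(D)
X = rng.standard_normal((n, D))                 # rows are iid standard normal x
y = X @ beta + sigma * rng.standard_normal(n)

T = rng.choice(D, size=25, replace=False)       # p = 25 > n: interpolating regime
beta_hat = fit_least_norm(X, y, T)
train_residual = np.linalg.norm(X[:, T] @ beta_hat[T] - y)
print(train_residual)                           # numerically zero when p >= n
```

Because $p\geq n$ here, the selected columns almost surely span ${\mathbb{R}}^{n}$, so the fit interpolates the training responses even though they are noisy.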

2.1 Prediction risk

We derive a formula for the (prediction) risk of $\hat{\boldsymbol{\beta}}$ for an arbitrary choice of $p$ features ${T}\subseteq[D]$ , and then examine this risk under particular selection models for ${T}$ .

Theorem 1.

Assume the distribution of ${\boldsymbol{x}}$ is the standard normal in ${\mathbb{R}}^{D}$ , $\epsilon$ is a standard normal random variable independent of ${\boldsymbol{x}}$ , and $y={\boldsymbol{x}}^{*}{\boldsymbol{\beta}}+\sigma\epsilon$ for some ${\boldsymbol{\beta}}\in{\mathbb{R}}^{D}$ and $\sigma>0$ . Pick any $p\in\{0,\dotsc,D\}$ and ${T}\subseteq[D]$ of cardinality $p$ . The risk of $\hat{\boldsymbol{\beta}}$ , where $\hat{\boldsymbol{\beta}}_{T}={\boldsymbol{X}}_{T}^{\dagger}{\boldsymbol{y}}$ and $\hat{\boldsymbol{\beta}}_{{T}^{c}}={\boldsymbol{0}}$ , is

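$$\mathbb{E}[(y-\boldsymbol{x}^{*}\hat{\boldsymbol{\beta}})^{2}]=\begin{cases}(\|\boldsymbol{\beta}_{T^{c}}\|^{2}+\sigma^{2})\cdot\left(1+\frac{p}{n-p-1}\right)&\text{if }p\leq n-2;\\+\infty&\text{if }n-1\leq p\leq n+1;\\\|\boldsymbol{\beta}_{T}\|^{2}\cdot\left(1-\frac{n}{p}\right)+(\|\boldsymbol{\beta}_{T^{c}}\|^{2}+\sigma^{2})\cdot\left(1+\frac{n}{p-n-1}\right)&\text{if }p\geq n+2.\end{cases}$$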

The proof of Theorem 1 is not hard; we give the details in Section 2.2. We now turn to the risk of $\hat{\boldsymbol{\beta}}$ under a random selection model for ${T}$.

Corollary 1.

Let ${T}$ be a uniformly random subset of $[D]$ of cardinality $p$ . In the setting of Theorem 1, the risk of $\hat{\boldsymbol{\beta}}$ (taking expectation with respect to the random choice of ${T}$ in addition to the random design matrix and response vector) satisfies

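$$\mathbb{E}[(y-\boldsymbol{x}^{*}\hat{\boldsymbol{\beta}})^{2}]=\begin{cases}\left(\left(1-\frac{p}{D}\right)\cdot\|\boldsymbol{\beta}\|^{2}+\sigma^{2}\right)\cdot\left(1+\frac{p}{n-p-1}\right)&\text{if }p\leq n-2;\\\|\boldsymbol{\beta}\|^{2}\cdot\left(1-\frac{n}{D}\cdot\left(2-\frac{D-n-1}{p-n-1}\right)\right)+\sigma^{2}\cdot\left(1+\frac{n}{p-n-1}\right)&\text{if }p\geq n+2.\end{cases}$$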

Proof.

Since ${T}$ is a uniformly random subset of $[D]$ of cardinality $p$ ,

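$$\mathbb{E}[\|\boldsymbol{\beta}_{T}\|^{2}]=\frac{p}{D}\cdot\|\boldsymbol{\beta}\|^{2},\qquad\mathbb{E}[\|\boldsymbol{\beta}_{T^{c}}\|^{2}]=\left(1-\frac{p}{D}\right)\cdot\|\boldsymbol{\beta}\|^{2}.$$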

Plugging into Theorem 1 completes the proof. ∎
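The averaging step in this proof, ${\mathbb{E}}[\|{\boldsymbol{\beta}}_{T}\|^{2}]=(p/D)\cdot\|{\boldsymbol{\beta}}\|^{2}$ for a uniformly random ${T}$ of size $p$, can be checked exactly by enumerating all subsets when $D$ is small. The coefficient vector below is an arbitrary illustrative choice, and the helper name is hypothetical.

```python
from itertools import combinations
import numpy as np

def mean_sq_norm_over_subsets(beta, p):
    """Average of ||beta_T||^2 over all subsets T of size p (exact enumeration)."""
    D = len(beta)
    subsets = list(combinations(range(D), p))
    return sum(np.sum(beta[list(T)] ** 2) for T in subsets) / len(subsets)

beta = np.array([3.0, -1.0, 0.5, 2.0, 0.0, -4.0])   # arbitrary illustrative coefficients
D, p = len(beta), 2
lhs = mean_sq_norm_over_subsets(beta, p)            # E ||beta_T||^2 under uniform T
rhs = (p / D) * np.sum(beta ** 2)                   # (p / D) * ||beta||^2
print(lhs, rhs)
```

Each index belongs to the same fraction $p/D$ of the subsets, which is why the two quantities agree for any ${\boldsymbol{\beta}}$.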

Thus, assuming $D>n+1$, we observe that the risk first increases with $p$ up to the “interpolation threshold” ($p=n$), after which the risk decreases with $p$. Moreover, when the signal-to-noise ratio $\|{\boldsymbol{\beta}}\|^{2}/\sigma^{2}$ is larger than $D/(D-n-1)$, the risk is smallest at $p=D$; in particular, it is smaller than the risk at any $p\leq n$. This is the “double descent” risk curve where the first “descent” is degenerate (i.e., the “sweet spot” that balances bias and variance is at $p=0$). See Figure 1 for an illustration.
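As a rough illustration, the piecewise risk from Corollary 1 can be evaluated over all $p$. The function name and the particular values of $n$, $D$, $\|{\boldsymbol{\beta}}\|^{2}$, and $\sigma^{2}$ are assumptions chosen so that the signal-to-noise ratio exceeds $D/(D-n-1)$; the infinite value for $n-1\leq p\leq n+1$ comes from Theorem 1.

```python
import math

def risk_random_selection(p, n, D, beta_sq, sigma_sq):
    """Risk from Corollary 1 for a uniformly random T with |T| = p."""
    if p <= n - 2:
        return ((1 - p / D) * beta_sq + sigma_sq) * (1 + p / (n - p - 1))
    if p >= n + 2:
        signal = beta_sq * (1 - (n / D) * (2 - (D - n - 1) / (p - n - 1)))
        noise = sigma_sq * (1 + n / (p - n - 1))
        return signal + noise
    return math.inf   # n - 1 <= p <= n + 1 (infinite risk by Theorem 1)

# Illustrative values (assumed): signal-to-noise ratio 25 > D / (D - n - 1).
n, D = 40, 100
beta_sq, sigma_sq = 1.0, 1.0 / 25
risks = [risk_random_selection(p, n, D, beta_sq, sigma_sq) for p in range(D + 1)]
best_p = min(range(D + 1), key=lambda p: risks[p])
print(best_p, risks[0], risks[D])
```

With these values the curve rises from $p=0$, is infinite around $p=n$, and then falls below its $p=0$ level as $p$ approaches $D$.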

It is worth pointing out that the behavior under the random selection model of ${T}$ can be very different from that under a deterministic model of ${T}$ . Consider including variables in ${T}$ by decreasing order of $\beta_{j}^{2}$ —a kind of “prescient” selection model studied by [BF83]. The behavior of the risk as a function of $p$ , illustrated in Figure 2, reveals a striking difference between the random selection model and the “prescient” selection model.
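The contrast can be sketched numerically by plugging a “prescient” ordering into the formula of Theorem 1. The coefficient profile below ($\beta_{j}^{2}$ proportional to $1/j^{2}$, normalized so that $\|{\boldsymbol{\beta}}\|^{2}=1$) is a hypothetical choice for illustration only and is not claimed to match the setting of Figure 2; the function name is likewise an assumption.

```python
import math
import numpy as np

def risk_fixed_T(beta_T_sq, beta_Tc_sq, p, n, sigma_sq):
    """Risk from Theorem 1 for a fixed feature set T with |T| = p."""
    if p <= n - 2:
        return (beta_Tc_sq + sigma_sq) * (1 + p / (n - p - 1))
    if p >= n + 2:
        return (beta_T_sq * (1 - n / p)
                + (beta_Tc_sq + sigma_sq) * (1 + n / (p - n - 1)))
    return math.inf   # n - 1 <= p <= n + 1

n, D, sigma_sq = 40, 100, 0.04
# Hypothetical profile: beta_j^2 proportional to 1 / j^2, scaled to ||beta||^2 = 1.
beta_sq_sorted = 1.0 / np.arange(1, D + 1) ** 2
beta_sq_sorted /= beta_sq_sorted.sum()
# head[p] = ||beta_T||^2 when T holds the p largest coefficients.
head = np.concatenate(([0.0], np.cumsum(beta_sq_sorted)))

prescient = [risk_fixed_T(head[p], 1.0 - head[p], p, n, sigma_sq) for p in range(D + 1)]
best_p_prescient = int(np.argmin(prescient))
print(best_p_prescient, prescient[best_p_prescient], prescient[D])
```

For this profile the minimizing $p$ lies in the classical regime $p\leq n$, and the risk there is lower than the risk at $p=D$.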

2.2 Proof of Theorem 1

Recall that ${\boldsymbol{x}}$ is assumed to follow a standard normal distribution in ${\mathbb{R}}^{D}$ . Since ${\boldsymbol{x}}$ is isotropic (i.e., zero mean and identity covariance), the mean squared prediction error of any ${\boldsymbol{\beta}}^{\prime}\in{\mathbb{R}}^{D}$ can be written as

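$$\mathbb{E}[(y-\boldsymbol{x}^{*}\boldsymbol{\beta}^{\prime})^{2}]=\sigma^{2}+\|\boldsymbol{\beta}-\boldsymbol{\beta}^{\prime}\|^{2}=\sigma^{2}+\|\boldsymbol{\beta}_{T^{c}}-\boldsymbol{\beta}^{\prime}_{T^{c}}\|^{2}+\|\boldsymbol{\beta}_{T}-\boldsymbol{\beta}^{\prime}_{T}\|^{2}.$$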
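For completeness, the identity for a fixed ${\boldsymbol{\beta}}^{\prime}$ can be verified in a few lines. This is a sketch that uses only the isotropy of ${\boldsymbol{x}}$ (so ${\mathbb{E}}[{\boldsymbol{x}}{\boldsymbol{x}}^{*}]={\boldsymbol{I}}$) and the independence of $\epsilon$ and ${\boldsymbol{x}}$ (so the cross term vanishes):

```latex
\begin{align*}
\mathbb{E}[(y - \boldsymbol{x}^{*}\boldsymbol{\beta}')^{2}]
  &= \mathbb{E}[(\boldsymbol{x}^{*}(\boldsymbol{\beta} - \boldsymbol{\beta}') + \sigma\epsilon)^{2}] \\
  &= (\boldsymbol{\beta} - \boldsymbol{\beta}')^{*}\,\mathbb{E}[\boldsymbol{x}\boldsymbol{x}^{*}]\,(\boldsymbol{\beta} - \boldsymbol{\beta}')
     + 2\sigma\,\mathbb{E}[\epsilon\,\boldsymbol{x}^{*}]\,(\boldsymbol{\beta} - \boldsymbol{\beta}')
     + \sigma^{2}\,\mathbb{E}[\epsilon^{2}] \\
  &= \|\boldsymbol{\beta} - \boldsymbol{\beta}'\|^{2} + \sigma^{2}.
\end{align*}
```

Splitting the squared norm over the coordinates in ${T}$ and ${T}^{c}$ then gives the two-part form.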

Since $\hat{\boldsymbol{\beta}}_{{T}^{c}}={\boldsymbol{0}}$ , it follows that the risk of $\hat{\boldsymbol{\beta}}$ is

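$$\mathbb{E}[(y-\boldsymbol{x}^{*}\hat{\boldsymbol{\beta}})^{2}]=\sigma^{2}+\|\boldsymbol{\beta}_{T^{c}}\|^{2}+\mathbb{E}[\|\boldsymbol{\beta}_{T}-\hat{\boldsymbol{\beta}}_{T}\|^{2}].$$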

Classical regime.

The risk of $\hat{\boldsymbol{\beta}}$ was computed by [BF83] in the regime where $p\leq n$ :

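$$\mathbb{E}[(y-\boldsymbol{x}^{*}\hat{\boldsymbol{\beta}})^{2}]=\begin{cases}(\|\boldsymbol{\beta}_{T^{c}}\|^{2}+\sigma^{2})\cdot\left(1+\frac{p}{n-p-1}\right)&\text{if }p\leq n-2;\\+\infty&\text{if }p\in\{n-1,n\}.\end{cases}$$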

Interpolating regime.

We consider the regime where $p\geq n$ . Recall that the pseudoinverse of ${\boldsymbol{X}}_{T}$ can be written as ${\boldsymbol{X}}_{T}^{{\dagger}}={\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}$ . Thus, letting ${\boldsymbol{\eta}}:={\boldsymbol{y}}-{\boldsymbol{X}}_{T}{\boldsymbol{\beta}}_{T}$ ,

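$$\begin{aligned}\boldsymbol{\beta}_{T}-\hat{\boldsymbol{\beta}}_{T}&=\boldsymbol{\beta}_{T}-\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}(\boldsymbol{X}_{T}\boldsymbol{\beta}_{T}+\boldsymbol{\eta})\\&=(\boldsymbol{I}-\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{X}_{T})\boldsymbol{\beta}_{T}-\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}.\end{aligned}$$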

On the right hand side, the first term $({\boldsymbol{I}}-{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{X}}_{T}){\boldsymbol{\beta}}_{T}$ is the orthogonal projection of ${\boldsymbol{\beta}}_{T}$ onto the null space of ${\boldsymbol{X}}_{T}$ , while the second term $-{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}$ is a vector in the row space of ${\boldsymbol{X}}_{T}$ . By the Pythagorean theorem, the squared norm of their sum is equal to the sum of their squared norms, so

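$$\|\boldsymbol{\beta}_{T}-\hat{\boldsymbol{\beta}}_{T}\|^{2}=\|(\boldsymbol{I}-\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{X}_{T})\boldsymbol{\beta}_{T}\|^{2}+\|\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\|^{2}.$$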

We analyze the expected values of these two terms by exploiting properties of the standard normal distribution.

First term.

Note that ${\boldsymbol{\Pi}}_{T}:={\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{X}}_{T}$ is the orthogonal projection matrix for the row space of ${\boldsymbol{X}}_{T}$ . So, by the Pythagorean theorem, we have

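$$\|(\boldsymbol{I}-\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{X}_{T})\boldsymbol{\beta}_{T}\|^{2}=\|\boldsymbol{\beta}_{T}\|^{2}-\|\boldsymbol{\Pi}_{T}\boldsymbol{\beta}_{T}\|^{2}.$$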

By rotational symmetry of the standard normal distribution, it follows that

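$$\mathbb{E}[\|\boldsymbol{\Pi}_{T}\boldsymbol{\beta}_{T}\|^{2}]=\|\boldsymbol{\beta}_{T}\|^{2}\cdot\frac{n}{p}.$$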

Therefore

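$$\mathbb{E}[\|(\boldsymbol{I}-\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{X}_{T})\boldsymbol{\beta}_{T}\|^{2}]=\|\boldsymbol{\beta}_{T}\|^{2}\cdot\left(1-\frac{n}{p}\right).$$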

Second term.

We use the “trace trick” to write

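$$\|\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\|^{2}=\operatorname{tr}\left((\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\boldsymbol{\eta}^{*}\right)=\operatorname{tr}\left((\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\boldsymbol{\eta}^{*}\right)$$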

where the second equality holds almost surely because ${\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*}$ is almost surely invertible. Since ${\boldsymbol{x}}_{T}^{*}{\boldsymbol{\beta}}_{T}$ and ${\boldsymbol{x}}_{{T}^{c}}^{*}{\boldsymbol{\beta}}_{{T}^{c}}+\sigma\epsilon$ are uncorrelated, it follows that

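$$\mathbb{E}[\|\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\|^{2}]=\operatorname{tr}\left(\mathbb{E}[(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}]\,\mathbb{E}[\boldsymbol{\eta}\boldsymbol{\eta}^{*}]\right).$$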

The distribution of ${\boldsymbol{\eta}}$ is normal with mean zero and covariance $(\|{\boldsymbol{\beta}}_{{T}^{c}}\|^{2}+\sigma^{2})\cdot{\boldsymbol{I}}\in{\mathbb{R}}^{n\times n}$ , so

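$$\mathbb{E}[\boldsymbol{\eta}\boldsymbol{\eta}^{*}]=(\|\boldsymbol{\beta}_{T^{c}}\|^{2}+\sigma^{2})\cdot\boldsymbol{I}.$$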

The distribution of ${\boldsymbol{P}}:=({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}$ is inverse-Wishart with identity scale matrix ${\boldsymbol{I}}\in{\mathbb{R}}^{n\times n}$ and $p$ degrees-of-freedom. Each diagonal entry $P_{i,i}$ of ${\boldsymbol{P}}$ , for $i=1,\dotsc,n$ , has a reciprocal that follows the $\chi^{2}$ distribution with $p-n+1$ degrees-of-freedom. Hence ${\mathbb{E}}[P_{i,i}]=1/(p-n-1)$ if $p\geq n+2$ and ${\mathbb{E}}[P_{i,i}]=+\infty$ if $p\in\{n,n+1\}$ . Therefore

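$$\operatorname{tr}\left(\mathbb{E}[(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}]\right)=\begin{cases}\frac{n}{p-n-1}&\text{if }p\geq n+2;\\+\infty&\text{if }p\in\{n,n+1\}.\end{cases}$$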

We conclude that

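$$\mathbb{E}[\|\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\|^{2}]=\begin{cases}(\|\boldsymbol{\beta}_{T^{c}}\|^{2}+\sigma^{2})\cdot\frac{n}{p-n-1}&\text{if }p\geq n+2;\\+\infty&\text{if }p\in\{n,n+1\}.\end{cases}$$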

Combining the first and second terms gives the claimed expression for the risk. ∎
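The two expectations used in the interpolating regime, ${\mathbb{E}}[\|{\boldsymbol{\Pi}}_{T}{\boldsymbol{\beta}}_{T}\|^{2}]=\|{\boldsymbol{\beta}}_{T}\|^{2}\cdot n/p$ and $\operatorname{tr}({\mathbb{E}}[({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}])=n/(p-n-1)$, can be sanity-checked by simulation. The following is a minimal Monte Carlo sketch; the sizes, seed, and number of trials are arbitrary assumptions, and the sample means only approximate the exact values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, trials = 5, 20, 2000               # illustrative sizes with p >= n + 2
beta_T = rng.standard_normal(p)          # arbitrary fixed coefficient vector

proj_fraction = np.empty(trials)
inv_trace = np.empty(trials)
for t in range(trials):
    X_T = rng.standard_normal((n, p))
    K_inv = np.linalg.inv(X_T @ X_T.T)           # (X_T X_T^*)^{-1}, invertible almost surely
    Pi_beta = X_T.T @ (K_inv @ (X_T @ beta_T))   # projection of beta_T onto the row space
    proj_fraction[t] = Pi_beta @ Pi_beta / (beta_T @ beta_T)
    inv_trace[t] = np.trace(K_inv)

print(proj_fraction.mean(), n / p)               # sample mean vs. n / p
print(inv_trace.mean(), n / (p - n - 1))         # sample mean vs. n / (p - n - 1)
```

Moving $p$ towards $n+1$ in this sketch makes the second sample mean unstable, which reflects the infinite expectation at $p\in\{n,n+1\}$.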

2.3 Concentration

We briefly consider the measure concentration of $\|{\boldsymbol{\beta}}-\hat{\boldsymbol{\beta}}\|^{2}$ .

Theorem 2.

Consider the setting from Theorem 1, and fix any $\epsilon\in(0,1)$ . If $\alpha:=p/n<1$ , then

[TABLE]

with probability at least

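$$1-2\exp\left(-\frac{p\epsilon^{4}(\sqrt{\alpha^{-1}}-1)^{2}}{24((2-\epsilon)\sqrt{\alpha^{-1}}+\epsilon)^{2}}\right)-2\exp\left(-\frac{p(1-\epsilon)^{2}(\sqrt{\alpha^{-1}}-1)^{2}}{2}\right)-2p\exp\left(-\frac{p(\alpha^{-1}-1)\epsilon^{2}}{24}\right).$$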

If $\alpha>1$ , then

[TABLE]

with probability at least

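$$1-2\exp\left(-\frac{n\epsilon^{2}}{12}\right)-2\exp\left(-\frac{n\epsilon^{4}(\sqrt{\alpha}-1)^{2}}{24((2-\epsilon)\sqrt{\alpha}+\epsilon)^{2}}\right)-2\exp\left(-\frac{n(1-\epsilon)^{2}(\sqrt{\alpha}-1)^{2}}{2}\right)-2n\exp\left(-\frac{n(\alpha-1)\epsilon^{2}}{24}\right).$$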

The proof is given in Appendix A. The main idea for the $p>n$ case is as follows. From the proof of Theorem 1, we have the decomposition

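$$\|\boldsymbol{\beta}_{T}-\hat{\boldsymbol{\beta}}_{T}\|^{2}=\|(\boldsymbol{I}-\boldsymbol{\Pi}_{T})\boldsymbol{\beta}_{T}\|^{2}+\|\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\|^{2}.$$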

The first term $\|({\boldsymbol{I}}-{\boldsymbol{\Pi}}_{T}){\boldsymbol{\beta}}_{T}\|^{2}$ is the squared distance from ${\boldsymbol{\beta}}_{T}$ to a uniformly random $n$ -dimensional subspace of ${\mathbb{R}}^{p}$ . This squared distance has the same distribution as the squared distance from a uniformly random vector of length $\|{\boldsymbol{\beta}}_{T}\|$ to a fixed $n$ -dimensional subspace of ${\mathbb{R}}^{p}$ . Thus measure concentration on the unit sphere can be used here. The second term $\|{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}\|^{2}$ is a (random) quadratic form in the Gaussian random vector ${\boldsymbol{\eta}}$ . Gaussian concentration is readily applied after controlling the spectral properties of the Wishart random matrix ${\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*}$ . (The $p<n$ case is similar to the analysis of this second term.)

The same arguments can be used to give fixed-level confidence bounds; see Proposition 2 in Appendix B.

Finally, it is also possible to compare $\|{\boldsymbol{\beta}}_{T}\|^{2}$ to $(p/D)\|{\boldsymbol{\beta}}\|^{2}$ (and $\|{\boldsymbol{\beta}}_{T^{c}}\|^{2}$ to $(1-p/D)\|{\boldsymbol{\beta}}\|^{2}$) under the random selection model of $T$ from Corollary 1 using concentration inequalities for sampling without replacement (see, e.g., [BM15]). The following is a simple consequence of Proposition 1.4 of [BM15].

Proposition 1.

For any $t>0$ , with probability at least $1-2e^{-t}$ ,

[TABLE]

where $\mu:=\max_{i\in[D]}|\beta_{i}|/\|{\boldsymbol{\beta}}\|$ .

The proof is in Appendix C. The crucial parameter $\mu$ has range $[1/\sqrt{D},1]$ . It is small when there are many relevant “weak” features, each with a relatively small coefficient in ${\boldsymbol{\beta}}$ ; conversely, it is large when ${\boldsymbol{\beta}}$ is concentrated on a sparse subset of features.
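The two extremes of $\mu$ can be illustrated with a short sketch. The function name `coherence` is a hypothetical label introduced here, and the two coefficient vectors are illustrative choices.

```python
import numpy as np

def coherence(beta):
    """mu = max_i |beta_i| / ||beta||, as defined for Proposition 1."""
    beta = np.asarray(beta, dtype=float)
    return np.max(np.abs(beta)) / np.linalg.norm(beta)

D = 100
mu_dense = coherence(np.ones(D))        # many equally weak features: mu = 1 / sqrt(D)
mu_sparse = coherence(np.eye(D)[0])     # a single relevant feature: mu = 1
print(mu_dense, mu_sparse)
```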

3 Fourier series model

In this section, we consider a noise-free Fourier series model, which can be regarded as a one-dimensional version of the random Fourier features model studied by [RR08] for functions defined on the unit circle.

Let ${\boldsymbol{F}}\in{\mathbb{C}}^{D\times D}$ denote the $D\times D$ discrete Fourier transform matrix: its $(i,j)$ -th entry is

$$F_{i,j}=\frac{1}{\sqrt{D}}\,\omega^{(i-1)(j-1)},$$

where $\omega:=\exp(-2\pi\mathrm{i}/D)$ is a primitive root of unity. Let ${\boldsymbol{\mu}}:={\boldsymbol{F}}{\boldsymbol{\beta}}$ for some ${\boldsymbol{\beta}}\in{\mathbb{C}}^{D}$ . Consider the following observation model:

1. ${S}$ and ${T}$ are independent random subsets of $[D]$. For any $i\in[D]$, the membership of $i$ in ${S}$ (respectively, ${T}$) is determined by an independent Bernoulli variable with mean $\rho_{n}:=n/D$ (respectively, $\rho_{p}:=p/D$).

2. We observe the $n\times p$ design matrix ${\boldsymbol{F}}_{{S},{T}}$ and $n$-dimensional vector of responses ${\boldsymbol{\mu}}_{S}$. Here, ${\boldsymbol{F}}_{{S},{T}}$ is the submatrix of ${\boldsymbol{F}}$ with rows from ${S}$ and columns from ${T}$, and ${\boldsymbol{\mu}}_{S}$ is the subvector of ${\boldsymbol{\mu}}$ of entries from ${S}$.

We fit regression coefficients $\hat{\boldsymbol{\beta}}=(\hat{\beta}_{1},\dotsc,\hat{\beta}_{D})$ with

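$$\hat{\boldsymbol{\beta}}_{T}:=\boldsymbol{F}_{S,T}^{\dagger}\boldsymbol{\mu}_{S},\qquad\hat{\boldsymbol{\beta}}_{T^{c}}:=\boldsymbol{0}.$$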

One important property of the discrete Fourier transform matrix that we use is that the matrix ${\boldsymbol{F}}_{A,B}$ has rank $\min\{|A|,|B|\}$ for any $A,B\subseteq[D]$ . This is a consequence of the fact that ${\boldsymbol{F}}$ is Vandermonde. Thus, we have

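$$\boldsymbol{F}_{S,T}^{\dagger}=\begin{cases}\boldsymbol{F}_{S,T}^{*}(\boldsymbol{F}_{S,T}\boldsymbol{F}_{S,T}^{*})^{-1}&\text{if }|T|\geq|S|;\\(\boldsymbol{F}_{S,T}^{*}\boldsymbol{F}_{S,T})^{-1}\boldsymbol{F}_{S,T}^{*}&\text{if }|T|\leq|S|.\end{cases}$$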
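A small numerical sketch of these facts follows, under stated assumptions: the DFT matrix is taken with the unitary normalization $1/\sqrt{D}$, the size $D=11$ is a small prime chosen only for illustration, and the index sets `S` and `T` are hypothetical (0-based). It compares the explicit pseudoinverse formula for $|T|\geq|S|$ with `np.linalg.pinv`, and checks that the rows of the submatrices on ${T}$ and ${T}^{c}$ together remain orthonormal.

```python
import numpy as np

D = 11                                   # small prime size, chosen only for illustration
idx = np.arange(D)
omega = np.exp(-2j * np.pi / D)
F = omega ** np.outer(idx, idx) / np.sqrt(D)     # unitary DFT matrix (0-based indices)

S = np.array([0, 2, 5])                  # hypothetical row subset, |S| = 3
T = np.array([1, 3, 4, 7, 8])            # hypothetical column subset, |T| = 5 >= |S|
Tc = np.setdiff1d(idx, T)                # complement of T

F_ST, F_STc = F[np.ix_(S, T)], F[np.ix_(S, Tc)]
# Explicit formula for the pseudoinverse when |T| >= |S| (full row rank).
explicit_pinv = F_ST.conj().T @ np.linalg.inv(F_ST @ F_ST.conj().T)

pinv_gap = np.abs(np.linalg.pinv(F_ST) - explicit_pinv).max()
identity_gap = np.abs(
    F_ST @ F_ST.conj().T + F_STc @ F_STc.conj().T - np.eye(len(S))
).max()
print(pinv_gap, identity_gap)            # both numerically zero
```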

In the remainder of this section, we analyze the risk of $\hat{\boldsymbol{\beta}}$ under a random model for ${\boldsymbol{\beta}}$ , where

$$\mathbb{E}[\boldsymbol{\beta}\boldsymbol{\beta}^{*}]=\frac{1}{D}\cdot\boldsymbol{I}$$

(which implies ${\mathbb{E}}[\|{\boldsymbol{\beta}}\|^{2}]=1$ ). The random choice of ${\boldsymbol{\beta}}$ is independent of ${S}$ and ${T}$ . Considering the risk under this random model for ${\boldsymbol{\beta}}$ is a form of average-case analysis. For simplicity, we only consider the regime where $\rho_{p}>\rho_{n}$ .

Following the arguments from Section 2.1, we have

[TABLE]

Now we take (conditional) expectations with respect to ${\boldsymbol{\beta}}$ , given ${S}$ and ${T}$ :

[TABLE]

Since ${\boldsymbol{F}}_{{S},{T}}$ has rank $\min\{|{S}|,|{T}|\}$ , the first trace expression is equal to

[TABLE]

For the second trace expression, we use the explicit formula for ${\boldsymbol{F}}_{{S},{T}}^{\dagger}$ and the fact that ${\boldsymbol{F}}_{{S},{T}}{\boldsymbol{F}}_{{S},{T}}^{*}+{\boldsymbol{F}}_{{S},{T}^{c}}{\boldsymbol{F}}_{{S},{T}^{c}}^{*}={\boldsymbol{I}}$ to obtain

[TABLE]

where the $\lambda_{i}\in[0,1]$ are the eigenvalues of ${\boldsymbol{F}}_{{S},{T}^{c}}{\boldsymbol{F}}_{{S},{T}^{c}}^{*}$ . Therefore, from Equation 1, we have

[TABLE]

To determine the asymptotic behavior of $(*)$ , we use a recent result of [Far11]:

[TABLE]

as $D,n,p\to\infty$ with $\rho_{n}=n/D$ and $\rho_{p}=p/D$ held fixed. Further, under this limit, we have

[TABLE]

since $\rho_{p}\geq\rho_{n}$ . Hence we have the following:

Theorem 3.

Assume the setting as above, with $D,n,p\to\infty$ and $\rho_{n}=n/D$ and $\rho_{p}=p/D$ held fixed. Then

[TABLE]

Note that the right-hand side in the equation from Theorem 3 is well-defined in the limit because the ratios $\rho_{n},\rho_{p}$ are fixed. It diverges to $+\infty$ when $\rho_{p}$ is close to $\rho_{n}$ , and decreases as $\rho_{p}$ approaches $1$ . This is the same behavior as in the Gaussian model from Section 2 with random feature selection; we depict a non-asymptotic instantiation of it in Figure 3.

4 Discussion

Our analysis shows that when features are chosen in an uninformed manner, it may be optimal to choose as many as possible—even more than the number of data—rather than limit the number to that which balances bias and variance as suggested by classical analyses. This choice is simple, both conceptually and algorithmically (although it may incur a computational penalty for processing large numbers of parameters), and avoids the need for precise control of regularization parameters. It is reflective of the practice in modern machine learning applications like image and speech recognition, where signal processing-based features are individually weak but in great abundance, and models that use all of the features, notably neural networks, are highly successful. This stands in contrast to the “scientific” scenarios with informed selection of features; for example, in many science and medical applications, features are purposefully chosen based on the detailed understanding of the underlying phenomena. As illustrated by the “prescient” model that selects the best features, in that case choosing the number of features to balance bias and variance can be better than incurring the costs that come with using all of the features.

Finally, we remark that there appears to be a sharp divide between the classical analyses of statistics and machine learning in $p<n$ regimes and the modern “weak but plentiful features” interpolating settings. While the former are deeply explored, an understanding of the latter is only starting to emerge. It is clear that the best practices for model and feature selection depend crucially on the regime of the application.

Acknowledgements

We thank the anonymous referees for their remarks and suggestions (which, in particular, led to the inclusion of Section 2.3). This work was carried out in part while MB was at The Ohio State University. This research was supported by NSF CCF-1740833 and IIS-1815697 awards, a Sloan Research Fellowship, a Google Faculty Award, and a Cheung-Kong Graduate School of Business Fellowship.

Appendix A Proof of Theorem 2

We first consider $p>n$ (i.e., $\alpha>1$ ). From the proof of Theorem 1, we have the decomposition

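$$\|\boldsymbol{\beta}_{T}-\hat{\boldsymbol{\beta}}_{T}\|^{2}=\|(\boldsymbol{I}-\boldsymbol{\Pi}_{T})\boldsymbol{\beta}_{T}\|^{2}+\|\boldsymbol{X}_{T}^{*}(\boldsymbol{X}_{T}\boldsymbol{X}_{T}^{*})^{\dagger}\boldsymbol{\eta}\|^{2},$$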

where ${\boldsymbol{\Pi}}_{T}$ is the orthogonal projection matrix for the row space of ${\boldsymbol{X}}_{T}$ , and ${\boldsymbol{\eta}}$ is normal with mean zero and covariance $(\|{\boldsymbol{\beta}}_{T^{c}}\|^{2}+\sigma^{2}){\boldsymbol{I}}$ and independent of ${\boldsymbol{X}}_{T}$ . By symmetry of the standard normal distribution, the first term $\|({\boldsymbol{I}}-{\boldsymbol{\Pi}}_{T}){\boldsymbol{\beta}}_{T}\|^{2}$ is the squared distance from ${\boldsymbol{\beta}}_{T}$ to a uniformly random $n$ -dimensional subspace of ${\mathbb{R}}^{p}$ . This squared distance has the same distribution as the squared distance from a uniformly random vector of length $\|{\boldsymbol{\beta}}_{T}\|$ to a fixed $n$ -dimensional subspace of ${\mathbb{R}}^{p}$ . This argument was also used by [DG03] in their proof of the Johnson-Lindenstrauss lemma. By Lemma 2.2 from [DG03], we have for any $\epsilon\in(0,1)$ ,

[TABLE]

The second term $\|{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}\|^{2}$ is a (random) quadratic form in ${\boldsymbol{\eta}}$ . Let ${\boldsymbol{K}}_{T}:={\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*}$ , which is non-singular almost surely. By Lemma 4 from [Das00], we have for any $\epsilon\in(0,1)$ ,

[TABLE]

where $\kappa({\boldsymbol{X}}_{T})=\sigma_{\max}({\boldsymbol{X}}_{T})/\sigma_{\min}({\boldsymbol{X}}_{T})$ is the ratio of the largest singular value of ${\boldsymbol{X}}_{T}$ to the smallest singular value of ${\boldsymbol{X}}_{T}$ . For any $t>0$ ,

[TABLE]

These inequalities follow from Gaussian comparison inequalities and concentration of measure on the sphere and in Gaussian space [see, e.g., RV09, Ver18]. Therefore, for $p>(1+t)^{2}n$ ,

[TABLE]

Finally, observe that $1/({\boldsymbol{K}}_{T}^{-1})_{i,i}$ has a $\chi^{2}$ -distribution with $p-n+1$ degrees of freedom. Therefore, again using Lemma 4 from [Das00] and a union bound, we have for any $\epsilon\in(0,1)$ ,

[TABLE]

Putting these probability inequalities together (with $t=(1-\epsilon)(\sqrt{\alpha}-1)$ ) completes the proof for $p>n$ .
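The distributional claim about $1/({\boldsymbol{K}}_{T}^{-1})_{i,i}$ can be checked empirically. The sketch below is a hedged illustration, not part of the proof: it draws small Gaussian matrices ${\boldsymbol{X}}_{T}$ of shape $n\times p$ , forms ${\boldsymbol{K}}_{T}={\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*}$ , and compares the sample mean of $1/({\boldsymbol{K}}_{T}^{-1})_{1,1}$ with $p-n+1$ , the mean of a $\chi^{2}$ -distribution with that many degrees of freedom. The helper names, the sizes $n=3$ and $p=8$ , and the trial count are assumptions chosen for illustration.

```python
import random


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination
    with partial pivoting."""
    m = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(m)]
           for i, row in enumerate(a)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(m):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[m:] for row in aug]


def mean_inverse_diag_reciprocal(n, p, trials=20000, seed=0):
    """Monte Carlo mean of 1 / (K^{-1})_{1,1}, where K = X X^T and X is an
    n x p matrix with i.i.d. standard normal entries (p > n)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
        k = [[sum(u * v for u, v in zip(x[i], x[j])) for j in range(n)]
             for i in range(n)]
        total += 1.0 / invert(k)[0][0]
    return total / trials


# Illustrative (assumed) sizes; a chi-squared variable with p - n + 1
# degrees of freedom has mean p - n + 1.
n, p = 3, 8
estimate = mean_inverse_diag_reciprocal(n, p)
expected = p - n + 1
print(estimate, expected)
```

Matching the mean is a necessary consequence of the claim rather than a full check of the distribution. Comparing higher moments or a histogram would be a natural extension.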

Now we consider $p<n$ (i.e., $\alpha<1$ ). We have

[TABLE]

The matrix ${\boldsymbol{X}}_{T}^{*}{\boldsymbol{X}}_{T}$ is non-singular almost surely, so $\|\hat{\boldsymbol{\beta}}_{T}-{\boldsymbol{\beta}}\|^{2}={\boldsymbol{\eta}}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}={\boldsymbol{\eta}}^{*}{\boldsymbol{K}}_{T}^{\dagger}{\boldsymbol{\eta}}$ also holds almost surely. Note that ${\boldsymbol{K}}_{T}$ has the same eigenvalues as ${\boldsymbol{X}}_{T}^{*}{\boldsymbol{X}}_{T}$ , and hence ${\boldsymbol{K}}_{T}^{\dagger}$ has the same eigenvalues as $({\boldsymbol{X}}_{T}^{*}{\boldsymbol{X}}_{T})^{-1}$ . Therefore, following essentially the same arguments as above for handling $\|{\boldsymbol{X}}_{T}^{*}({\boldsymbol{X}}_{T}{\boldsymbol{X}}_{T}^{*})^{\dagger}{\boldsymbol{\eta}}\|^{2}$ (but switching the roles of $p$ and $n$ , and hence replacing $\alpha$ with $\alpha^{-1}$ ) completes the proof for $p<n$ . ∎

Appendix B Confidence bounds

Fixed-level confidence bounds can be immediately derived from the probability inequalities in Appendix A.

Proposition 2.

Consider the setting from Theorem 1 and fix any $\delta\in(0,1)$ . If $p<n$ , then with probability at least $1-\delta$ ,

[TABLE]

If $p>n$ , then with probability at least $1-\delta$ ,

[TABLE]

In the expressions above, we assume $n$ and $p$ are large enough (perhaps in relation to each other) so that all denominators are positive.

Appendix C Proof of Proposition 1

Let $X_{1},\dotsc,X_{p}$ denote a random sample of cardinality $p$ from the finite population $(\beta_{1}^{2},\dotsc,\beta_{D}^{2})$ , drawn without replacement, so that $\|{\boldsymbol{\beta}}_{T}\|^{2}=\sum_{j=1}^{p}X_{j}$ . Since $\|{\boldsymbol{\beta}}_{T^{c}}\|^{2}=\|{\boldsymbol{\beta}}\|^{2}-\|{\boldsymbol{\beta}}_{T}\|^{2}$ , we have

[TABLE]

Observe that the finite population $(\beta_{1}^{2},\dotsc,\beta_{D}^{2})$ has mean $\tfrac{1}{D}\|{\boldsymbol{\beta}}\|^{2}$ , variance $\tfrac{1}{D}\sum_{j=1}^{D}\beta_{j}^{4}-(\tfrac{1}{D}\sum_{j=1}^{D}\beta_{j}^{2})^{2}\leq\tfrac{1}{D}\|{\boldsymbol{\beta}}\|^{4}\mu^{2}-(\tfrac{1}{D}\|{\boldsymbol{\beta}}\|^{2})^{2}=\tfrac{1}{D}\|{\boldsymbol{\beta}}\|^{4}(\mu^{2}-\tfrac{1}{D})$ , and range $\max_{j\in[D]}\beta_{j}^{2}=\|{\boldsymbol{\beta}}\|^{2}\mu^{2}$ . Therefore, Proposition 1.4 of [BM15] and a union bound imply that, with probability at least $1-2e^{-t}$ ,

[TABLE]

If $p/D$ is more than $1/2$ , then we can replace $p/D$ by $1-p/D$ on the right-hand side by analogously applying the previous argument to the random sample of cardinality $D-p$ that determines ${\boldsymbol{\beta}}_{T^{c}}$ . ∎
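The sampling-without-replacement view used in this proof can be illustrated with a short simulation. The sketch below checks only the mean behaviour that follows from the population mean stated above, namely that $\|{\boldsymbol{\beta}}_{T}\|^{2}$ averages to $\tfrac{p}{D}\|{\boldsymbol{\beta}}\|^{2}$ when $T$ is a uniformly random subset of cardinality $p$ . It does not reproduce the high-probability bound. The coefficient vector $\beta_{j}=1/j$ , the sizes $D=50$ and $p=10$ , and the function name are hypothetical choices made for illustration.

```python
import random


def mean_selected_mass(beta, p, trials=20000, seed=0):
    """Monte Carlo mean of ||beta_T||^2 when T is a uniformly random
    subset of p coordinates (sampling without replacement)."""
    rng = random.Random(seed)
    squares = [b * b for b in beta]
    total = 0.0
    for _ in range(trials):
        # random.Random.sample draws p distinct population elements.
        total += sum(rng.sample(squares, p))
    return total / trials


# Hypothetical coefficient vector with decaying entries.
D, p = 50, 10
beta = [1.0 / j for j in range(1, D + 1)]
norm_sq = sum(b * b for b in beta)

estimate = mean_selected_mass(beta, p)
expected = (p / D) * norm_sq
print(estimate, expected)
```

Because $\|{\boldsymbol{\beta}}_{T^{c}}\|^{2}=\|{\boldsymbol{\beta}}\|^{2}-\|{\boldsymbol{\beta}}_{T}\|^{2}$ , the same run also gives an estimate of the mean of $\|{\boldsymbol{\beta}}_{T^{c}}\|^{2}$ as `norm_sq - estimate`.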

Bibliography

[AS17] Madhu S Advani and Andrew M Saxe “High-dimensional dynamics of generalization error in neural networks” In arXiv preprint arXiv:1710.03667, 2017
[Bar+20] Peter L Bartlett, Philip M Long, Gábor Lugosi and Alexander Tsigler “Benign overfitting in linear regression” In Proceedings of the National Academy of Sciences National Acad Sciences, 2020
[Bel+19] Mikhail Belkin, Daniel Hsu, Siyuan Ma and Soumik Mandal “Reconciling modern machine learning practice and the bias-variance trade-off” In Proceedings of the National Academy of Sciences 116.32, 2019, pp. 15849–15854
[BF83] Leo Breiman and David Freedman “How many variables should be entered in a regression equation?” In Journal of the American Statistical Association 78.381 Taylor & Francis Group, 1983, pp. 131–136
[BHX19] Mikhail Belkin, Daniel Hsu and Ji Xu “Two models of double descent for weak features” In arXiv preprint arXiv:1903.07571v1, 2019
[BM15] Rémi Bardenet and Odalric-Ambrym Maillard “Concentration inequalities for sampling without replacement” In Bernoulli 21.3 Bernoulli Society for Mathematical Statistics and Probability, 2015, pp. 1361–1385
[BO98] Siegfried Bös and Manfred Opper “Dynamics of batch training in a perceptron” In Journal of Physics A: Mathematical and General 31.21 IOP Publishing, 1998, pp. 4835
[Das00] Sanjoy Dasgupta “Learning probability distributions”, 2000
