Two models of double descent for weak features
Mikhail Belkin, Daniel Hsu, Ji Xu

TL;DR
This paper gives a precise mathematical analysis of the double descent risk curve in two simple data models, showing that the prediction risk peaks when the number of features is close to the sample size and then decreases as more features are added, in contrast with “prescient” models that select features in an a priori optimal order.
Contribution
It introduces two models of double descent, offering a precise mathematical understanding of the risk curve in least squares/least norm predictors.
Findings
Risk peaks when the number of features is near the sample size
Risk decreases as the number of features grows beyond the sample size
This behavior contrasts with prescient feature selection models
Abstract
The "double descent" risk curve was proposed to qualitatively describe the out-of-sample prediction accuracy of variably-parameterized machine learning models. This article provides a precise mathematical analysis for the shape of this curve in two simple data models with the least squares/least norm predictor. Specifically, it is shown that the risk peaks when the number of features $p$ is close to the sample size $n$, but also that the risk decreases towards its minimum as $p$ increases beyond $n$. This behavior is contrasted with that of "prescient" models that select features in an a priori optimal order.
Two models of double descent for weak features
Mikhail Belkin
Halıcıoğlu Data Science Institute, UC San Diego, La Jolla, CA
Daniel Hsu
Department of Computer Science, Columbia University, New York, NY
Data Science Institute, Columbia University, New York, NY
Ji Xu
Department of Computer Science, Columbia University, New York, NY
1 Introduction
The “double descent” risk curve was proposed by [Bel+19] as a general way to qualitatively describe the out-of-sample prediction performance of variably-parameterized machine learning models. This risk curve reconciles the classical bias-variance trade-off with the behavior of predictive models that interpolate training data, as observed for several model families (including neural networks) in a wide variety of applications (see Section 1.1 for references). In these studies, a predictive model with $p$ parameters is fit to a training sample of size $n$, and the test risk (i.e., out-of-sample error) is examined as a function of $p$. When $p$ is below the sample size $n$ (for regression or binary classification), the test risk is governed by the usual bias-variance decomposition. As $p$ is increased towards $n$, the training risk (i.e., in-sample error) is driven to zero, but the test risk shoots up, sometimes toward infinity. The classical bias-variance analysis identifies a “sweet spot” value of $p \in [0, n]$ at which the bias and variance are balanced to achieve low test risk. However, in the “modern regime”, as $p$ grows beyond $n$, the training risk remains zero, but the test risk decreases again, even when fitting noisy data, provided that the model is fit using a suitable inductive bias (e.g., least norm solution). In many (but not all) cases from [Bel+19], the limiting risk as $p \to \infty$ is lower than what is achieved at the “sweet spot” value of $p$.
In this article, we show that key aspects of the “double descent” risk curve can be observed with the least squares/least norm predictor in two simple random features models. The first is a Gaussian model studied by [BF83] in the classical regime, while the second is a Fourier series model for functions on the circle. In both cases, we prove that the risk is infinite around $p = n$, and decreases again as $p$ increases beyond $n$. When the signal-to-noise ratio is high, the minimum risk is, in fact, achieved in the modern regime, when $p > n$. Our results provide a precise mathematical analysis in a simple and tractable setting of the mechanism that was qualitatively described by [Bel+19]. In particular, it captures a key aspect of many practical over-parameterized models: that increasing the number of parameters to the maximum can lead to better performance. We also establish some non-asymptotic concentration phenomena in the Gaussian model.
We note that in both of the models, the features are selected randomly, which makes them useful for studying scenarios where features are plentiful but individually too “weak” to be selected in an informed manner. Such scenarios are commonplace in machine learning practice, and they should be contrasted with “scientific” scenarios where features are carefully designed or curated, as is often the case in scientific applications. For comparison, we give an example of “prescient” feature selection, where the features a priori known to be most useful are included in the model. In this case, the optimal test risk is achieved at some $p < n$, which is consistent with the classical analysis of [BF83].
1.1 Related and concurrent works
The “double descent” risk curve was posited by [Bel+19] to connect the classical bias-variance trade-off to behaviors observed in over-parameterized regimes for a variety of machine learning models. The shape and features of the risk curve itself appear throughout the literature in a number of contexts [e.g., Vallet et al. 1989; Opper et al. 1990; Le Cun et al. 1991; Krogh and Hertz 1992; Watkin et al. 1993; BO98; AS17]; see also [Loo+20] for a “brief prehistory” that focuses on the curious peak in the curve. These prior works analyze the risk of linear classification and regression models and neural networks in high-dimensional asymptotic regimes. Our analysis in the Gaussian model gives an exact expression for the risk for any finite sample size and number of parameters.
More recently, [Nea+18] observe that similar phenomena in neural networks can be explained by a variance reduction effect of increasing network width. The transition from under- to over-parameterized regimes was recently analyzed by [Spi+18] by drawing a connection to the physical phenomenon of “jamming” in a class of glassy systems. Our analysis makes these ideas concrete and explicit in the context of simple regression models. For instance, our analysis captures the transition from under- to over-parameterized regimes at a point where an inverse Wishart random matrix has no finite expectation. It also allows us to compare the risks at any points in the curve and explain how the risk in the over-parameterized regime can be lower than any risk in the under-parameterized regime.
The initial version of this article [BHX19] appeared concurrently with the works of [Has+19], [Mut+20], and [Bar+20], all of which also study the behavior of the least squares/least norm predictor in over-parameterized linear regression. [Mut+20] focus on the well-specified scenario (essentially, $p = D$) and provide upper-bounds on the risk that go to zero as $p \to \infty$. (A related variance analysis was carried out by [Nea+18].) [Has+19] provide a much broader range of analyses in the high-dimensional asymptotic regime, including a “misspecified” setup that is related to ours. Their analyses require weaker distributional assumptions than ours, owing to their reliance on asymptotic analysis. (A special case of the results in the follow-up work by [XH19] further broadens the range of analyses to allow highly non-isotropic designs, but again only in the high-dimensional asymptotic regime.) The analysis of [Has+19] also considers the effect of ridge regularization; in particular, they show that when the optimal level of regularization is used, the risk curve no longer shows the “double descent” shape. Finally, [Bar+20] study non-asymptotic upper and lower bounds on the risk in the over-parameterized regime, and provide a characterization in terms of certain “effective dimensions” based on the tail of the eigenvalue sequence of the covariance operator.
2 Gaussian model
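We consider a regression problem where the response $y$ is equal to a linear function $\beta = (\beta_1, \dots, \beta_D) \in \mathbb{R}^D$ of $D$ real-valued variables $x = (x_1, \dots, x_D)$ plus noise $\sigma\epsilon$:
$$y = x^\top \beta + \sigma\epsilon = \sum_{j=1}^{D} x_j \beta_j + \sigma\epsilon.$$
Given $n$ iid copies $((x^{(i)}, y^{(i)}))_{i=1}^{n}$ of $(x, y)$, we fit a linear model to the data only using a subset $T \subseteq [D] := \{1, \dots, D\}$ of $p := |T|$ variables.
Let $X := [x^{(1)} | \cdots | x^{(n)}]^\top$ be the $n \times D$ design matrix, and let $\mathbf{y} := (y^{(1)}, \dots, y^{(n)})^\top$ be the vector of responses. For a subset $A \subseteq [D]$ and a $D$-dimensional vector $v$, we use $v_A := (v_j : j \in A)$ to denote its $|A|$-dimensional subvector of entries from $A$; we also use $X_A := [x_A^{(1)} | \cdots | x_A^{(n)}]^\top$ to denote the $n \times |A|$ design matrix with variables from $A$. For $A \subseteq [D]$, we denote its complement by $A^c := [D] \setminus A$. Finally, $\|\cdot\|$ denotes the Euclidean norm.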
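We fit regression coefficients $\hat\beta = (\hat\beta_1, \dots, \hat\beta_D)$ with
$$\hat\beta_T := X_T^{\dagger} \mathbf{y}, \qquad \hat\beta_{T^c} := 0.$$
Above, the symbol $\dagger$ denotes the Moore-Penrose pseudoinverse. In other words, we use the solution to the normal equations of least norm for $\hat\beta_T$ and force $\hat\beta_{T^c}$ to all-zeros.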
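As an illustration of the estimator just defined, the following is a minimal NumPy sketch, not the authors' code. The helper name `fit_least_norm`, the problem sizes, the noise level, and the random seed are all assumptions chosen only for the example.

```python
import numpy as np

def fit_least_norm(X, y, T):
    """Least squares/least norm fit on the columns in T; zeros elsewhere."""
    D = X.shape[1]
    beta_hat = np.zeros(D)
    # The pseudoinverse gives the least squares solution when p <= n and the
    # least norm interpolating solution when p > n.
    beta_hat[T] = np.linalg.pinv(X[:, T]) @ y
    return beta_hat

rng = np.random.default_rng(0)
n, D, p = 20, 50, 30                      # illustrative sizes (assumed), p > n
X = rng.standard_normal((n, D))           # standard normal design
beta = rng.standard_normal(D) / np.sqrt(D)
y = X @ beta + 0.1 * rng.standard_normal(n)
T = rng.choice(D, size=p, replace=False)  # a subset of p variables
beta_hat = fit_least_norm(X, y, T)

# With p > n the fitted model interpolates the training data.
train_residual = np.linalg.norm(X[:, T] @ beta_hat[T] - y)
```

Because $p > n$ in this example, the training residual is zero up to floating-point error, which is the interpolating regime discussed in the text.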
In this section, our analysis assumes a model in which $x$ follows a standard multivariate Gaussian distribution. This Gaussian model was also studied by [BF83], although their analysis is restricted to the case where the number of variables used $p$ is always at most $n-1$; our analysis will also consider the $p \geq n$ regime.
2.1 Prediction risk
We derive a formula for the (prediction) risk of $\hat\beta$ for an arbitrary choice of features $T$, and then examine this risk under particular selection models for $T$.
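Theorem 1.
Assume the distribution of $x$ is the standard normal in $\mathbb{R}^D$, $\epsilon$ is a standard normal random variable independent of $x$, and $y = x^\top\beta + \sigma\epsilon$ for some $\beta \in \mathbb{R}^D$ and $\sigma > 0$. Pick any $p \in \{0, \dots, D\}$ and $T \subseteq [D]$ of cardinality $p$. The risk of $\hat\beta$, where $\hat\beta_T = X_T^{\dagger}\mathbf{y}$ and $\hat\beta_{T^c} = 0$, is
$$\mathbb{E}[(y - x^\top\hat\beta)^2] = \begin{cases} (\|\beta_{T^c}\|^2 + \sigma^2)\left(1 + \dfrac{p}{n-p-1}\right) & \text{if } p \leq n-2; \\ +\infty & \text{if } n-1 \leq p \leq n+1; \\ \|\beta_T\|^2\left(1 - \dfrac{n}{p}\right) + (\|\beta_{T^c}\|^2 + \sigma^2)\left(1 + \dfrac{n}{p-n-1}\right) & \text{if } p \geq n+2. \end{cases}$$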
The proof of Theorem 1 is not hard; we give the details in Section 2.2. We now turn to the risk of $\hat\beta$ under a random selection model for $T$.
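Corollary 1.
Let $T$ be a uniformly random subset of $[D]$ of cardinality $p$. In the setting of Theorem 1, the risk of $\hat\beta$ (taking expectation with respect to the random choice of $T$ in addition to the random design matrix and response vector) satisfies
$$\mathbb{E}[(y - x^\top\hat\beta)^2] = \begin{cases} \left(\left(1 - \dfrac{p}{D}\right)\|\beta\|^2 + \sigma^2\right)\left(1 + \dfrac{p}{n-p-1}\right) & \text{if } p \leq n-2; \\ \|\beta\|^2\left(1 - \dfrac{n}{D}\left(2 - \dfrac{D-n-1}{p-n-1}\right)\right) + \sigma^2\left(1 + \dfrac{n}{p-n-1}\right) & \text{if } p \geq n+2. \end{cases}$$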
Proof.
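Since $T$ is a uniformly random subset of $[D]$ of cardinality $p$,
$$\mathbb{E}[\|\beta_T\|^2] = \frac{p}{D}\|\beta\|^2 \quad\text{and}\quad \mathbb{E}[\|\beta_{T^c}\|^2] = \left(1 - \frac{p}{D}\right)\|\beta\|^2.$$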
Plugging into Theorem 1 completes the proof. ∎
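A Monte Carlo sanity check of the over-parameterized case can be sketched as follows. This is an illustrative simulation under stated assumptions, not part of the paper: the sizes, seed, noise level, and number of trials are arbitrary, and `predicted` uses the $p \geq n+2$ expression $\|\beta\|^2(1 - \tfrac{n}{D}(2 - \tfrac{D-n-1}{p-n-1})) + \sigma^2(1 + \tfrac{n}{p-n-1})$ with $\|\beta\|^2 = 1$, which is obtained by plugging the expectations above into the formula $\|\beta_T\|^2(1 - n/p) + (\|\beta_{T^c}\|^2 + \sigma^2)(1 + n/(p-n-1))$ derived in Section 2.2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, p, sigma = 20, 60, 50, 0.5          # illustrative sizes (assumed), p >= n + 2
beta = rng.standard_normal(D)
beta /= np.linalg.norm(beta)              # normalize so that ||beta||^2 = 1
trials = 2000

risks = np.empty(trials)
for i in range(trials):
    X = rng.standard_normal((n, D))
    y = X @ beta + sigma * rng.standard_normal(n)
    T = rng.choice(D, size=p, replace=False)   # uniformly random subset of size p
    beta_hat = np.zeros(D)
    beta_hat[T] = np.linalg.pinv(X[:, T]) @ y
    # For isotropic x, the risk given beta_hat is sigma^2 + ||beta - beta_hat||^2.
    risks[i] = sigma**2 + np.sum((beta - beta_hat) ** 2)

simulated = risks.mean()
predicted = (1 - (n / D) * (2 - (D - n - 1) / (p - n - 1))) \
    + sigma**2 * (1 + n / (p - n - 1))
```

The two numbers should agree up to Monte Carlo error; values of $p$ close to $n$ are avoided here because the risk has heavy tails near the interpolation threshold.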
Thus, assuming $D \geq n+2$, we observe that the risk first increases with $p$ up to the “interpolation threshold” ($p = n$), after which the risk decreases with $p$. Moreover, when the signal-to-noise ratio $\|\beta\|^2/\sigma^2$ is larger than $D/(D-n-1)$, the risk is smallest at $p = D$; in particular, it is smaller than the risk at any $p \leq n$. This is the “double descent” risk curve where the first “descent” is degenerate (i.e., the “sweet spot” that balances bias and variance is at $p = 0$). See Figure 1 for an illustration.
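The shape described above can be traced numerically. The sketch below evaluates the random-selection risk as a function of $p$; the function name `risk_random_selection` and the parameter values are assumptions for illustration, and the two finite branches are the Corollary 1 expressions as they follow from Theorem 1 with $\mathbb{E}\|\beta_T\|^2 = (p/D)\|\beta\|^2$.

```python
import math

def risk_random_selection(p, n, D, beta_sq, sigma_sq):
    """Expected risk under uniformly random selection of p out of D features."""
    if p <= n - 2:
        return ((1 - p / D) * beta_sq + sigma_sq) * (1 + p / (n - p - 1))
    if p >= n + 2:
        signal = beta_sq * (1 - (n / D) * (2 - (D - n - 1) / (p - n - 1)))
        noise = sigma_sq * (1 + n / (p - n - 1))
        return signal + noise
    return math.inf  # n - 1 <= p <= n + 1: the risk is infinite

n, D = 40, 100                          # illustrative sizes (assumed)
beta_sq, sigma_sq = 1.0, 1.0 / 25.0     # signal-to-noise ratio of 25 (assumed)
curve = [risk_random_selection(p, n, D, beta_sq, sigma_sq) for p in range(D + 1)]
```

With these values the curve rises from $p = 0$ to the threshold, is infinite for $p \in \{n-1, n, n+1\}$, and then falls monotonically, ending at $p = D$ below its starting value.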
It is worth pointing out that the behavior under the random selection model of $T$ can be very different from that under a deterministic model of $T$. Consider including variables in $T$ by decreasing order of $\beta_j^2$, a kind of “prescient” selection model studied by [BF83]. The behavior of the risk as a function of $p$, illustrated in Figure 2, reveals a striking difference between the random selection model and the “prescient” selection model.
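To make the contrast concrete, the sketch below evaluates the fixed-subset risk (the $p \leq n-2$ and $p \geq n+2$ expressions from Section 2.2) when features are added in decreasing order of $\beta_j^2$. The coefficient profile $\beta_j \propto 1/j$, the sizes, and the helper name `risk_fixed_subset` are assumptions for illustration only; they are not the setting of the paper's Figure 2.

```python
import numpy as np

def risk_fixed_subset(beta, T, n, sigma_sq):
    """Risk of the least squares/least norm fit for a fixed index set T."""
    p = len(T)
    in_sq = float(np.sum(beta[T] ** 2))          # ||beta_T||^2
    out_sq = float(np.sum(beta ** 2)) - in_sq    # ||beta_{T^c}||^2
    if p <= n - 2:
        return (out_sq + sigma_sq) * (1 + p / (n - p - 1))
    if p >= n + 2:
        return in_sq * (1 - n / p) + (out_sq + sigma_sq) * (1 + n / (p - n - 1))
    return float("inf")

n, D, sigma_sq = 40, 100, 1.0 / 25.0     # illustrative sizes (assumed)
beta = 1.0 / np.arange(1, D + 1)         # assumed decaying coefficients
beta /= np.linalg.norm(beta)             # normalize so that ||beta||^2 = 1
order = np.argsort(-beta ** 2)           # prescient order: largest beta_j^2 first
prescient = [risk_fixed_subset(beta, order[:p], n, sigma_sq) for p in range(D + 1)]
best_p = int(np.argmin(prescient))
```

For this assumed coefficient profile the minimizing `best_p` lies in the classical regime, below $n$, and the risk there is lower than the risk of using all $D$ features.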
2.2 Proof of Theorem 1
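Recall that $x$ is assumed to follow a standard normal distribution in $\mathbb{R}^D$. Since $x$ is isotropic (i.e., zero mean and identity covariance), the mean squared prediction error of any $\beta' \in \mathbb{R}^D$ can be written as
$$\mathbb{E}[(y - x^\top\beta')^2] = \sigma^2 + \mathbb{E}[(x^\top(\beta - \beta'))^2] = \sigma^2 + \|\beta - \beta'\|^2.$$
Since $\hat\beta_{T^c} = 0$, it follows that the risk of $\hat\beta$ is
$$\mathbb{E}[(y - x^\top\hat\beta)^2] = \sigma^2 + \mathbb{E}[\|\beta - \hat\beta\|^2] = \sigma^2 + \|\beta_{T^c}\|^2 + \mathbb{E}[\|\beta_T - \hat\beta_T\|^2].$$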
Classical regime.
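The risk of $\hat\beta$ was computed by [BF83] in the regime where $p \leq n-1$:
$$\mathbb{E}[(y - x^\top\hat\beta)^2] = \begin{cases} (\|\beta_{T^c}\|^2 + \sigma^2)\left(1 + \dfrac{p}{n-p-1}\right) & \text{if } p \leq n-2; \\ +\infty & \text{if } p = n-1. \end{cases}$$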
Interpolating regime.
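We consider the regime where $p \geq n$. Recall that the pseudoinverse of $X_T$ can be written as $X_T^{\dagger} = X_T^\top (X_T X_T^\top)^{\dagger}$. Thus, letting $\eta := \mathbf{y} - X_T\beta_T$,
$$\beta_T - \hat\beta_T = \beta_T - X_T^\top (X_T X_T^\top)^{\dagger}(X_T\beta_T + \eta) = \left(I - X_T^\top (X_T X_T^\top)^{\dagger} X_T\right)\beta_T - X_T^\top (X_T X_T^\top)^{\dagger}\eta.$$
On the right hand side, the first term is the orthogonal projection of $\beta_T$ onto the null space of $X_T$, while the second term is a vector in the row space of $X_T$. By the Pythagorean theorem, the squared norm of their sum is equal to the sum of their squared norms, so
$$\|\beta_T - \hat\beta_T\|^2 = \left\|\left(I - X_T^\top (X_T X_T^\top)^{\dagger} X_T\right)\beta_T\right\|^2 + \left\|X_T^\top (X_T X_T^\top)^{\dagger}\eta\right\|^2.$$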
We analyze the expected values of these two terms by exploiting properties of the standard normal distribution.
First term.
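Note that $\Pi_T := X_T^\top (X_T X_T^\top)^{\dagger} X_T$ is the orthogonal projection matrix for the row space of $X_T$. So, by the Pythagorean theorem, we have
$$\|(I - \Pi_T)\beta_T\|^2 = \|\beta_T\|^2 - \|\Pi_T\beta_T\|^2.$$
By rotational symmetry of the standard normal distribution, it follows that
$$\mathbb{E}\left[\|\Pi_T\beta_T\|^2\right] = \frac{n}{p}\|\beta_T\|^2.$$
Therefore
$$\mathbb{E}\left[\|(I - \Pi_T)\beta_T\|^2\right] = \|\beta_T\|^2\left(1 - \frac{n}{p}\right).$$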
Second term.
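We use the “trace trick” to write
$$\left\|X_T^\top (X_T X_T^\top)^{\dagger}\eta\right\|^2 = \operatorname{tr}\left((X_T X_T^\top)^{\dagger}\eta\eta^\top\right) = \operatorname{tr}\left((X_T X_T^\top)^{-1}\eta\eta^\top\right),$$
where the second equality holds almost surely because $X_T X_T^\top$ is almost surely invertible. Since $X_T$ and $\eta$ are uncorrelated (and jointly Gaussian, hence independent), it follows that
$$\mathbb{E}\left[\operatorname{tr}\left((X_T X_T^\top)^{-1}\eta\eta^\top\right)\right] = \operatorname{tr}\left(\mathbb{E}\left[(X_T X_T^\top)^{-1}\right]\mathbb{E}\left[\eta\eta^\top\right]\right).$$
The distribution of $\eta$ is normal with mean zero and covariance $(\|\beta_{T^c}\|^2 + \sigma^2) I_n$, so
$$\operatorname{tr}\left(\mathbb{E}\left[(X_T X_T^\top)^{-1}\right]\mathbb{E}\left[\eta\eta^\top\right]\right) = (\|\beta_{T^c}\|^2 + \sigma^2)\operatorname{tr}\left(\mathbb{E}\left[(X_T X_T^\top)^{-1}\right]\right).$$
The distribution of $(X_T X_T^\top)^{-1}$ is inverse-Wishart with identity scale matrix $I_n$ and $p$ degrees-of-freedom. Each diagonal entry of $(X_T X_T^\top)^{-1}$, for $p \geq n$, has a reciprocal that follows the $\chi^2$ distribution with $p-n+1$ degrees-of-freedom. Hence each diagonal entry has expectation $1/(p-n-1)$ if $p \geq n+2$ and $+\infty$ if $p \in \{n, n+1\}$. Therefore
$$\operatorname{tr}\left(\mathbb{E}\left[(X_T X_T^\top)^{-1}\right]\right) = \begin{cases} \dfrac{n}{p-n-1} & \text{if } p \geq n+2; \\ +\infty & \text{if } p \in \{n, n+1\}. \end{cases}$$
We conclude that
$$\mathbb{E}\left[\left\|X_T^\top (X_T X_T^\top)^{\dagger}\eta\right\|^2\right] = \begin{cases} (\|\beta_{T^c}\|^2 + \sigma^2)\dfrac{n}{p-n-1} & \text{if } p \geq n+2; \\ +\infty & \text{if } p \in \{n, n+1\}. \end{cases}$$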
Combining the first and second terms gives the claimed expression for the risk. ∎
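The two expectations that drive the interpolating-regime calculation can be checked by simulation. The sketch below is only a numerical illustration under assumed sizes and seed: `A` stands in for $X_T$ and `b` for $\beta_T$, and it estimates the expected squared norm of the projection onto the row space (which should be near $(n/p)\|\beta_T\|^2$) and the expected trace of the inverse Gram matrix (which should be near $n/(p-n-1)$ when $p \geq n+2$).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 10, 30, 4000          # illustrative sizes (assumed), p >= n + 2
b = rng.standard_normal(p)           # stands in for beta_T

proj_sq = np.empty(trials)
inv_trace = np.empty(trials)
for i in range(trials):
    A = rng.standard_normal((n, p))  # stands in for X_T
    G = A @ A.T                      # n x n Gram matrix, invertible almost surely
    Ab = A @ b
    # Squared norm of the projection of b onto the row space of A:
    # b^T A^T (A A^T)^{-1} A b.
    proj_sq[i] = Ab @ np.linalg.solve(G, Ab)
    inv_trace[i] = np.trace(np.linalg.inv(G))

proj_ratio = proj_sq.mean() / (b @ b)   # compare with n / p
trace_mean = inv_trace.mean()           # compare with n / (p - n - 1)
```

Repeating the experiment with $p \in \{n, n+1\}$ would give unstable averages, consistent with the infinite expectation in those cases.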
2.3 Concentration
We briefly consider the measure concentration of .
Theorem 2.
Consider the setting from Theorem 1, and fix any . If , then
[TABLE]
with probability at least
[TABLE]
If , then
[TABLE]
with probability at least
[TABLE]
The proof is given in Appendix A. The main idea for the case $p \geq n$ is as follows. From the proof of Theorem 1, we have the decomposition
$$\|\beta_T - \hat\beta_T\|^2 = \left\|\left(I - X_T^\top (X_T X_T^\top)^{\dagger} X_T\right)\beta_T\right\|^2 + \eta^\top (X_T X_T^\top)^{-1}\eta, \qquad \eta := \mathbf{y} - X_T\beta_T.$$
The first term is the squared distance from $\beta_T$ to a uniformly random $n$-dimensional subspace of $\mathbb{R}^p$. This squared distance has the same distribution as the squared distance from a uniformly random vector of length $\|\beta_T\|$ to a fixed $n$-dimensional subspace of $\mathbb{R}^p$. Thus measure concentration on the unit sphere can be used here. The second term is a (random) quadratic form in the Gaussian random vector $\eta$. Gaussian concentration is readily applied after controlling the spectral properties of the Wishart random matrix $X_T X_T^\top$. (The case $p < n$ is similar to the analysis of this second term.)
The same arguments can be used to give fixed-level confidence bounds; see Proposition 2 in Appendix B.
Finally, it is also possible to compare $\|\beta_T\|^2$ to $(p/D)\|\beta\|^2$ (and $\|\beta_{T^c}\|^2$ to $(1 - p/D)\|\beta\|^2$) under the random selection model of $T$ from Corollary 1 using concentration inequalities for sampling without replacement [see, e.g., BM15]. The following is a simple consequence of Proposition 1.4 of [BM15].
Proposition 1.
For any , with probability at least ,
[TABLE]
where .
The proof is in Appendix C. The crucial parameter has range . It is small when there are many relevant “weak” features, each with a relatively small coefficient in ; conversely, it is large when is concentrated on a sparse subset of features.
3 Fourier series model
In this section, we consider a noise-free Fourier series model, which can be regarded as a one-dimensional version of the random Fourier features model studied by [RR08] for functions defined on the unit circle.
Let denote the discrete Fourier transform matrix: its -th entry is
[TABLE]
where is a primitive root of unity. Let for some . Consider the following observation model:
1. and are independent random subsets of . For any , the membership of in (respectively, ) is determined by an independent Bernoulli variable with mean (respectively, ).
2. We observe the design matrix and -dimensional vector of responses . Here, is the submatrix of with rows from and columns from , and is the subvector of of entries from .
We fit regression coefficients with
[TABLE]
One important property of the discrete Fourier transform matrix that we use is that the matrix has rank for any . This is a consequence of the fact that is Vandermonde. Thus, we have
[TABLE]
In the remainder of this section, we analyze the risk of under a random model for , where
[TABLE]
(which implies ). The random choice of is independent of and . Considering the risk under this random model for is a form of average-case analysis. For simplicity, we only consider the regime where .
Following the arguments from Section 2.1, we have
[TABLE]
Now we take (conditional) expectations with respect to , given and :
[TABLE]
Since has rank , the first trace expression is equal to
[TABLE]
For the second trace expression, we use the explicit formula for and the fact that to obtain
[TABLE]
where the are the eigenvalues of . Therefore, from Equation 1, we have
[TABLE]
To determine the asymptotic behavior of , we use a recent result of [Far11]:
[TABLE]
as with and held fixed. Further, under this limit, we have
[TABLE]
since . Hence we have the following:
Theorem 3.
Assume the setting as above, with and and held fixed. Then
[TABLE]
Note that the right-hand side in the equation from Theorem 3 is well-defined in the limit because the ratios are fixed. It diverges to when is close to , and decreases as approaches . This is the same behavior as in the Gaussian model from Section 2 with random feature selection; we depict a non-asymptotic instantiation of it in Figure 3.
4 Discussion
Our analysis shows that when features are chosen in an uninformed manner, it may be optimal to choose as many as possible—even more than the number of data—rather than limit the number to that which balances bias and variance as suggested by classical analyses. This choice is simple, both conceptually and algorithmically (although it may incur a computational penalty for processing large numbers of parameters), and avoids the need for precise control of regularization parameters. It is reflective of the practice in modern machine learning applications like image and speech recognition, where signal processing-based features are individually weak but in great abundance, and models that use all of the features, notably neural networks, are highly successful. This stands in contrast to the “scientific” scenarios with informed selection of features; for example, in many science and medical applications, features are purposefully chosen based on the detailed understanding of the underlying phenomena. As illustrated by the “prescient” model that selects the best features, in that case choosing the number of features to balance bias and variance can be better than incurring the costs that come with using all of the features.
Finally, we remark that there appears to be a sharp divide between the classical analyses of statistics and machine learning in $p < n$ regimes and the modern “weak but plentiful features” interpolating settings. While the former are deeply explored, an understanding of the latter is only starting to emerge. It is clear that the best practices for model and feature selection depend crucially on the regime of the application.
Acknowledgements
We thank the anonymous referees for their remarks and suggestions (which, in particular, led to the inclusion of Section 2.3). This work was carried out in part while MB was at The Ohio State University. This research was supported by NSF CCF-1740833 and IIS-1815697 awards, a Sloan Research Fellowship, a Google Faculty Award, and a Cheung-Kong Graduate School of Business Fellowship.
Appendix A Proof of Theorem 2
We first consider (i.e., ). From the proof of Theorem 1, we have the decomposition
[TABLE]
where is the orthogonal projection matrix for the row space of , and is normal with mean zero and covariance and independent of . By symmetry of the standard normal distribution, the first term is the squared distance from to a uniformly random -dimensional subspace of . This squared distance has the same distribution as the squared distance from a uniformly random vector of length to a fixed -dimensional subspace of . This argument was also used by [DG03] in their proof of the Johnson-Lindenstrauss lemma. By Lemma 2.2 from [DG03], we have for any ,
[TABLE]
The second term is a (random) quadratic form in . Let , which is non-singular almost surely. By Lemma 4 from [Das00], we have for any ,
[TABLE]
where is the ratio of the largest singular value of to the smallest singular value of . For any ,
[TABLE]
These inequalities follow from Gaussian comparison inequalities and concentration of measure on the sphere and in Gaussian space [see, e.g., RV09, Ver18]. Therefore, for ,
[TABLE]
Finally, observe that has a -distribution with degrees of freedom. Therefore, again using Lemma 4 from [Das00] and a union bound, we have for any ,
[TABLE]
Putting these probability inequalities together (with ) completes the proof for .
Now we consider (i.e., ). We have
[TABLE]
The matrix is non-singular almost surely, so also holds almost surely. Note that has the same eigenvalues as , and hence has the same eigenvalues as . Therefore, following essentially the same arguments as above for handling (but switching the roles of and , and hence replacing with ) completes the proof for . ∎
Appendix B Confidence bounds
Fixed-level confidence bounds can be immediately derived from the probability inequalities in Appendix A.
Proposition 2.
Consider the setting from Theorem 1 and fix any . If , then with probability at least ,
[TABLE]
If , then with probability at least ,
[TABLE]
In the expressions above, we assume and are large enough (perhaps in relation to each other) so that all denominators are positive.
Appendix C Proof of Proposition 1
Let denote a random sample of cardinality from the finite population , drawn without replacement, so that . Since , we have
[TABLE]
Observe that the finite population has mean , variance , and range . Therefore, Proposition 1.4 of [BM15] and a union bound imply, with probability at least ,
[TABLE]
If is more than , then we can replace by on the right-hand side by analogously applying the previous argument to the random sample of cardinality that determines . ∎
References
- [AS17] Madhu S Advani and Andrew M Saxe “High-dimensional dynamics of generalization error in neural networks” In arXiv preprint arXiv:1710.03667, 2017
- [Bar+20] Peter L Bartlett, Philip M Long, Gábor Lugosi and Alexander Tsigler “Benign overfitting in linear regression” In Proceedings of the National Academy of Sciences, National Acad Sciences, 2020
- [Bel+19] Mikhail Belkin, Daniel Hsu, Siyuan Ma and Soumik Mandal “Reconciling modern machine learning practice and the bias-variance trade-off” In Proceedings of the National Academy of Sciences 116.32, 2019, pp. 15849–15854
- [BF83] Leo Breiman and David Freedman “How many variables should be entered in a regression equation?” In Journal of the American Statistical Association 78.381, Taylor & Francis Group, 1983, pp. 131–136
- [BHX19] Mikhail Belkin, Daniel Hsu and Ji Xu “Two models of double descent for weak features” In arXiv preprint arXiv:1903.07571v1, 2019
- [BM15] Rémi Bardenet and Odalric-Ambrym Maillard “Concentration inequalities for sampling without replacement” In Bernoulli 21.3, Bernoulli Society for Mathematical Statistics and Probability, 2015, pp. 1361–1385
- [BO98] Siegfried Bös and Manfred Opper “Dynamics of batch training in a perceptron” In Journal of Physics A: Mathematical and General 31.21, IOP Publishing, 1998, pp. 4835
- [Das00] Sanjoy Dasgupta “Learning probability distributions”, 2000
