More Data Can Hurt for Linear Regression: Sample-wise Double Descent
Preetum Nakkiran

TL;DR
This paper shows that in overparameterized linear regression with isotropic Gaussian covariates, there is a regime where adding samples increases test risk. The cause is an unconventional bias-variance tradeoff, which contradicts the assumption that more data always improves model performance.
Contribution
It isolates and explains sample-wise double descent in a simple linear regression setting, showing how additional data can increase the test risk of the estimator found by gradient descent.
Findings
Test risk can increase with more samples in overparameterized linear regression.
In the overparameterized regime, bias decreases with more samples while variance increases, which produces the sample-wise double descent.
The phenomenon is explained through an unconventional bias-variance tradeoff.
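The findings above can be made concrete with a short worked decomposition. This is a sketch under assumptions that this page does not spell out: a well-specified model y = x^T β + ε with x ~ N(0, I_d) and noise variance σ², n < d - 1 samples stacked in X, and the minimum-norm interpolating solution (the one gradient descent reaches from zero initialization). The symbols P, σ, and the excess-risk notation are introduced here for illustration.

```latex
% Assumed setting: y = x^\top \beta + \varepsilon, x \sim N(0, I_d),
% \varepsilon \sim N(0, \sigma^2), n < d - 1.
% P is the orthogonal projection onto the row space of X.
\begin{align*}
\hat\beta &= X^\top (X X^\top)^{-1} y
           = P\beta + X^\top (X X^\top)^{-1}\varepsilon, \\
\mathbb{E}[\hat\beta] &= \tfrac{n}{d}\,\beta, \\
\mathrm{Bias}^2 &= \bigl\|\mathbb{E}[\hat\beta] - \beta\bigr\|^2
                 = \Bigl(1 - \tfrac{n}{d}\Bigr)^{2}\|\beta\|^2, \\
\mathrm{Variance} &= \mathbb{E}\bigl\|\hat\beta - \mathbb{E}[\hat\beta]\bigr\|^2
                   = \tfrac{n}{d}\Bigl(1 - \tfrac{n}{d}\Bigr)\|\beta\|^2
                     + \sigma^2\,\frac{n}{d - n - 1}, \\
\mathbb{E}\bigl\|\hat\beta - \beta\bigr\|^2 &= \mathrm{Bias}^2 + \mathrm{Variance}
   = \Bigl(1 - \tfrac{n}{d}\Bigr)\|\beta\|^2 + \sigma^2\,\frac{n}{d - n - 1}.
\end{align*}
```

Under these assumptions the squared bias shrinks as n grows, while the noise term σ² n / (d - n - 1) grows without bound as n approaches d. Once that term dominates, each additional sample raises the expected excess risk.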
Abstract
In this expository note we describe a surprising phenomenon in overparameterized linear regression, where the dimension exceeds the number of samples: there is a regime where the test risk of the estimator found by gradient descent increases with additional samples. In other words, more data actually hurts the estimator. This behavior is implicit in a recent line of theoretical works analyzing the "double-descent" phenomenon in linear models. In this note, we isolate and understand this behavior in an extremely simple setting: linear regression with isotropic Gaussian covariates. In particular, this occurs due to an unconventional type of bias-variance tradeoff in the overparameterized regime: the bias decreases with more samples, but variance increases.
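The behavior described in the abstract can be checked numerically. The following is a minimal simulation sketch, not the author's code: it assumes the estimator found by gradient descent is the minimum-norm interpolating solution (computed here with a pseudoinverse), and the dimension `d = 50`, noise level `sigma = 0.5`, unit-norm `beta`, trial count, and the helper name `bias_variance` are all illustrative choices.

```python
import numpy as np


def bias_variance(n, d, sigma, n_trials, rng):
    """Monte Carlo estimate of squared bias and variance of the
    minimum-norm least-squares estimator with n samples in d dimensions."""
    beta = np.zeros(d)
    beta[0] = 1.0  # assumed ground truth with unit norm

    estimates = np.empty((n_trials, d))
    for t in range(n_trials):
        X = rng.standard_normal((n, d))               # isotropic Gaussian covariates
        y = X @ beta + sigma * rng.standard_normal(n)  # noisy linear responses
        estimates[t] = np.linalg.pinv(X) @ y           # minimum-norm interpolator

    mean_estimate = estimates.mean(axis=0)
    # The empirical squared bias also carries Monte Carlo error of order variance / n_trials.
    bias_sq = float(np.sum((mean_estimate - beta) ** 2))
    variance = float(np.mean(np.sum((estimates - mean_estimate) ** 2, axis=1)))
    return bias_sq, variance


rng = np.random.default_rng(0)
d, sigma = 50, 0.5
results = {n: bias_variance(n, d, sigma, 400, rng) for n in (10, 25, 40, 45)}

for n, (bias_sq, variance) in results.items():
    # bias^2 + variance equals the average squared error ||beta_hat - beta||^2
    print(f"n={n:2d}  bias^2={bias_sq:.3f}  variance={variance:.3f}  "
          f"excess risk={bias_sq + variance:.3f}")
```

With these settings all sample sizes stay below `d`, so every run is in the overparameterized regime. The printed excess risk is the test risk minus the irreducible noise `sigma**2`, which holds because the covariates are isotropic.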
Taxonomy
Topics: Random Matrices and Applications · Statistical Methods and Inference · Statistical Methods and Bayesian Inference
Methods: Test · Linear Regression
