More Data Can Hurt for Linear Regression: Sample-wise Double Descent

Preetum Nakkiran

arXiv:1912.07242·stat.ML·December 17, 2019·42 cites

TL;DR

This paper shows that in overparameterized linear regression with isotropic Gaussian covariates, there is a regime where adding samples increases the test risk of the estimator found by gradient descent. The cause is an unconventional bias-variance tradeoff: bias decreases with more samples, but variance increases.

Contribution

The note isolates and explains sample-wise double descent in an extremely simple setting, linear regression with isotropic Gaussian covariates, and shows how additional samples can increase test risk.

Findings

01 Test risk can increase with more samples in overparameterized linear regression.

02 In the overparameterized regime, bias decreases while variance increases with more samples, which produces the sample-wise double descent.

03 The phenomenon is explained through an unconventional bias-variance tradeoff.
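
As a worked illustration of the tradeoff in the findings above, the following is the standard closed form for the expected excess risk of the minimum-norm interpolator under isotropic Gaussian covariates. It is not stated on this page: the symbols (n samples, d dimensions, true parameter β, noise variance σ²) and the condition n < d - 1 are assumptions of this sketch, and the split into two terms may not match the paper's exact definitions of bias and variance.

```latex
\mathbb{E}\,\lVert \hat\beta - \beta \rVert^2
  = \underbrace{\Bigl(1 - \frac{n}{d}\Bigr)\lVert \beta \rVert^2}_{\text{shrinks as } n \text{ grows}}
  + \underbrace{\sigma^2 \,\frac{n}{d - n - 1}}_{\text{grows as } n \to d - 1},
\qquad n < d - 1 .
```

For instance, with d = 50, ‖β‖² = 1, and σ² = 0.25, this expression gives roughly 0.86 at n = 10 and roughly 1.31 at n = 40, so quadrupling the sample size raises the expected risk.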

Abstract

In this expository note we describe a surprising phenomenon in overparameterized linear regression, where the dimension exceeds the number of samples: there is a regime where the test risk of the estimator found by gradient descent increases with additional samples. In other words, more data actually hurts the estimator. This behavior is implicit in a recent line of theoretical works analyzing the "double-descent" phenomenon in linear models. In this note, we isolate and understand this behavior in an extremely simple setting: linear regression with isotropic Gaussian covariates. In particular, this occurs due to an unconventional type of bias-variance tradeoff in the overparameterized regime: the bias decreases with more samples, but variance increases.
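
The behavior described in the abstract can be checked numerically. The sketch below is a minimal simulation under stated assumptions, not the author's code: the function name `mean_excess_risk`, the dimension d = 50, the noise level σ = 0.5, the unit-norm true parameter, and the trial count are all illustrative choices. It uses the minimum-norm least-squares solution (what `np.linalg.lstsq` returns for an underdetermined system), which is the solution gradient descent converges to when started from zero.

```python
import numpy as np


def mean_excess_risk(n, d=50, sigma=0.5, trials=200, seed=0):
    """Monte Carlo estimate of the excess test risk of the min-norm
    least-squares estimator with n samples in d dimensions.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    beta = np.zeros(d)
    beta[0] = 1.0  # hypothetical true parameter with unit norm

    risks = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))            # isotropic Gaussian covariates
        y = X @ beta + sigma * rng.standard_normal(n)
        # lstsq returns the minimum-norm solution when n < d
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        # For isotropic covariates the excess test risk is ||beta_hat - beta||^2
        risks.append(np.sum((beta_hat - beta) ** 2))
    return float(np.mean(risks))


if __name__ == "__main__":
    for n in (10, 40, 200):
        print(f"n = {n:3d}  estimated excess risk = {mean_excess_risk(n):.3f}")
```

Under these assumptions, the estimated risk at n = 40 should come out higher than at n = 10 (more data hurts while n is still below d), and then fall well below both once n is much larger than d.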

Code & Models

Repositories

IyarLin/paper_results_reproduction

Taxonomy

Topics: Random Matrices and Applications · Statistical Methods and Inference · Statistical Methods and Bayesian Inference

Methods: Test · Linear Regression