Provable More Data Hurt in High Dimensional Least Squares Estimator

Zeng Li; Chuanlong Xie; Qinwen Wang

arXiv:2008.06296 · stat.ML · August 17, 2020 · 6 cites

Open Access

TL;DR

This paper analyzes the finite-sample prediction risk of high-dimensional least squares estimators, revealing that increasing data can sometimes worsen prediction accuracy, supported by theoretical derivations including a CLT and confidence intervals.

Contribution

It provides the first finite-sample distribution and CLT for prediction risk in high-dimensional least squares, demonstrating the counterintuitive 'more data hurt' phenomenon.

Findings

01

Prediction risk is nonmonotonic with sample size.

02

Finite-sample distribution and confidence intervals are derived.

03

The 'more data hurt' phenomenon is confirmed in high dimensions.

Abstract

This paper investigates the finite-sample prediction risk of the high-dimensional least squares estimator. We derive the central limit theorem for the prediction risk when both the sample size and the number of features tend to infinity. Furthermore, the finite-sample distribution and the confidence interval of the prediction risk are provided. Our theoretical results demonstrate the sample-wise nonmonotonicity of the prediction risk and confirm the "more data hurt" phenomenon.
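The sample-wise nonmonotonicity described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's procedure: it assumes isotropic Gaussian features, Gaussian noise, a unit-norm coefficient vector, and the minimum-norm least squares solution computed with `np.linalg.pinv`. The function name `mean_excess_risk` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def mean_excess_risk(n, p=50, sigma=0.5, trials=100, seed=0):
    """Monte Carlo estimate of the excess prediction risk of min-norm least squares.

    Assumptions (not taken from the paper): identity feature covariance,
    Gaussian noise with standard deviation sigma, unit-norm true coefficients.
    """
    rng = np.random.default_rng(seed)
    beta = np.ones(p) / np.sqrt(p)  # assumed true coefficient vector, unit norm
    risks = []
    for _ in range(trials):
        X = rng.standard_normal((n, p))                # n samples, p features
        y = X @ beta + sigma * rng.standard_normal(n)  # linear model plus noise
        beta_hat = np.linalg.pinv(X) @ y               # min-norm least squares solution
        # With identity feature covariance, the excess risk on a fresh test
        # point equals the squared estimation error of the coefficients.
        risks.append(np.sum((beta_hat - beta) ** 2))
    return float(np.mean(risks))


if __name__ == "__main__":
    # Sweep the sample size with the number of features fixed at p = 50.
    for n in (25, 45, 100, 200):
        print(f"n={n:4d}  estimated excess risk = {mean_excess_risk(n):.3f}")
```

Under these assumptions, the estimated risk at n = 45 should exceed the one at n = 25 even though more samples are used, and it should fall again once n is well above p. The sweep skips n close to p, where the Monte Carlo average is unstable.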

Taxonomy

Topics: Random Matrices and Applications · Statistical Methods and Inference · Bayesian Methods and Mixture Models