Provable More Data Hurt in High Dimensional Least Squares Estimator
Zeng Li, Chuanlong Xie, Qinwen Wang

TL;DR
This paper analyzes the finite-sample prediction risk of high-dimensional least squares estimators, revealing that increasing data can sometimes worsen prediction accuracy, supported by theoretical derivations including a CLT and confidence intervals.
Contribution
It provides the first finite-sample distribution and CLT for prediction risk in high-dimensional least squares, demonstrating the counterintuitive 'more data hurt' phenomenon.
Findings
Prediction risk is nonmonotonic in the sample size.
Finite-sample distribution and confidence intervals are derived.
The 'more data hurt' phenomenon is confirmed in high dimensions.
Abstract
This paper investigates the finite-sample prediction risk of the high-dimensional least squares estimator. We derive the central limit theorem for the prediction risk when both the sample size and the number of features tend to infinity. Furthermore, the finite-sample distribution and the confidence interval of the prediction risk are provided. Our theoretical results demonstrate the sample-wise nonmonotonicity of the prediction risk and confirm the "more data hurt" phenomenon.
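The sample-wise nonmonotonicity described in the abstract can be illustrated with a small Monte Carlo sketch. It is not the paper's derivation, and everything in it is an assumption chosen for illustration: isotropic Gaussian features, a linear model with Gaussian noise, a fixed coefficient vector of unit norm, the minimum-norm least squares solution computed with a pseudoinverse, and the hypothetical helper name mean_prediction_risk. Under isotropic features, the excess prediction risk at an independent test point equals the squared estimation error, which is what the sketch averages.

```python
import numpy as np


def mean_prediction_risk(n, p, sigma=1.0, n_trials=200, seed=0):
    """Monte Carlo estimate of the excess prediction risk ||beta_hat - beta||^2.

    Illustrative assumptions (not taken from the paper): rows of X are
    i.i.d. N(0, I_p), y = X @ beta + noise, and beta_hat is the
    minimum-norm least squares solution.
    """
    rng = np.random.default_rng(seed)
    beta = np.ones(p) / np.sqrt(p)  # fixed signal with ||beta|| = 1
    risks = np.empty(n_trials)
    for t in range(n_trials):
        X = rng.standard_normal((n, p))
        y = X @ beta + sigma * rng.standard_normal(n)
        # pinv gives the ordinary least squares solution when n > p
        # and the minimum-norm interpolating solution when n < p.
        beta_hat = np.linalg.pinv(X) @ y
        # For isotropic test features, E[(x0 @ (beta_hat - beta))^2]
        # reduces to the squared Euclidean estimation error.
        risks[t] = np.sum((beta_hat - beta) ** 2)
    return risks.mean()


p = 50  # number of features, held fixed while n varies
risk_by_n = {n: mean_prediction_risk(n, p) for n in (25, 45, 100, 200)}

for n, risk in risk_by_n.items():
    print(f"n = {n:3d}, p = {p}: estimated risk = {risk:.3f}")
```

In this setup the estimated risk at n = 45 should come out well above the risk at n = 25, even though the larger sample contains more data, and it falls again once n is comfortably larger than p. The spike near n = p reflects the variance of the estimator blowing up as the design matrix becomes nearly singular.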
Taxonomy
Topics: Random Matrices and Applications · Statistical Methods and Inference · Bayesian Methods and Mixture Models
