Information-Based Optimal Subdata Selection for Big Data Linear Regression
HaiYing Wang, Min Yang, and John Stufken

TL;DR
This paper introduces information-based optimal subdata selection (IBOSS), a data reduction method for big data linear regression that is faster than leading subsampling methods, suits distributed parallel computing, and yields slope estimators whose variances shrink as the full data size grows.
Contribution
The paper proposes IBOSS, a new optimal subdata selection method that improves computational efficiency and estimator accuracy over existing subsampling approaches in big data linear regression.
Findings
IBOSS is significantly faster than leading existing subdata methods.
The variances of the slope parameter estimators decrease as the full data size increases.
IBOSS performs well in simulations and real data analysis.
Abstract
Extraordinary amounts of data are being produced in many branches of science. Proven statistical methods are no longer applicable with extraordinarily large data sets due to computational limitations. A critical step in big data analysis is data reduction. Existing investigations in the context of linear regression focus on subsampling-based methods. However, not only is this approach prone to sampling errors, it also leads to a covariance matrix of the estimators that is typically bounded from below by a term that is of the order of the inverse of the subdata size. We propose a novel approach, termed information-based optimal subdata selection (IBOSS). Compared to leading existing subdata methods, the IBOSS approach has the following advantages: (i) it is significantly faster; (ii) it is suitable for distributed parallel computing; (iii) the variances of the slope parameter estimators…
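To make the contrast with random subsampling concrete, here is a minimal sketch in Python with NumPy. The abstract above does not spell out the selection rule, so the rule used here is an assumption: for each covariate in turn, keep the rows with the smallest and largest values among rows not yet selected, then fit ordinary least squares on the kept rows. The function names `iboss_select` and `ols`, the simulated data, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def iboss_select(X, k):
    """Return row indices of a subdata set of size 2 * r * p, with r = k // (2p).

    Assumed rule (not stated in the summary above): for each covariate,
    take the r smallest and r largest values among rows still available.
    """
    n, p = X.shape
    r = k // (2 * p)  # rows taken from each tail of each covariate
    if r < 1 or 2 * r * p > n:
        raise ValueError("k must satisfy 2p <= k and 2 * (k // (2p)) * p <= n")
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)             # rows not yet selected
        order = idx[np.argsort(X[idx, j], kind="stable")]
        picked = np.concatenate([order[:r], order[-r:]])
        chosen.append(picked)
        available[picked] = False                   # exclude from later covariates
    return np.concatenate(chosen)


def ols(X, y):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta


# Simulated data (illustrative only): n rows, p covariates, subdata size k.
rng = np.random.default_rng(0)
n, p, k = 100_000, 5, 1_000
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, 0.5, -0.5, 2.0, 0.0, 1.5])  # intercept, then slopes
y = beta_true[0] + X @ beta_true[1:] + rng.standard_normal(n)

sel = iboss_select(X, k)
beta_iboss = ols(X[sel], y[sel])

# Uniform random subsample of the same size, for comparison.
uni = rng.choice(n, size=k, replace=False)
beta_unif = ols(X[uni], y[uni])

print("extreme-value subdata slopes:", beta_iboss[1:])
print("uniform subsample slopes:    ", beta_unif[1:])
```

A full sort per covariate is used here for brevity; a partial selection such as `np.partition` would avoid sorting the entire column, and the per-covariate passes are the kind of step that could be split across workers.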
