Information-Based Optimal Subdata Selection for Big Data Linear Regression

HaiYing Wang, Min Yang, and John Stufken

arXiv:1710.10382 · stat.ME · June 27, 2019


TL;DR

This paper introduces IBOSS, a novel data reduction method for big data linear regression that outperforms traditional subsampling techniques in speed, accuracy, and scalability, especially for large datasets.

Contribution

The paper proposes IBOSS, a new optimal subdata selection method that improves computational efficiency and estimator accuracy over existing subsampling approaches in big data linear regression.

Findings

01

IBOSS is significantly faster than existing methods.

02

The variances of the slope parameter estimators decrease as the full data size increases.

03

IBOSS performs well in simulations and real data analysis.

Abstract

Extraordinary amounts of data are being produced in many branches of science. Proven statistical methods are no longer applicable with extraordinarily large data sets due to computational limitations. A critical step in big data analysis is data reduction. Existing investigations in the context of linear regression focus on subsampling-based methods. However, not only is this approach prone to sampling errors, it also leads to a covariance matrix of the estimators that is typically bounded from below by a term that is of the order of the inverse of the subdata size. We propose a novel approach, termed information-based optimal subdata selection (IBOSS). Compared to leading existing subdata methods, the IBOSS approach has the following advantages: (i) it is significantly faster; (ii) it is suitable for distributed parallel computing; (iii) the variances of the slope parameter estimators…
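The abstract as shown here does not spell out how IBOSS chooses its subdata. As a rough illustration, the sketch below assumes the D-optimality motivated rule described in the paper: for each covariate in turn, keep the r = k/(2p) rows with the smallest values and the r rows with the largest values among rows not yet selected. The names `iboss_select` and `centered_ss` are illustrative, and `heapq` stands in for the linear-time partition step a production implementation would likely use.

```python
import heapq
import random


def iboss_select(X, k):
    """Return sorted row indices of an IBOSS-style subdata of size 2*p*(k // (2*p)).

    X is a list of rows, each a list of p covariate values (no intercept).
    Assumed rule: for each covariate, take the r smallest and r largest
    values among the rows that have not been selected yet.
    """
    n, p = len(X), len(X[0])
    r = k // (2 * p)
    if r < 1 or 2 * p * r > n:
        raise ValueError("need k >= 2p and k <= n")
    selected = set()
    for j in range(p):
        remaining = [i for i in range(n) if i not in selected]
        key = lambda i: X[i][j]
        selected.update(heapq.nsmallest(r, remaining, key=key))
        # Re-filter so ties cannot place one row in both tails.
        remaining = [i for i in remaining if i not in selected]
        selected.update(heapq.nlargest(r, remaining, key=key))
    return sorted(selected)


def centered_ss(X, idx, j):
    """Centered sum of squares of covariate j over the rows in idx."""
    vals = [X[i][j] for i in idx]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals)


if __name__ == "__main__":
    random.seed(0)
    n, p, k = 5000, 3, 60  # illustrative sizes, not taken from the paper
    X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    iboss_idx = iboss_select(X, k)
    uniform_idx = random.sample(range(n), k)
    for j in range(p):
        print(j, centered_ss(X, iboss_idx, j), centered_ss(X, uniform_idx, j))
```

Because the selected rows sit in the tails of each covariate, the subdata carries more spread per covariate than a uniform subsample of the same size. That extra spread is what lets the slope estimates improve as the full data grows, while the selection itself depends only on the covariates and not on the response.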
