Statistical Methods and Computing for Big Data
Chun Wang, Ming-Hui Chen, Elizabeth Schifano, Jing Wu, and Jun Yan

TL;DR
This paper reviews recent statistical methodologies and software tools designed to handle big data challenges, emphasizing subsampling, divide and conquer, and sequential updating techniques, with practical R package examples.
Contribution
It provides a comprehensive overview of new statistical methods and open source software tailored for big data analysis, highlighting their applications and implementation.
Findings
Review of three classes of methods: subsampling-based, divide and conquer, and sequential updating for stream data.
Review of open source R packages for big data analytics.
Case study demonstrating logistic regression on airline delay data.
Abstract
Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard software tools. They present opportunities as well as challenges to statisticians. The role of computational statisticians in scientific discovery from big data analyses has been under-recognized even by peer statisticians. This article reviews recent methodological and software developments in statistics that address the big data challenges. Methodologies are grouped into three classes: subsampling-based, divide and conquer, and sequential updating for stream data. Software review focuses on the open source R and R packages, covering recent tools that help break the barriers of computer memory and computing power. Some of the tools are illustrated in a case study with a logistic regression for the chance of airline delay.
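To make the sequential updating idea concrete, here is a minimal sketch, not taken from the paper. It assumes ordinary least squares rather than the logistic regression of the case study, because for least squares the running sums X'X and X'y are exact sufficient statistics and each chunk can be discarded once it has been folded in. The function names (`update_sufficient_stats`, `fit_in_chunks`) and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np


def update_sufficient_stats(xtx, xty, x_chunk, y_chunk):
    """Fold one chunk of rows into the running X'X and X'y."""
    return xtx + x_chunk.T @ x_chunk, xty + x_chunk.T @ y_chunk


def fit_in_chunks(chunks, n_features):
    """Least-squares fit that holds only one chunk in memory at a time."""
    xtx = np.zeros((n_features, n_features))
    xty = np.zeros(n_features)
    for x_chunk, y_chunk in chunks:
        xtx, xty = update_sufficient_stats(xtx, xty, x_chunk, y_chunk)
    # Solve the normal equations once all chunks have been seen.
    return np.linalg.solve(xtx, xty)


# Simulated data standing in for a data set too large to load at once.
rng = np.random.default_rng(0)
n_rows, chunk_size = 10_000, 1_000
X = np.column_stack([np.ones(n_rows), rng.normal(size=(n_rows, 2))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n_rows)

# A generator yields chunks lazily, mimicking a stream or a file read in blocks.
chunks = (
    (X[i:i + chunk_size], y[i:i + chunk_size])
    for i in range(0, n_rows, chunk_size)
)
beta_chunked = fit_in_chunks(chunks, n_features=3)

# Reference fit using all rows at once, for comparison only.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
```

In this linear case the chunked estimate matches the full-data fit up to floating-point error. For logistic regression no such finite set of sufficient statistics exists, so chunk-wise or streaming approaches are generally approximations rather than exact reconstructions.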
