Optimal Subsampling Bootstrap for Massive Data

Yingying Ma; Chenlei Leng; Hansheng Wang

arXiv:2302.07533·stat.ME·February 16, 2023

Open Access

TL;DR

This paper introduces a new hyperparameter selection method for subsampling bootstrap techniques, optimizing accuracy and computational efficiency for large datasets, and demonstrates its effectiveness through simulations.

Contribution

It develops a closed-form hyperparameter tuning framework for subsampling bootstrap methods, enhancing their performance on massive data.

Findings

01. Optimal hyperparameters improve bootstrap accuracy

02. Framework reduces computational costs

03. Simulation results show performance gains

Abstract

The bootstrap is a widely used procedure for statistical inference because of its simplicity and attractive statistical properties. However, the vanilla version of bootstrap is no longer feasible computationally for many modern massive datasets due to the need to repeatedly resample the entire data. Therefore, several improvements to the bootstrap method have been made in recent years, which assess the quality of estimators by subsampling the full dataset before resampling the subsamples. Naturally, the performance of these modern subsampling methods is influenced by tuning parameters such as the size of subsamples, the number of subsamples, and the number of resamples per subsample. In this paper, we develop a novel hyperparameter selection methodology for selecting these tuning parameters. Formulated as an optimization problem to find the optimal value of some measure of accuracy of…
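To make the three tuning parameters named in the abstract concrete, here is a minimal Python sketch of one well-known subsample-then-resample scheme, in the style of the bag of little bootstraps, estimating the standard error of a sample mean. The choice of scheme, the function name `subsample_bootstrap_se`, and its parameter names are assumptions made for illustration. The sketch does not implement the paper's hyperparameter selection method; it only shows where the subsample size, the number of subsamples, and the number of resamples per subsample enter the computation.

```python
import numpy as np


def subsample_bootstrap_se(data, subsample_size, n_subsamples, n_resamples, seed=0):
    """Estimate the standard error of the mean of a 1-D array.

    Hypothetical illustration: each small subsample is resampled up to
    the full data size n through multinomial counts, so the full data
    never has to be resampled directly.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Equal resampling probability for every point in a subsample.
    probs = np.full(subsample_size, 1.0 / subsample_size)

    se_per_subsample = []
    for _ in range(n_subsamples):
        # Step 1: draw a small subsample without replacement.
        sub = rng.choice(data, size=subsample_size, replace=False)

        # Step 2: resample the subsample to size n. The counts say how
        # often each subsample point appears in a size-n resample.
        means = np.empty(n_resamples)
        for r in range(n_resamples):
            counts = rng.multinomial(n, probs)
            means[r] = np.dot(counts, sub) / n

        # Step 3: spread of the resampled means within this subsample.
        se_per_subsample.append(means.std(ddof=1))

    # Step 4: average the per-subsample estimates.
    return float(np.mean(se_per_subsample))
```

For example, `subsample_bootstrap_se(data, subsample_size=1000, n_subsamples=5, n_resamples=50)` touches only 5,000 of the original points. Raising any of the three arguments generally reduces the variability of the estimate at a higher computational cost, which is the trade-off the abstract says the tuning parameters control.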

Taxonomy

Topics: Statistical Methods and Inference · Machine Learning and Data Classification · Gaussian Processes and Bayesian Inference