Improving optimal subsampling through stratification
Jasper B. Yang, Thomas Lumley, Bryan E. Shepherd, Pamela A. Shaw

TL;DR
This paper compares stratified and individualized optimal subsampling methods for logistic regression, demonstrating that stratified sampling often yields more efficient estimators because optimal stratified designs eliminate between-stratum contributions to the variance.
Contribution
It provides a theoretical and empirical comparison showing that optimal stratified sampling can outperform individualized sampling in efficiency for logistic regression.
Findings
Stratified sampling is often more efficient than individualized sampling.
Optimal stratified designs eliminate between-stratum variance contributions.
Stratified sampling's advantages are underappreciated in current methods.
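The between-stratum point in the findings above can be illustrated with a toy design-based sketch. This is not the paper's logistic regression estimator or its optimal design: it assumes a simple Horvitz-Thompson estimator of a population total, a made-up population of 100 units in two strata, and an outcome that is constant within each stratum, so that all variation lies between strata. The function names, stratum sizes, and sample sizes are hypothetical. Both designs use the same inclusion probabilities; only the individualized (independent, unit-by-unit) design lets the number of units drawn per stratum vary.

```python
import random

def poisson_sample(probs, rng):
    """Individualized design: include unit i independently with probability probs[i]."""
    return [i for i, p in enumerate(probs) if rng.random() < p]

def stratified_sample(strata, sizes, rng):
    """Stratified design: simple random sample of fixed size within each stratum."""
    chosen = []
    for h, units in strata.items():
        chosen.extend(rng.sample(units, sizes[h]))
    return chosen

def ht_total(sample, y, probs):
    """Horvitz-Thompson estimate of the population total of y."""
    return sum(y[i] / probs[i] for i in sample)

def variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Hypothetical population: two strata, y constant within each stratum.
strata = {"a": list(range(0, 60)), "b": list(range(60, 100))}
sizes = {"a": 6, "b": 20}
y = [1.0] * 60 + [5.0] * 40
# Inclusion probabilities matched across the two designs.
probs = [sizes["a"] / 60] * 60 + [sizes["b"] / 40] * 40

rng = random.Random(0)
reps = 2000
pois = [ht_total(poisson_sample(probs, rng), y, probs) for _ in range(reps)]
strat = [ht_total(stratified_sample(strata, sizes, rng), y, probs) for _ in range(reps)]

var_poisson = variance(pois)      # positive: per-stratum sample sizes fluctuate
var_stratified = variance(strat)  # zero here: per-stratum sample sizes are fixed
```

In this contrived case the stratified estimate is the same in every replicate, while the independent design still varies because the realized sample size in each stratum is random. With real data, within-stratum variation remains under both designs, so the gap is smaller than in this sketch.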
Abstract
Recent works have proposed optimal subsampling algorithms to improve computational efficiency in large datasets and to design validation studies in the presence of measurement error. Existing approaches generally fall into two categories: (i) designs that optimize individualized sampling rules, where unit-specific probabilities are assigned and applied independently, and (ii) designs based on stratified sampling with simple random sampling within strata. Focusing on the logistic regression setting, we derive the asymptotic variances of estimators under both approaches and compare them numerically through extensive simulations and an application to data from the Vanderbilt Comprehensive Care Clinic cohort. Our results reinforce that stratified sampling is not merely an approximation to individualized sampling, showing instead that optimal stratified designs are often more efficient than…
Taxonomy
Topics: Statistical Methods and Bayesian Inference · Statistical Methods in Clinical Trials · Advanced Causal Inference Techniques
