Subsampling Suffices for Adaptive Data Analysis
Guy Blanc

TL;DR
This paper shows that subsampling alone can keep adaptive data analysis valid: when each query takes a random subsample as input and outputs only a few bits, its responses generalize even when the queries are chosen adaptively.
Contribution
The paper introduces a simple subsampling-based framework under which query responses remain representative of the population even when queries are chosen adaptively, and which models scenarios not covered by prior work.
Findings
Subsampling ensures generalization in adaptive data analysis.
The proposed mechanism is simple yet state-of-the-art for statistical queries.
The framework models real-world scenarios not covered by previous work.
Abstract
Ensuring that analyses performed on a dataset are representative of the entire population is one of the central problems in statistics. Most classical techniques assume that the dataset is independent of the analyst's query and break down in the common setting where a dataset is reused for multiple, adaptively chosen queries. This problem of adaptive data analysis was formalized in the seminal works of Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014). We identify a remarkably simple set of assumptions under which the queries will continue to be representative even when chosen adaptively: The only requirements are that each query takes as input a random subsample and outputs few bits. This result shows that the noise inherent in subsampling is sufficient to guarantee that query responses generalize. The simplicity of this subsampling-based framework allows it to…
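The two requirements named in the abstract (each query sees only a random subsample and returns few bits) can be illustrated with a minimal Python sketch. This is an illustration under assumptions, not the paper's mechanism: the function names `subsample_query` and `estimate_fraction`, the subsample size of one point, and the choice to average one-bit answers are all hypothetical and chosen only to make the idea concrete.

```python
import random


def subsample_query(dataset, query, subsample_size, rng):
    """Run `query` on a uniformly random subsample drawn without replacement.

    The query never sees the full dataset, only `subsample_size` points.
    """
    subsample = rng.sample(dataset, subsample_size)
    return query(subsample)


def estimate_fraction(dataset, predicate, num_queries, rng):
    """Estimate the fraction of points satisfying `predicate`.

    Hypothetical illustration: each query looks at a single random point
    and returns one bit; the estimate is the average of those bits.
    """
    bits = [
        subsample_query(dataset, lambda s: int(predicate(s[0])), 1, rng)
        for _ in range(num_queries)
    ]
    return sum(bits) / len(bits)


if __name__ == "__main__":
    rng = random.Random(0)
    data = list(range(100))
    # Fraction of points below 50; the true value on this dataset is 0.5.
    print(estimate_fraction(data, lambda x: x < 50, 2000, rng))
```

Each call to `subsample_query` is independent of how the analyst chose `query`, so the only information that flows back to the analyst is the few output bits computed from a random subsample.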
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Machine Learning and Algorithms · Data Stream Mining Techniques
