Subsampling Suffices for Adaptive Data Analysis

Guy Blanc

arXiv:2302.08661 · cs.LG · September 25, 2024

TL;DR

This paper shows that subsampling alone can keep adaptive data analysis valid: when each query takes a random subsample as input and outputs only a few bits, query responses remain representative of the population even when the queries are chosen adaptively.

Contribution

It introduces a simple subsampling-based framework in which query responses remain representative under adaptive analysis, requiring only that each query take a random subsample as input and output few bits. The framework also covers settings that previous work does not.

Findings

01. Subsampling ensures generalization in adaptive data analysis.

02. The proposed mechanism is simple yet state-of-the-art for statistical queries.

03. The framework models real-world scenarios not covered by previous work.

Abstract

Ensuring that analyses performed on a dataset are representative of the entire population is one of the central problems in statistics. Most classical techniques assume that the dataset is independent of the analyst's query and break down in the common setting where a dataset is reused for multiple, adaptively chosen, queries. This problem of adaptive data analysis was formalized in the seminal works of Dwork et al. (STOC, 2015) and Hardt and Ullman (FOCS, 2014). We identify a remarkably simple set of assumptions under which the queries will continue to be representative even when chosen adaptively: The only requirements are that each query takes as input a random subsample and outputs few bits. This result shows that the noise inherent in subsampling is sufficient to guarantee that query responses generalize. The simplicity of this subsampling-based framework allows it to…
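
The two requirements named in the abstract (each query sees only a random subsample and returns few bits) can be illustrated with a minimal Python sketch. This is not the paper's mechanism and carries none of its guarantees. The function name `answer_on_subsample`, the parameters `subsample_size` and `output_bits`, and the toy threshold-adjusting analyst are all hypothetical choices made for illustration.

```python
import random
from typing import Callable, List, Sequence


def answer_on_subsample(
    dataset: Sequence[float],
    query: Callable[[List[float]], int],
    subsample_size: int,
    output_bits: int,
    rng: random.Random,
) -> int:
    """Run `query` on a uniform random subsample and return its few-bit answer.

    Hypothetical helper: it only enforces the two conditions from the abstract
    (random subsample as input, bounded number of output bits).
    """
    if not 0 < subsample_size <= len(dataset):
        raise ValueError("subsample_size must be between 1 and len(dataset)")
    # Draw the subsample uniformly at random, without replacement.
    subsample = rng.sample(list(dataset), subsample_size)
    answer = query(subsample)
    # Reject answers that would need more than `output_bits` bits to encode.
    if not 0 <= answer < 2 ** output_bits:
        raise ValueError("query output does not fit in output_bits")
    return answer


# Toy adaptive analyst (assumption): each new query depends on earlier answers.
rng = random.Random(0)
data = [rng.random() for _ in range(1000)]
threshold = 0.5
answers = []
for _ in range(5):
    bit = answer_on_subsample(
        data,
        # One-bit query: is the subsample mean above the current threshold?
        lambda s, t=threshold: int(sum(s) / len(s) > t),
        subsample_size=20,
        output_bits=1,
        rng=rng,
    )
    answers.append(bit)
    # The next threshold is chosen adaptively from the previous answer.
    threshold += 0.05 if bit else -0.05
```

Each call draws a fresh subsample, so the analyst never touches the full dataset directly and learns at most `output_bits` bits per query.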

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Machine Learning and Algorithms · Data Stream Mining Techniques