How much does your data exploration overfit? Controlling bias via information usage
Daniel Russo, James Zou

TL;DR
This paper introduces an information-theoretic framework to quantify and control bias in data exploration, providing bounds and insights into how adaptive analysis can lead to false discoveries, and proposing methods to mitigate this bias.
Contribution
It proposes a mutual information-based framework to measure and bound exploration bias, connecting it to privacy concepts and offering randomized techniques for bias reduction.
Findings
Mutual information bounds are tight in natural settings.
The framework is applied to analyze bias in filtering, rank selection, and clustering.
Randomization techniques can reduce exploration bias effectively.
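The findings above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' procedure: it assumes n independent test statistics with true mean 0 and standard normal noise, an analyst who reports the largest one, and a randomized variant that selects on noise-perturbed values but reports the unperturbed value. The function names, the Gaussian perturbation, and the parameter `noise_scale` are all hypothetical. As a reference point it uses the standard maximal bound sigma * sqrt(2 log n) for sub-Gaussian statistics, which is the value a mutual-information bound of the form sigma * sqrt(2 I) would take if the selection carried the maximum possible log n nats of information.

```python
import math
import random


def selection_bias(n_stats, noise_scale, trials, seed=0):
    """Monte Carlo estimate of E[phi_T] - mu_T when every true mean mu_i is 0.

    Each trial draws n_stats independent N(0, 1) statistics phi_i, picks the
    index T that maximizes phi_i plus N(0, noise_scale^2) selection noise
    (an illustrative randomization, assumed here), and records the
    unperturbed phi_T. With noise_scale = 0 this is plain "report the max".
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        phi = [rng.gauss(0.0, 1.0) for _ in range(n_stats)]
        perturbed = [p + rng.gauss(0.0, noise_scale) for p in phi]
        chosen = max(range(n_stats), key=perturbed.__getitem__)
        total += phi[chosen]
    # All true means are 0, so the average reported value is the bias.
    return total / trials


def max_entropy_bound(n_stats, sigma=1.0):
    """sigma * sqrt(2 * log n): the bound when selection information is capped at log n."""
    return sigma * math.sqrt(2.0 * math.log(n_stats))


if __name__ == "__main__":
    n = 20
    print("bias, no randomization:  ", selection_bias(n, 0.0, 20000))
    print("bias, noisy selection:   ", selection_bias(n, 2.0, 20000))
    print("reference bound:         ", max_entropy_bound(n))
```

In this toy setting the noisy selection rule depends less on the data, so the reported value is less inflated, at the cost of sometimes selecting a statistic that is not the largest.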
Abstract
Modern data is messy and high-dimensional, and it is often not clear a priori what are the right questions to ask. Instead, the analyst typically needs to use the data to search for interesting analyses to perform and hypotheses to test. This is an adaptive process, where the choice of analysis to be performed next depends on the results of the previous analyses on the same data. Ultimately, which results are reported can be heavily influenced by the data. It is widely recognized that this process, even if well-intentioned, can lead to biases and false discoveries, contributing to the crisis of reproducibility in science. But while any data-exploration renders standard statistical theory invalid, experience suggests that different types of exploratory analysis can lead to disparate levels of bias, and the degree of bias also depends on the particulars…
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Statistical Methods and Bayesian Inference · Statistical Methods and Inference
