The Generic Holdout: Preventing False-Discoveries in Adaptive Data Science

Preetum Nakkiran; Jarosław Błasiok

arXiv:1809.05596·stat.ME·September 18, 2018

Open Access

TL;DR

The paper introduces the Generic Holdout, a simple framework that lets scientists ask exponentially more adaptive queries than prior methods while preventing false discoveries. It works by partitioning the data and limiting how much information the analyst sees, and it applies to a restricted setting where the goal is to find a single (or a few) true hypotheses.

Contribution

It proposes a new framework for adaptive data analysis that exponentially increases the number of valid adaptive queries a data set can support, compared with prior methods such as the Reusable Holdout, in a restricted yet scientifically relevant setting.

Findings

01. Exponential increase in valid adaptive queries compared to previous methods.

02. Simple data partitioning and limited exposure strategy effectively prevent false discoveries.

03. Framework applicable to real-world scientific hypothesis testing.

Abstract

Adaptive data analysis has posed a challenge to science due to its ability to generate false hypotheses on moderately large data sets. In general, with non-adaptive data analyses (where queries to the data are generated without being influenced by answers to previous queries), a data set containing $n$ samples may support exponentially many queries in $n$. This number reduces to linearly many under naive adaptive data analysis, and even sophisticated remedies such as the Reusable Holdout (Dwork et al., 2015) only allow quadratically many queries in $n$. In this work, we propose a new framework for adaptive science which exponentially improves on this number of queries under a restricted yet scientifically relevant setting, where the goal of the scientist is to find a single (or a few) true hypotheses about the universe based on the samples. Such a setting may describe the search for…
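
The failure mode the abstract describes, naive adaptive analysis producing false hypotheses, can be illustrated with a minimal sketch. This is not the Generic Holdout itself, and nothing here is taken from the paper: the function name `simulate_adaptive_reuse`, the sample sizes, and the feature-selection rule are all assumptions chosen for illustration. Features and labels are pure noise, so no classifier can truly beat chance, yet a classifier built from features selected with the help of the holdout will usually look better than chance on that same holdout.

```python
import random


def simulate_adaptive_reuse(n=400, d=400, seed=0):
    """Illustrative only: naive adaptive reuse of a holdout on pure noise.

    Returns (holdout_accuracy, fresh_accuracy) for a classifier whose
    features were chosen by looking at the holdout set.
    """
    rng = random.Random(seed)

    def draw(m):
        # Features and labels are independent fair coin flips in {-1, +1}.
        xs = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(m)]
        ys = [rng.choice((-1, 1)) for _ in range(m)]
        return xs, ys

    def feature_label_sums(xs, ys):
        # Unnormalized empirical correlation of each feature with the label.
        sums = [0] * d
        for x, y in zip(xs, ys):
            for j in range(d):
                sums[j] += x[j] * y
        return sums

    def accuracy(xs, ys, weights):
        correct = 0
        for x, y in zip(xs, ys):
            score = sum(w * xj for w, xj in zip(weights, x))
            pred = 1 if score >= 0 else -1
            correct += pred == y
        return correct / len(ys)

    train_x, train_y = draw(n)
    hold_x, hold_y = draw(n)
    fresh_x, fresh_y = draw(n)

    train_c = feature_label_sums(train_x, train_y)
    hold_c = feature_label_sums(hold_x, hold_y)

    # Adaptive step: keep a feature only if its correlation sign agrees on
    # the training set AND the holdout set. This leaks holdout information.
    weights = [
        (1 if t > 0 else -1) if t * h > 0 else 0
        for t, h in zip(train_c, hold_c)
    ]

    return accuracy(hold_x, hold_y, weights), accuracy(fresh_x, fresh_y, weights)


if __name__ == "__main__":
    holdout_acc, fresh_acc = simulate_adaptive_reuse()
    print(f"holdout accuracy: {holdout_acc:.3f}")
    print(f"fresh-data accuracy: {fresh_acc:.3f}")
```

In this sketch the holdout accuracy is inflated because the holdout was consulted when choosing features, while accuracy on freshly drawn data stays near chance. Limiting what the analyst learns from the held-out data is the general kind of remedy that the Reusable Holdout and the framework proposed here pursue.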

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education