Generalization in Adaptive Data Analysis and Holdout Reuse

Cynthia Dwork; Vitaly Feldman; Moritz Hardt; Toniann Pitassi; Omer Reingold; Aaron Roth

arXiv:1506.02629 · cs.LG · September 28, 2015 · 101 cites

TL;DR

This paper introduces a practical method for reusing a holdout set in adaptive data analysis without overfitting to it. It also extends techniques based on differential privacy and proposes a unifying framework based on max-information.

Contribution

The paper presents a new algorithm for adaptive holdout validation, broadens the application of differential privacy in data analysis, and introduces the concept of approximate max-information for data reuse guarantees.
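The holdout-reuse algorithm in the paper is known as Thresholdout, the name that also appears in the linked repository. As a rough illustration of the idea, the sketch below answers each query from the training set unless the training and holdout averages disagree by more than a noisy threshold, in which case it releases a noised holdout average and spends one unit of a budget. The class name `ReusableHoldout`, its parameters, and the three noise multipliers are assumptions of this sketch and are not taken from this page; consult the paper for the exact procedure and constants.

```python
import random

# Noise multipliers (in units of sigma). These are assumptions of this
# sketch, not values stated on this page.
THRESHOLD_NOISE = 2.0  # noise added to the threshold
COMPARE_NOISE = 4.0    # fresh noise added to each comparison
ANSWER_NOISE = 1.0     # noise added to a released holdout average


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mean(values):
    return sum(values) / len(values)


class ReusableHoldout:
    """Hypothetical wrapper that mediates all access to the holdout set."""

    def __init__(self, train, holdout, threshold, sigma, budget, seed=None):
        # sigma must be positive; it sets the scale of all added noise.
        self.train = list(train)
        self.holdout = list(holdout)
        self.threshold = threshold
        self.sigma = sigma
        self.budget = budget  # number of holdout-based answers allowed
        self.rng = random.Random(seed)
        self.noisy_threshold = self._fresh_threshold()

    def _fresh_threshold(self):
        return self.threshold + laplace(THRESHOLD_NOISE * self.sigma, self.rng)

    def query(self, phi):
        """Estimate the expectation of phi; return None once the budget is spent."""
        if self.budget < 1:
            return None
        train_avg = mean([phi(x) for x in self.train])
        holdout_avg = mean([phi(x) for x in self.holdout])
        eta = laplace(COMPARE_NOISE * self.sigma, self.rng)
        if abs(holdout_avg - train_avg) > self.noisy_threshold + eta:
            # The training estimate looks overfit: spend budget, redraw the
            # threshold noise, and release a noised holdout average.
            self.budget -= 1
            self.noisy_threshold = self._fresh_threshold()
            return holdout_avg + laplace(ANSWER_NOISE * self.sigma, self.rng)
        # The two estimates agree, so the training average is returned and
        # nothing new about the holdout set is revealed.
        return train_avg
```

In this sketch the analyst never reads the holdout data directly. A typical query would be something like `h.query(lambda x: float(x[0] > 0))`, where the record layout and the function are chosen by the analyst.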

Findings

01. The proposed algorithm effectively prevents overfitting in adaptive hypothesis testing.

02. Differential privacy-based methods can be applied to broader adaptive analysis scenarios.

03. Approximate max-information unifies different approaches to data reuse in adaptive settings.
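For readers unfamiliar with the term, approximate max-information can be written as follows. This is recalled from the paper rather than from this page, and the symbols ($X$, $Y$, $\mathcal{O}$, $\beta$, $k$, a dataset $S$, an algorithm $\mathcal{A}$) are notation assumed here, with logarithms taken base 2; treat it as a reading aid and check the paper for the precise statement.

```latex
% beta-approximate max-information between random variables X and Y.
% "X \times Y" denotes a pair drawn independently from the two marginals.
I_\infty^{\beta}(X;Y) \;=\;
  \log \sup_{\substack{\mathcal{O} \subseteq \mathcal{X}\times\mathcal{Y} \\
                       \Pr[(X,Y)\in\mathcal{O}] > \beta}}
  \frac{\Pr[(X,Y)\in\mathcal{O}] - \beta}{\Pr[X \times Y \in \mathcal{O}]}

% Consequence used for generalization: if I_\infty^{\beta}(S;\mathcal{A}(S)) \le k,
% then for every event \mathcal{O},
\Pr[(S,\mathcal{A}(S)) \in \mathcal{O}]
  \;\le\; 2^{k}\,\Pr[S \times \mathcal{A}(S) \in \mathcal{O}] + \beta
```

Read this way, a bound on max-information says that any "bad" event for the output of an analysis, such as overfitting, is at most a factor $2^{k}$ (plus $\beta$) more likely on the data actually used than on fresh, independent data.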

Abstract

Overfitting is the bane of data analysts, even when data are plentiful. Formal approaches to understanding this problem focus on statistical inference and generalization of individual analysis procedures. Yet the practice of data analysis is an inherently interactive and adaptive process: new analyses and hypotheses are proposed after seeing the results of previous ones, parameters are tuned on the basis of obtained results, and datasets are shared and reused. An investigation of this gap has recently been initiated by the authors in (Dwork et al., 2014), where we focused on the problem of estimating expectations of adaptively chosen functions. In this paper, we give a simple and practical method for reusing a holdout (or testing) set to validate the accuracy of hypotheses produced by a learning algorithm operating on a training set. Reusing a holdout set adaptively multiple times can…

Code & Models

Repositories

DIDSR/ThresholdoutAUC

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Machine Learning and Data Classification · Adversarial Robustness in Machine Learning