The Everlasting Database: Statistical Validity at a Fair Price

Blake Woodworth, Vitaly Feldman, Saharon Rosset, and Nathan Srebro

arXiv:1803.04307 · cs.LG · April 3, 2019

TL;DR

This paper introduces a mechanism that answers an arbitrarily long sequence of adaptive statistical queries with guaranteed validity. It charges a price per query and uses the proceeds to collect additional samples, addressing overfitting and invalid discoveries in data analysis.

Contribution

It presents a pricing-based approach that ensures statistical validity for adaptive queries without assumptions on how the queries are generated. Total cost grows sublinearly in the number of queries M: O(log M) for non-adaptive queries and O(√M) for adaptive ones.

Findings

01

Cost for non-adaptive queries is O(log M)

02

Cost for adaptive queries is O(√M)

03

Guarantees statistical validity without assumptions on how queries are generated

Abstract

The problem of handling adaptivity in data analysis, intentional or not, permeates a variety of fields, including test-set overfitting in ML challenges and the accumulation of invalid scientific discoveries. We propose a mechanism for answering an arbitrarily long sequence of potentially adaptive statistical queries, by charging a price for each query and using the proceeds to collect additional samples. Crucially, we guarantee statistical validity without any assumptions on how the queries are generated. We also ensure with high probability that the cost for $M$ non-adaptive queries is $O(\log M)$, while the cost to a potentially adaptive user who makes $M$ queries that do not depend on any others is $O(\sqrt{M})$.
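
To make the two growth rates concrete, the following minimal sketch compares how a logarithmic and a square-root cost bound scale with the number of queries. It is only an arithmetic illustration, not the paper's mechanism: the function names and the constant `c = 1.0` are hypothetical, since the abstract states the bounds only up to constant factors.

```python
import math


def non_adaptive_cost_bound(m: int, c: float = 1.0) -> float:
    """Illustrative c * log(M) growth; c is an assumed constant."""
    return c * math.log(m)


def adaptive_cost_bound(m: int, c: float = 1.0) -> float:
    """Illustrative c * sqrt(M) growth; c is an assumed constant."""
    return c * math.sqrt(m)


def amortized(total_cost: float, m: int) -> float:
    """Average cost per query when M queries cost total_cost overall."""
    return total_cost / m


# Print total and per-query cost for a few query counts (assumed c = 1).
for m in (10, 1_000, 100_000, 10_000_000):
    log_total = non_adaptive_cost_bound(m)
    sqrt_total = adaptive_cost_bound(m)
    print(
        f"M={m:>10,}  log-bound total={log_total:8.2f} "
        f"(per query {amortized(log_total, m):.2e})  "
        f"sqrt-bound total={sqrt_total:10.2f} "
        f"(per query {amortized(sqrt_total, m):.2e})"
    )
```

Under either bound the average cost per query shrinks as $M$ grows, roughly as $\log M / M$ in the non-adaptive case and $1/\sqrt{M}$ in the adaptive case, which is one way to read the "fair price" in the title.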

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Algorithms and Data Compression · Advanced Data Storage Technologies