The Everlasting Database: Statistical Validity at a Fair Price
Blake Woodworth, Vitaly Feldman, Saharon Rosset, and Nathan Srebro

TL;DR
This paper introduces a mechanism for answering unlimited adaptive statistical queries with guaranteed validity and controlled costs, addressing issues of overfitting and invalid discoveries in data analysis.
Contribution
It presents a novel pricing-based approach that ensures statistical validity for adaptive queries without assumptions on how the queries are generated, with a total cost that grows sublinearly in the number of queries.
Findings
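Cost for M non-adaptive queries is O(log M), with high probability
Cost for M potentially adaptive queries that do not depend on any other user's queries is O(√M), with high probability
Guarantees statistical validity without any assumptions on how the queries are generated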
Abstract
The problem of handling adaptivity in data analysis, intentional or not, permeates a variety of fields, including test-set overfitting in ML challenges and the accumulation of invalid scientific discoveries. We propose a mechanism for answering an arbitrarily long sequence of potentially adaptive statistical queries, by charging a price for each query and using the proceeds to collect additional samples. Crucially, we guarantee statistical validity without any assumptions on how the queries are generated. We also ensure with high probability that the cost for M non-adaptive queries is O(log M), while the cost to a potentially adaptive user who makes M queries that do not depend on any others is O(√M).
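To get a feel for how far apart the two cost bounds listed under Findings are, the sketch below tabulates log M against √M for a few query counts. It is only an illustration of the growth rates: the function name `cost_growth` is hypothetical, all constants and any dependence on accuracy or confidence parameters are ignored, and nothing here reproduces the paper's actual pricing mechanism.

```python
import math


def cost_growth(num_queries):
    """Return (log M, sqrt M) for M = num_queries.

    Growth rates only: constants and other parameters hidden by the
    O(.) notation are deliberately ignored (an assumption of this sketch).
    """
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return math.log(num_queries), math.sqrt(num_queries)


if __name__ == "__main__":
    for m in (10, 1_000, 100_000, 10_000_000):
        log_m, sqrt_m = cost_growth(m)
        # The ratio shows how much faster sqrt(M) grows than log(M).
        print(f"M={m:>10,}  log M={log_m:6.2f}  sqrt M={sqrt_m:9.2f}  "
              f"ratio={sqrt_m / log_m:8.1f}")
```

Both rates are sublinear in M, but the gap between them widens quickly as M grows, which is the sense in which non-adaptive use is much cheaper than potentially adaptive use.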
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Algorithms and Data Compression · Advanced Data Storage Technologies
