Data Reuse and the Long Shadow of Error: Splitting, Subsampling, and Prospectively Managing Inferential Errors
Reid Dale, Jordan Rodu, Maria E. Currie, Mike Baiocchi

TL;DR
This paper investigates subsampling methods for independent hypothesis testing on shared datasets, establishing their asymptotic properties and demonstrating their effectiveness in controlling error dependence with minimal coordination.
Contribution
It introduces a formal framework for subsampling techniques to manage dependence in multiple testing, including asymptotic normality results and optimality of data splitting.
Findings
Data overlap controls dependence in multiple tests.
Subsampling can achieve near-independent error control.
Bounded EVR decreases quadratically with the data fraction r.
Abstract
When multiple investigators analyze a common dataset, the data reuse induces dependence across testing procedures, affecting the distribution of errors. Existing techniques for managing dependent tests require either cross-study coordination or post-hoc correction. These methods do not apply to the current practice, in which uncoordinated groups of researchers independently evaluate hypotheses on a shared dataset. We investigate the use of subsampling techniques implemented at the level of individual investigators to remedy dependence with minimal coordination. To this end, we establish the asymptotic joint normality of test statistics for the class of asymptotically linear test statistics, decomposing the covariance matrix as the product of a data overlap term and a test statistic association term. This decomposition shows that controlling data overlap is sufficient to control dependence,…
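The overlap idea in the abstract can be illustrated with the simplest linear statistic, the sample mean. This is a minimal sketch under stated assumptions, not the paper's procedure: the data are i.i.d., both investigators compute a mean, and the function names `overlap_correlation` and `simulate_mean_correlation` are hypothetical. For two subsets of sizes a and b that share k observations, the correlation between the two means is k / sqrt(a * b), so it is the overlap alone that drives the dependence in this special case.

```python
import math
import random


def overlap_correlation(size_a, size_b, n_shared):
    """Correlation between two sample means computed on subsets of i.i.d.
    data, where the subsets share n_shared observations (illustrative)."""
    return n_shared / math.sqrt(size_a * size_b)


def simulate_mean_correlation(n, idx_a, idx_b, reps=4000, seed=0):
    """Monte Carlo estimate of the correlation between the two subset means."""
    rng = random.Random(seed)
    means_a, means_b = [], []
    for _ in range(reps):
        # Fresh shared dataset on every replication.
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means_a.append(sum(data[i] for i in idx_a) / len(idx_a))
        means_b.append(sum(data[i] for i in idx_b) / len(idx_b))
    avg_a = sum(means_a) / reps
    avg_b = sum(means_b) / reps
    cov = sum((a - avg_a) * (b - avg_b) for a, b in zip(means_a, means_b)) / reps
    var_a = sum((a - avg_a) ** 2 for a in means_a) / reps
    var_b = sum((b - avg_b) ** 2 for b in means_b) / reps
    return cov / math.sqrt(var_a * var_b)


n = 200
idx_a = range(0, 100)    # investigator A uses observations 0..99
idx_b = range(50, 150)   # investigator B uses observations 50..149 (50 shared)

predicted = overlap_correlation(len(idx_a), len(idx_b), 50)
empirical = simulate_mean_correlation(n, idx_a, idx_b)
print(f"predicted correlation: {predicted:.3f}")
print(f"simulated correlation: {empirical:.3f}")

# Disjoint subsets (data splitting) give zero overlap and zero correlation.
print(f"disjoint split: {overlap_correlation(100, 100, 0):.3f}")
```

In this toy setting, if each investigator independently draws a fraction r of the n observations, the expected overlap is about r²n while each subset has rn observations, so the correlation between the two means is roughly r. Disjoint subsets remove the dependence entirely, at the cost of coordinating who uses which observations.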
