Data Reuse and the Long Shadow of Error: Splitting, Subsampling, and Prospectively Managing Inferential Errors

Reid Dale; Jordan Rodu; Maria E. Currie; Mike Baiocchi

arXiv:2604.07580 · math.ST · April 10, 2026

TL;DR

This paper investigates subsampling methods for hypothesis tests that uncoordinated investigators run on a shared dataset. It establishes the asymptotic properties of these methods and shows that they can control dependence among errors with minimal coordination.

Contribution

The paper introduces a formal framework for using subsampling to manage dependence across multiple tests, including asymptotic joint normality results for asymptotically linear test statistics and an optimality result for data splitting.

Findings

01  Data overlap controls dependence in multiple tests.

02  Subsampling can achieve near-independent error control.

03  Bounded EVR decreases quadratically with the data fraction r.

Abstract

When multiple investigators analyze a common dataset, the data reuse induces dependence across testing procedures, affecting the distribution of errors. Existing techniques of managing dependent tests require either cross-study coordination or post-hoc correction. These methods do not apply to the current practice of uncoordinated groups of researchers independently evaluating hypotheses on a shared dataset. We investigate the use of subsampling techniques implemented at the level of individual investigators to remedy dependence with minimal coordination. To this end, we establish the asymptotic joint normality of test statistics for the class of asymptotically linear test statistics, decomposing the covariance matrix as the product of a data overlap term and a test statistic association term. This decomposition shows that controlling data overlap is sufficient to control dependence,…
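
As a rough illustration of the overlap idea in the abstract, the sketch below uses the simplest asymptotically linear statistic, a sample mean. For two investigators who compute the mean of the same i.i.d. variable on index sets A and B, the correlation between their statistics is |A ∩ B| / sqrt(|A| |B|), so disjoint splits give zero correlation. The function names, the Gaussian data, and the sizes n, r, and reps are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def overlap_factor(idx_a, idx_b):
    """Return |A ∩ B| / sqrt(|A| |B|) for two index sets.

    For sample means of the same i.i.d. variable this equals the
    correlation between the two investigators' statistics.
    """
    shared = np.intersect1d(idx_a, idx_b).size
    return shared / np.sqrt(len(idx_a) * len(idx_b))

def simulate_corr(n, idx_a, idx_b, reps, rng):
    """Empirical correlation of the two sample means over `reps` fresh datasets."""
    data = rng.standard_normal((reps, n))      # toy data: standard normal (assumption)
    mean_a = data[:, idx_a].mean(axis=1)       # investigator A's statistic
    mean_b = data[:, idx_b].mean(axis=1)       # investigator B's statistic
    return np.corrcoef(mean_a, mean_b)[0, 1]

rng = np.random.default_rng(0)
n, r = 200, 0.5                                # hypothetical dataset size and data fraction
m = int(r * n)

# Uncoordinated subsampling: each investigator draws their own random subset.
idx_a = rng.choice(n, size=m, replace=False)
idx_b = rng.choice(n, size=m, replace=False)
predicted = overlap_factor(idx_a, idx_b)
observed = simulate_corr(n, idx_a, idx_b, reps=20000, rng=rng)

# Coordinated data splitting: disjoint subsets, hence zero overlap.
perm = rng.permutation(n)
split_a, split_b = perm[:m], perm[m:2 * m]
split_factor = overlap_factor(split_a, split_b)

print(f"predicted correlation from overlap: {predicted:.3f}")
print(f"simulated correlation:              {observed:.3f}")
print(f"overlap factor under a disjoint split: {split_factor:.3f}")
```

In this toy setting both investigators compute the same statistic, so the association term is 1 and the overlap factor alone determines the correlation. With different statistics, the abstract's decomposition says the overlap term would be multiplied by a separate test statistic association term.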
