Data Gluttony: Epistemic Risks, Dependent Testing and Data Reuse in Large Datasets

Reid Dale; Jordan Rodu; Maria E. Currie; Mike Baiocchi

arXiv:2508.16552·math.ST·August 25, 2025

TL;DR

This paper examines the risks of data reuse in large datasets. It shows that positively dependent testing makes the distribution of inferential errors riskier, particularly in its tails, and proposes two remedies: data temperance and research portfolio optimization.

Contribution

The paper uses the theory of risk orderings to analyze how positively dependent testing affects inferential error in large datasets, and proposes research portfolio optimization and data temperance as remedies.

Findings

01

Positively dependent testing procedures lead to strictly riskier distributions of inferential error.

02

Data temperance reduces dependence and improves inference reliability.

03

Portfolio optimization allocates data efficiently across tasks.

Abstract

Large-scale registries have collected vast amounts of data, which has enabled investigators to efficiently conduct studies of observational data. Common practice is for investigators to use all data meeting the inclusion criteria of their study to perform their analysis. We term this common practice data gluttony. It has apparent formal justification insofar as it maximizes per-study power. But this comes at a cost: data reuse affects the shape of the tail distribution of inferential errors. Using the theory of risk orderings, we demonstrate how positively dependent testing procedures result in strictly riskier distributions of inferential error. We identify two remedies to this state of affairs: research portfolio optimization and what we term data temperance. Research portfolio optimization requires that we formulate the enterprise of inference in a utility theoretic…
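
The abstract's claim about tail behavior can be illustrated with a small simulation. The sketch below is not the paper's method. It assumes a simple equicorrelated Gaussian model in which every test statistic shares one common component, standing in for reused data. The function names `false_positive_counts` and `summarize`, along with the parameters `m`, `rho`, `alpha`, `reps`, and `tail_at`, are hypothetical and chosen only for this illustration. All tests are run under the global null, so every rejection is a false positive.

```python
import random
from statistics import NormalDist, mean, pvariance


def false_positive_counts(m, rho, alpha=0.05, reps=20000, seed=0):
    """Count false rejections among m one-sided z-tests under the global null.

    Assumption (not from the paper): test statistics are equicorrelated
    standard normals with pairwise correlation rho, built from one shared
    component plus an independent component per test.
    """
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    a, b = rho ** 0.5, (1 - rho) ** 0.5          # weights keep unit variance
    counts = []
    for _ in range(reps):
        shared = rng.gauss(0, 1)  # component common to all tests (reused data)
        counts.append(
            sum(a * shared + b * rng.gauss(0, 1) > threshold for _ in range(m))
        )
    return counts


def summarize(counts, tail_at):
    """Mean, variance, and estimated P(count >= tail_at) of simulated counts."""
    return {
        "mean": mean(counts),
        "variance": pvariance(counts),
        "tail": sum(c >= tail_at for c in counts) / len(counts),
    }


# rho=0 mimics studies on disjoint data; rho=0.5 mimics heavy data reuse.
independent = summarize(false_positive_counts(m=20, rho=0.0), tail_at=5)
dependent = summarize(false_positive_counts(m=20, rho=0.5), tail_at=5)

print("independent:", independent)
print("dependent:  ", dependent)
```

In this toy model each test keeps the same marginal false positive rate `alpha`, so the expected number of false positives is `m * alpha` in both settings. What changes under positive dependence is the spread: the variance and the probability of many simultaneous false positives are larger, which is the sense in which the error distribution becomes riskier without its mean changing.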
