Data Gluttony: Epistemic Risks, Dependent Testing and Data Reuse in Large Datasets
Reid Dale, Jordan Rodu, Maria E. Currie, Mike Baiocchi

TL;DR
This paper examines the risks of data reuse in large datasets, showing how dependent testing increases inferential errors and proposing strategies like data temperance and portfolio optimization to mitigate these risks.
Contribution
It introduces a formal analysis of dependent testing risks in large datasets and proposes practical strategies for data management to reduce inferential errors.
Findings
Positively dependent testing procedures produce strictly riskier distributions of inferential error.
Data temperance reduces dependence and improves inference reliability.
Portfolio optimization allocates data efficiently across tasks.
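The first finding can be illustrated with a small simulation. The sketch below is not the paper's analysis. It assumes an equicorrelated Gaussian model in which every test statistic shares a common component (a stand-in for reused data), and the names `false_positive_counts` and `tail_prob`, along with all parameter values, are hypothetical choices made for illustration. Under this assumption every test keeps the same marginal level, so the expected number of false positives is unchanged, while the spread and the upper tail of the count grow with the correlation.

```python
import random
from statistics import NormalDist, mean, pvariance


def false_positive_counts(n_tests, rho, alpha, n_sims, seed):
    """Simulate the number of false rejections among n_tests true nulls.

    Assumption (not from the paper): each one-sided z-statistic is
    sqrt(rho) * shared + sqrt(1 - rho) * own noise, so any two statistics
    have correlation rho and each is marginally standard normal.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    shared_weight = rho ** 0.5
    own_weight = (1 - rho) ** 0.5
    counts = []
    for _ in range(n_sims):
        shared = rng.gauss(0, 1)  # component common to all tests in this run
        rejections = 0
        for _ in range(n_tests):
            z = shared_weight * shared + own_weight * rng.gauss(0, 1)
            if z > z_crit:
                rejections += 1
        counts.append(rejections)
    return counts


def tail_prob(counts, k):
    """Empirical probability of observing at least k false positives."""
    return sum(c >= k for c in counts) / len(counts)


# rho = 0 gives independent tests; rho = 0.5 gives positively dependent tests.
independent = false_positive_counts(20, 0.0, 0.05, 20000, seed=1)
dependent = false_positive_counts(20, 0.5, 0.05, 20000, seed=2)

for label, counts in (("independent", independent), ("dependent", dependent)):
    print(label,
          "mean:", round(mean(counts), 3),
          "variance:", round(pvariance(counts), 3),
          "P(at least 5 false positives):", round(tail_prob(counts, 5), 4))
```

Both settings should report a mean close to `n_tests * alpha`, while the dependent setting shows a larger variance and a heavier upper tail. That pattern (equal means, more mass in the extremes) is the sense of "riskier" that this sketch is meant to convey.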
Abstract
Large-scale registries have collected vast amounts of data, which has enabled investigators to conduct observational studies efficiently. Common practice is for investigators to use all data meeting their study's inclusion criteria in their analysis. We term this common practice data gluttony. It has apparent formal justification insofar as it maximizes per-study power. But this comes at a cost: data reuse affects the shape of the tail distribution of inferential errors. Using the theory of risk orderings, we demonstrate how positively dependent testing procedures result in strictly riskier distributions of inferential error. We identify two remedies to this state of affairs: research portfolio optimization and what we term data temperance. Research portfolio optimization requires that we formulate the enterprise of inference in a utility theoretic…
