Do We Really Even Need Data? A Modern Look at Drawing Inference with Predicted Data
Stephen Salerno, Kentaro Hoffman, Awan Afiaz, Anna Neufeld, Tyler H. McCormick, Jeffrey T. Leek

TL;DR
This paper examines the statistical challenges of using predicted data as a substitute for missing or unobserved data in scientific inference, highlighting issues of bias and variance that can lead to invalid conclusions.
Contribution
It characterizes the core statistical problems in inference with predicted data, clarifies that high predictive accuracy doesn't ensure valid inference, and reviews recent methods rooted in classical theory.
Findings
Predicted data can introduce bias in estimands.
Ignoring prediction uncertainty understates variance, which makes downstream inference overconfident.
High prediction accuracy does not guarantee valid inference.
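As a rough illustration of the first and third findings, here is a minimal simulation sketch. It is not taken from the paper: the function names (`ols_slope`, `simulate_ipd_bias`), the data-generating model, and all parameter values are assumptions chosen for illustration. The hypothetical pre-trained predictor sees only a strong covariate `w` and ignores the covariate of interest `x`, so its predictions track the true outcome closely while carrying almost no information about the association with `x`.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a simple linear regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def simulate_ipd_bias(n=20000, beta_x=1.0, beta_w=3.0, noise_sd=0.5, seed=0):
    """Compare inference on the true outcome with inference on a predicted one.

    Assumed (hypothetical) model: y = beta_x * x + beta_w * w + noise,
    with x and w independent standard normals.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)  # covariate whose association with y we care about
    w = rng.normal(size=n)  # strong predictor available to the prediction model
    y = beta_x * x + beta_w * w + rng.normal(scale=noise_sd, size=n)

    # Hypothetical pre-trained predictor: uses w only, never sees x.
    y_hat = beta_w * w

    # Predictive accuracy of y_hat for y, measured as R^2.
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

    return {
        "r2": r2,
        "slope_true_outcome": ols_slope(x, y),
        "slope_predicted_outcome": ols_slope(x, y_hat),
    }


if __name__ == "__main__":
    for name, value in simulate_ipd_bias().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the predictions explain most of the variance in the true outcome (about 88% in expectation), yet regressing the predicted outcome on `x` gives a slope near zero rather than the true value of 1. The sketch covers only the bias issue. It does not address the variance issue or any of the correction methods the paper reviews.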
Abstract
As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g., rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as substitutes for missing or unobserved data. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to drawing inference with predicted data (IPD) and show that high predictive accuracy does not guarantee valid downstream inference. We show that all such failures reduce to statistical notions of (i) bias, when predictions systematically shift the estimand or distort…
Taxonomy
Topics: Computational and Text Analysis Methods · Data Analysis with R · Explainable Artificial Intelligence (XAI)
