Inference for Regression with Variables Generated by AI or Machine Learning
Laura Battaglia, Timothy Christensen, Stephen Hansen, Szymon Sacher

TL;DR
This paper demonstrates that using AI- or machine-learning-generated variables directly as regression covariates leads to biased estimates and invalid inference, and it proposes two methods that restore valid inference.
Contribution
It introduces two novel methods for valid inference when using AI/ML-generated variables in regression models, addressing a key challenge in modern econometrics.
Findings
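An explicit bias correction, paired with bias-corrected confidence intervals, restores valid inference
Joint estimation of the regression parameters and latent variables avoids the bias of the naive plug-in approach
Applications cover label imputation, dimensionality reduction, and index construction via classification and aggregation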
Abstract
Researchers now routinely use AI or other machine learning methods to estimate latent variables of economic interest, then plug in the estimates as covariates in a regression. We show both theoretically and empirically that naively treating AI/ML-generated variables as "data" leads to biased estimates and invalid inference. To restore valid inference, we propose two methods: (1) an explicit bias correction with bias-corrected confidence intervals, and (2) joint estimation of the regression parameters and latent variables. We illustrate these ideas through applications involving label imputation, dimensionality reduction, and index construction via classification and aggregation.
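One way to see why a plug-in covariate can bias a regression is a small simulation. The sketch below is not the paper's estimator: it assumes the simplest textbook case, where the generated variable equals the latent variable plus independent additive noise with known variance, and it applies the classical errors-in-variables rescaling as a stand-in for a bias correction. The function name `simulate_plug_in` and all parameter values are hypothetical.

```python
import numpy as np


def simulate_plug_in(n=20000, beta=2.0, noise_sd=0.5, seed=0):
    """Compare a naive plug-in slope with a textbook measurement-error correction.

    Assumption (not from the paper): the generated variable is the latent
    variable plus independent Gaussian noise whose variance is known.
    """
    rng = np.random.default_rng(seed)

    theta = rng.normal(size=n)                          # latent variable, unobserved in practice
    theta_hat = theta + noise_sd * rng.normal(size=n)   # stand-in for an AI/ML-generated estimate
    y = beta * theta + rng.normal(size=n)               # outcome depends on the true latent variable

    # Naive plug-in: regress y on theta_hat as if it were data.
    cov = np.cov(theta_hat, y)
    naive_slope = cov[0, 1] / cov[0, 0]

    # Classical correction: divide by the estimated reliability ratio
    # var(theta) / var(theta_hat), using the assumed-known noise variance.
    reliability = (cov[0, 0] - noise_sd**2) / cov[0, 0]
    corrected_slope = naive_slope / reliability

    return naive_slope, corrected_slope


naive, corrected = simulate_plug_in()
print(f"naive plug-in slope: {naive:.3f}")
print(f"corrected slope:     {corrected:.3f}")
```

Under these assumptions the naive slope converges to beta / (1 + noise_sd^2), which is 1.6 for the defaults, while the rescaled slope recovers a value close to the true beta of 2. Real AI/ML-generated variables need not have additive, independent, or known-variance errors, which is part of why the paper develops its own corrections.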
Taxonomy
Topics: Advanced Statistical Methods and Models · Fault Detection and Control Systems · Statistical Methods and Inference
