Inference for Regression with Variables Generated by AI or Machine Learning

Laura Battaglia; Timothy Christensen; Stephen Hansen; Szymon Sacher

arXiv:2402.15585 · econ.EM · May 1, 2025 · 2 cites

Open Access

TL;DR

This paper demonstrates that using AI or machine learning generated variables directly in regression can cause bias and invalid inference, and proposes methods to correct this issue for reliable statistical analysis.

Contribution

It introduces two novel methods for valid inference when using AI/ML-generated variables in regression models, addressing a key challenge in modern econometrics.

Findings

01

Bias correction with confidence intervals improves inference accuracy

02

Joint estimation methods outperform naive plug-in approaches

03

Applications show practical effectiveness in label imputation and dimensionality reduction

Abstract

Researchers now routinely use AI or other machine learning methods to estimate latent variables of economic interest, then plug in the estimates as covariates in a regression. We show both theoretically and empirically that naively treating AI/ML-generated variables as "data" leads to biased estimates and invalid inference. To restore valid inference, we propose two methods: (1) an explicit bias correction with bias-corrected confidence intervals, and (2) joint estimation of the regression parameters and latent variables. We illustrate these ideas through applications involving label imputation, dimensionality reduction, and index construction via classification and aggregation.
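
To make the plug-in problem concrete, the following is a minimal simulation sketch, not the paper's estimator. It assumes a textbook errors-in-variables setup in which the generated variable equals the true latent variable plus independent Gaussian noise of known standard deviation. The names `simulate`, `ols_slope`, and `noise_sd` are hypothetical, and the rescaling step is the classical reliability-ratio correction, swapped in here only to illustrate why some correction is needed.

```python
import numpy as np


def simulate(n=20000, beta=1.0, noise_sd=0.5, seed=0):
    """Draw a latent regressor, a noisy 'generated' version of it, and an outcome."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=n)                               # true latent variable
    theta_hat = theta + rng.normal(scale=noise_sd, size=n)   # stand-in for an AI/ML estimate
    y = beta * theta + rng.normal(size=n)                    # outcome depends on the true variable
    return theta, theta_hat, y


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    x_centered = x - x.mean()
    return float(x_centered @ (y - y.mean()) / (x_centered @ x_centered))


noise_sd = 0.5
theta, theta_hat, y = simulate(noise_sd=noise_sd)

beta_oracle = ols_slope(theta, y)      # infeasible: uses the true latent variable
beta_naive = ols_slope(theta_hat, y)   # plug-in: treats the generated variable as data

# Classical correction, assuming the noise variance is known (an assumption
# made for this sketch): divide by the estimated reliability ratio.
reliability = 1.0 - noise_sd**2 / theta_hat.var()
beta_corrected = beta_naive / reliability

print(f"oracle={beta_oracle:.3f} naive={beta_naive:.3f} corrected={beta_corrected:.3f}")
```

Under these assumptions the naive slope is pulled toward zero by roughly the factor 1 / (1 + noise_sd**2), while the rescaled slope lands close to the oracle value. Real AI/ML-generated variables need not have additive, independent, or known-variance errors, which is the setting the paper's two methods are designed to handle.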

Taxonomy

Topics: Advanced Statistical Methods and Models · Fault Detection and Control Systems · Statistical Methods and Inference