Bootstrapping with AI/ML-generated labels
Timothy Christensen, Silvia Goncalves, Benoit Perron

TL;DR
This paper asks whether the bootstrap can correct the bias that AI/ML-generated binary labels induce in OLS regressions, and proposes a coupled-label bootstrap that delivers valid inference.
Contribution
It introduces a coupled-label bootstrap that jointly resamples the true and imputed labels and remains valid without a strong independence condition between the latent true labels and the other covariates.
Findings
The fixed-label bootstrap is generally invalid unless a strong independence condition holds.
The coupled-label bootstrap is valid without the independence condition.
Two finite-sample adjustments further improve coverage.
Abstract
AI/ML methods are increasingly used in economics to generate binary variables (or labels) via classification algorithms. When these generated variables are included as covariates in regressions, even small misclassification errors can induce large biases in OLS estimators and invalidate standard inference. We study whether the bootstrap can correct this bias and deliver valid inference. We first show that a seemingly natural fixed-label bootstrap, which generates data using estimated labels but relies on a corrupted version in estimation, is generally invalid unless a strong independence condition between the latent true labels and other covariates holds. We then propose a coupled-label bootstrap that jointly resamples the true and imputed labels, and show it is valid without this condition. Two finite-sample adjustments further improve coverage: a variance correction for uncertainty in…
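The claim that small misclassification errors can induce large OLS biases can be illustrated with a minimal simulation sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the data-generating process, the 5% symmetric flip rate, the coefficient values, and the helper names `ols` and `simulate`. The true label is deliberately correlated with the other covariate, so the independence condition mentioned in the abstract does not hold.

```python
import numpy as np


def ols(y, regressors):
    """Return OLS coefficients of y on an intercept plus the given regressors."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def simulate(n=200_000, beta=1.0, flip_prob=0.05, seed=0):
    """Compare the OLS slope on the true label with the slope on a noisy label.

    Hypothetical design (an assumption, not the paper's):
      x      : continuous covariate
      d_true : latent binary label, correlated with x (roughly 9% ones)
      d_hat  : imputed label, equal to d_true flipped with probability flip_prob
      y      : outcome generated from the TRUE label
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    d_true = (0.5 * x + rng.normal(size=n) > 1.5).astype(float)
    flips = rng.random(n) < flip_prob
    d_hat = np.where(flips, 1.0 - d_true, d_true)
    y = 0.5 + beta * d_true + 0.3 * x + rng.normal(size=n)

    b_true = ols(y, [d_true, x])[1]  # slope using the (infeasible) true label
    b_hat = ols(y, [d_hat, x])[1]    # slope using the misclassified label
    return b_true, b_hat


b_true, b_hat = simulate()
print(f"slope with true labels:    {b_true:.3f}")
print(f"slope with imputed labels: {b_hat:.3f}")
```

In this design the label is rare, so even a 5% flip rate means a sizeable share of the observations labelled 1 are actually 0, and the slope on the imputed label is pulled well below the true value of 1. The sketch only shows the bias; it does not implement the fixed-label or coupled-label bootstrap.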
