Econometric Inference with Machine-Learned Proxies: Partial Identification via Data Combination
Lixiong Li

TL;DR
This paper introduces an econometric framework for inference with machine-learned proxies. It combines two datasets and uses optimal transport to address the bias of plug-in estimation, without requiring the ML procedure to be consistent or to have a known convergence rate.
Contribution
It develops a partial identification approach that combines two datasets via optimal transport, avoiding the need for ML consistency or for a complete validation sample containing all variables used in the downstream analysis.
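To make the idea of data combination concrete, here is a minimal sketch. It is not the paper's procedure. It assumes a discrete proxy, a scalar downstream covariate X, a scalar target Y, and that both samples share the same conditional distributions given the proxy. Under those assumptions it bounds the single moment E[X·Y] when X and Y are never observed together. Within each proxy cell, the one-dimensional optimal transport problem with product cost is solved by the sorted (comonotone) and reverse-sorted (countermonotone) couplings, which give the upper and lower bounds. The name `moment_bounds`, its parameters, and the simulated design are hypothetical, and the sketch ignores sampling uncertainty.

```python
import numpy as np


def moment_bounds(x_down, p_down, y_val, p_val, grid_size=1000):
    """Bounds on E[X * Y] when X and Y are observed in separate samples.

    x_down, p_down: covariate and proxy from the downstream sample.
    y_val, p_val:   target and proxy from the validation sample.
    Hypothetical helper for illustration only.
    """
    x_down, p_down = np.asarray(x_down, float), np.asarray(p_down)
    y_val, p_val = np.asarray(y_val, float), np.asarray(p_val)

    # Symmetric grid on (0, 1): reversing it maps u to 1 - u.
    u = (np.arange(grid_size) + 0.5) / grid_size
    lower = upper = 0.0
    for p in np.unique(p_down):
        x_cell = x_down[p_down == p]
        y_cell = y_val[p_val == p]
        if y_cell.size == 0:
            raise ValueError(f"proxy value {p!r} missing from validation sample")
        weight = x_cell.size / x_down.size  # estimate of P(proxy = p)
        qx = np.quantile(x_cell, u)
        qy = np.quantile(y_cell, u)
        upper += weight * np.mean(qx * qy)        # comonotone coupling
        lower += weight * np.mean(qx * qy[::-1])  # countermonotone coupling
    return lower, upper


if __name__ == "__main__":
    # Simulated design (an assumption): binary target, proxy wrong 20% of the time.
    rng = np.random.default_rng(0)
    n = 20_000
    y = rng.integers(0, 2, size=n).astype(float)
    p = np.where(rng.random(n) < 0.2, 1 - y, y)
    x = y + rng.normal(size=n)
    half = n // 2
    print(moment_bounds(x[:half], p[:half], y[half:], p[half:]))
```

In this design the population value of E[X·Y] is 0.5, and the interval returned by the sketch should contain it up to sampling error. The interval narrows as the proxy becomes more informative about the target, because the conditional distributions within each proxy cell become more concentrated.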
Findings
Reliable size control demonstrated in Monte Carlo simulations
Provides informative confidence sets across various predictive scenarios
Does not require restrictive assumptions on ML procedures
Abstract
Empirical researchers increasingly use upstream machine-learning (ML) methods to construct proxies for latent target variables from complex, unstructured data. A naive plug-in use of such proxies in downstream econometric models, however, can lead to biased estimation and invalid inference. This paper develops a framework for partial identification and inference in general moment models with ML-generated proxies. Our approach does not require restrictive assumptions on the upstream ML procedure, such as consistency or known convergence rates, nor does it require a complete validation sample containing all variables used in the downstream analysis. Instead, we assume access to two datasets: a downstream sample containing observed covariates and the proxy, and an auxiliary validation sample containing joint observations on the proxy and its target variable. We treat the proxy as a linking…
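The plug-in bias mentioned in the abstract can be seen in a small simulation. This is a sketch under an assumed design, not one taken from the paper. The proxy equals the target plus independent noise (classical measurement error), which errors from real ML proxies need not follow. The function names `ols_slope` and `simulate_plugin_bias` and all parameter values are hypothetical.

```python
import numpy as np


def ols_slope(regressor, outcome):
    """Slope from an OLS regression of outcome on a constant and regressor."""
    design = np.column_stack([np.ones_like(regressor), regressor])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef[1]


def simulate_plugin_bias(n=50_000, beta=2.0, noise_sd=1.0, seed=0):
    """Return (oracle slope using the target, plug-in slope using the proxy)."""
    rng = np.random.default_rng(seed)
    target = rng.normal(size=n)                           # latent variable
    proxy = target + rng.normal(scale=noise_sd, size=n)   # assumed classical error
    outcome = beta * target + rng.normal(size=n)
    return ols_slope(target, outcome), ols_slope(proxy, outcome)


if __name__ == "__main__":
    oracle, plugin = simulate_plugin_bias()
    print(f"oracle slope: {oracle:.3f}, plug-in slope: {plugin:.3f}")
```

With unit-variance target and unit-variance noise, the plug-in slope converges to beta / 2 rather than beta, so standard confidence intervals built around it are centered on the wrong value.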
