Linear Regressions with Combined Data
Xavier D'Haultfoeuille, Christophe Gaillac, Arnaud Maurel

TL;DR
This paper shows how to partially identify linear regression coefficients when the outcome and some of the covariates are observed in separate datasets that cannot be matched. It derives sharp bounds and estimators of those bounds, with applications to racial disparities in patent approval rates and to the effect of patience and risk-taking on educational performance.
Contribution
It introduces an approach to partial identification in linear regressions with separate datasets that dispenses with the exclusion restrictions traditional approaches rely on, often implicitly, and it provides computationally simple estimators of the resulting bounds.
Findings
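Derives sharp bounds on regression coefficients without exclusion restrictions; the bounds take a simple form.
Shows that the bounds tighten when variables observed in both datasets, but not included in the regression of interest, are available.
Develops computationally simple, asymptotically normal estimators of the bounds.
Applies the methodology to racial disparities in patent approval rates and to the effect of patience and risk-taking on educational performance.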
Abstract
We study linear regressions in a context where the outcome of interest and some of the covariates are observed in two different datasets that cannot be matched. Traditional approaches obtain point identification by relying, often implicitly, on exclusion restrictions. We show that without such restrictions, coefficients of interest can still be partially identified, with the sharp bounds taking a simple form. We obtain tighter bounds when variables observed in both datasets, but not included in the regression of interest, are available, even if these variables are not subject to specific restrictions. We develop computationally simple and asymptotically normal estimators of the bounds. Finally, we apply our methodology to estimate racial disparities in patent approval rates and to evaluate the effect of patience and risk-taking on educational performance.
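To make the idea of partial identification concrete, here is a minimal sketch of the simplest textbook case: a scalar covariate X and an outcome Y observed in two unmatched samples, with no common variables. It is not the paper's estimator. It relies only on the classical fact that, given the two marginal distributions, Cov(X, Y) is largest under the comonotone coupling and smallest under the countermonotone one, so the slope Cov(X, Y) / Var(X) lies in an interval. The function name `slope_bounds` and the `grid_size` parameter are hypothetical, chosen for illustration.

```python
import numpy as np

def slope_bounds(x, y, grid_size=10_000):
    """Illustrative bounds on Cov(X, Y) / Var(X) from unmatched samples.

    Assumption: scalar X, no variables common to both datasets.
    This is a classical coupling bound, not the estimator from the paper.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Symmetric midpoint grid on (0, 1): reversing it maps u to 1 - u.
    u = (np.arange(grid_size) + 0.5) / grid_size
    qx = np.quantile(x, u)  # empirical quantile function of X
    qy = np.quantile(y, u)  # empirical quantile function of Y

    mean_prod = qx.mean() * qy.mean()
    var_x = np.mean(qx ** 2) - qx.mean() ** 2

    # Comonotone coupling pairs equal quantiles: largest covariance.
    upper_cov = np.mean(qx * qy) - mean_prod
    # Countermonotone coupling pairs u with 1 - u: smallest covariance.
    lower_cov = np.mean(qx * qy[::-1]) - mean_prod

    return lower_cov / var_x, upper_cov / var_x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_full = rng.normal(size=5_000)
    y_full = 2.0 * x_full + rng.normal(size=5_000)
    # Shuffling y destroys the link between the two samples.
    lo, hi = slope_bounds(x_full, rng.permutation(y_full))
    print(f"slope is only known to lie in [{lo:.2f}, {hi:.2f}]")
```

In this simulated example the slope used to generate the data falls inside the interval, but the interval is wide. That width is what additional information, such as the variables observed in both datasets mentioned in the abstract, is meant to reduce.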
Taxonomy
Topics: Advanced Statistical Methods and Models
