Controlling for Unobserved Confounding with Large Language Model Classification of Patient Smoking Status
Samuel Lee, Zach Wood-Doughty

TL;DR
This paper leverages large language models trained on clinical notes to predict unobserved confounders like smoking status, enabling more accurate causal inference in observational medical data.
Contribution
It extends existing methods by using LLMs for confounder prediction and applies measurement error correction to estimate causal effects from real clinical data.
Findings
Predicted smoking status improves confounder adjustment.
Measurement error correction can recover unbiased causal estimates when the classifier can be trained and the necessary assumptions are met.
Application to the MIMIC dataset demonstrates practical utility.
Abstract
Causal understanding is a fundamental goal of evidence-based medicine. When randomization is impossible, causal inference methods allow the estimation of treatment effects from retrospective analysis of observational data. However, such analyses rely on a number of assumptions, often including that of no unobserved confounding. In many practical settings, this assumption is violated when important variables are not explicitly measured in the clinical record. Prior work has proposed to address unobserved confounding with machine learning by imputing unobserved variables and then correcting for the classifier's mismeasurement. When such a classifier can be trained and the necessary assumptions are met, this method can recover an unbiased estimate of a causal effect. However, such work has been limited to synthetic data, simple classifiers, and binary variables. This paper extends this…
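The impute-then-correct idea described in the abstract can be illustrated with a small sketch. It is not the paper's estimator: it assumes a binary confounder U (e.g. smoking status), a classifier prediction U* whose errors depend only on U (nondifferential error), and a sensitivity and specificity that are known exactly, whereas in practice they would be estimated from labeled validation data. The correction shown is a standard 2x2 matrix inversion, and all function names and probabilities below are hypothetical. The code works with exact probabilities rather than samples so the effect of the correction is visible without sampling noise.

```python
from itertools import product

def true_joint(p_u, p_a_given_u, p_y_given_au):
    """Exact joint P(U=u, A=a, Y=y) for binary U, A, Y."""
    joint = {}
    for u, a, y in product((0, 1), repeat=3):
        pu = p_u if u == 1 else 1 - p_u
        pa = p_a_given_u[u] if a == 1 else 1 - p_a_given_u[u]
        py = p_y_given_au[(a, u)] if y == 1 else 1 - p_y_given_au[(a, u)]
        joint[(u, a, y)] = pu * pa * py
    return joint

def proxy_joint(joint, sens, spec):
    """P(U*=s, A, Y), assuming classifier errors depend only on U."""
    # err[(s, u)] = P(U*=s | U=u)
    err = {(1, 1): sens, (0, 1): 1 - sens, (0, 0): spec, (1, 0): 1 - spec}
    return {
        (s, a, y): sum(err[(s, u)] * joint[(u, a, y)] for u in (0, 1))
        for s, a, y in product((0, 1), repeat=3)
    }

def correct_joint(pj, sens, spec):
    """Invert the 2x2 error matrix to recover P(U, A, Y) from P(U*, A, Y)."""
    det = sens + spec - 1  # zero means the classifier is uninformative
    if abs(det) < 1e-12:
        raise ValueError("error matrix is not invertible")
    out = {}
    for a, y in product((0, 1), repeat=2):
        q0, q1 = pj[(0, a, y)], pj[(1, a, y)]
        out[(0, a, y)] = (sens * q0 - (1 - sens) * q1) / det
        out[(1, a, y)] = (spec * q1 - (1 - spec) * q0) / det
    return out

def backdoor_ate(joint):
    """Backdoor-adjusted ATE, adjusting for the first index of the joint."""
    ate = 0.0
    for c in (0, 1):
        p_c = sum(joint[(c, a, y)] for a in (0, 1) for y in (0, 1))
        p_y1_a1 = joint[(c, 1, 1)] / (joint[(c, 1, 0)] + joint[(c, 1, 1)])
        p_y1_a0 = joint[(c, 0, 1)] / (joint[(c, 0, 0)] + joint[(c, 0, 1)])
        ate += (p_y1_a1 - p_y1_a0) * p_c
    return ate

# Hypothetical data-generating process: treatment raises P(Y=1) by 0.1
# in both strata of U, so the true ATE is 0.1 by construction.
joint = true_joint(
    p_u=0.3,
    p_a_given_u={0: 0.2, 1: 0.7},
    p_y_given_au={(0, 0): 0.1, (1, 0): 0.2, (0, 1): 0.4, (1, 1): 0.5},
)
SENS, SPEC = 0.85, 0.90  # assumed classifier error rates

observed = proxy_joint(joint, SENS, SPEC)
naive_ate = backdoor_ate(observed)  # adjusts for U* as if it were U
corrected_ate = backdoor_ate(correct_joint(observed, SENS, SPEC))

print(f"naive ATE (adjusting for U*): {naive_ate:.4f}")
print(f"corrected ATE:                {corrected_ate:.4f}")
```

In this toy setup, adjusting for the predicted label alone leaves residual confounding, while inverting the error matrix recovers the true effect. That recovery depends entirely on the stated assumptions: if the error rates are misestimated or the classifier's errors depend on treatment or outcome, the corrected estimate will also be biased.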
Taxonomy
Topics: Machine Learning in Healthcare
Methods: Causal inference
