Do LLMs Act as Repositories of Causal Knowledge?
Nick Huntington-Klein, Eleanor J. Murray

TL;DR
This paper tests whether large language models (LLMs) can identify causal confounders and finds their performance mediocre and inconsistent, which points to current limits on automating causal inference.
Contribution
It provides an empirical evaluation of LLMs' ability to recognize confounders in the Coronary Drug Project (CDP), a real-world medical dataset with expert-selected confounders that serve as ground truth, and highlights their current shortcomings.
Findings
LLMs show mediocre confounder identification performance.
LLMs label expert-identified confounders as confounders only slightly more often than variables that experts consider non-confounders.
LLM judgments are highly inconsistent across models and prompts.
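
The comparisons behind these findings can be illustrated with a minimal sketch. It is not the authors' code: the variable names, the toy labels, and the helper functions `labeling_rates` and `agreement` are all hypothetical, and the numbers are invented purely to show the arithmetic. The first helper computes how often an LLM labels a variable as a confounder, separately for variables experts call confounders and for those they do not; the second computes how often two LLM runs give the same label.

```python
from typing import Dict


def labeling_rates(expert: Dict[str, bool], llm: Dict[str, bool]) -> Dict[str, float]:
    """Share of expert confounders and expert non-confounders that the LLM labels as confounders."""
    confounders = [v for v, is_conf in expert.items() if is_conf]
    non_confounders = [v for v, is_conf in expert.items() if not is_conf]
    rate_conf = sum(llm[v] for v in confounders) / len(confounders)
    rate_non = sum(llm[v] for v in non_confounders) / len(non_confounders)
    # A gap near zero means the LLM barely distinguishes the two groups.
    return {
        "expert_confounders": rate_conf,
        "expert_non_confounders": rate_non,
        "gap": rate_conf - rate_non,
    }


def agreement(run_a: Dict[str, bool], run_b: Dict[str, bool]) -> float:
    """Fraction of variables on which two LLM runs (models or prompts) give the same label."""
    return sum(run_a[v] == run_b[v] for v in run_a) / len(run_a)


# Toy, made-up labels (assumption: placeholder variables, not CDP data).
expert = {"var_a": True, "var_b": True, "var_c": False, "var_d": False}
llm_run_1 = {"var_a": True, "var_b": False, "var_c": True, "var_d": False}
llm_run_2 = {"var_a": True, "var_b": True, "var_c": True, "var_d": False}

rates_1 = labeling_rates(expert, llm_run_1)
rates_2 = labeling_rates(expert, llm_run_2)
run_agreement = agreement(llm_run_1, llm_run_2)
print(rates_1, rates_2, run_agreement)
```

In this toy setup the first run labels both groups at the same rate, so its gap is zero, while the second run shows a larger gap; the two runs also disagree on one of the four variables. The paper's actual evaluation may use different metrics.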
Abstract
Large language models (LLMs) offer the potential to automate a large number of tasks that previously have not been possible to automate, including some in science. There is considerable interest in whether LLMs can automate the process of causal inference by providing the information about causal links necessary to build a structural model. We use the case of confounding in the Coronary Drug Project (CDP), for which there are several studies listing expert-selected confounders that can serve as a ground truth. LLMs exhibit mediocre performance in identifying confounders in this setting, even though text about the ground truth is in their training data. Variables that experts identify as confounders are only slightly more likely to be labeled as confounders by LLMs compared to variables that experts consider non-confounders. Further, LLM judgment on confounder status is highly…
Taxonomy
Topics: Legal Education and Practice Innovations · International Arbitration and Investment Law · Law, AI, and Intellectual Property
Methods: Causal inference
