Linear-LLM-SCM: Benchmarking LLMs for Coefficient Elicitation in Linear-Gaussian Causal Models
Kanta Yamaoka, Sumantrak Mukherjee, Thomas Gärtner, David Antony Selby, Stefan Konigorski, Eyke Hüllermeier, Viktor Bengs, Sebastian Josef Vollmer

TL;DR
This paper introduces a benchmarking framework to evaluate large language models' ability to estimate causal effect sizes in linear Gaussian structural causal models, revealing current limitations and variability in performance.
Contribution
The paper presents Linear-LLM-SCM, a novel plug-and-play benchmarking framework for assessing LLMs' ability to perform quantitative causal reasoning in continuous domains.
Findings
LLMs show high variability in coefficient estimation.
Results are sensitive to DAG misspecification and perturbations.
Current LLMs face challenges in accurate quantitative causal reasoning.
Abstract
Large language models (LLMs) have shown potential in identifying qualitative causal relations, but their ability to perform quantitative causal reasoning -- estimating effect sizes that parametrize functional relationships -- remains underexplored in continuous domains. We introduce Linear-LLM-SCM, a plug-and-play benchmarking framework for evaluating LLMs on linear Gaussian structural causal model (SCM) parametrization when the DAG is given. The framework decomposes a DAG into local parent-child sets and prompts an LLM to produce a regression-style structural equation per node; the per-node equations are then aggregated and compared against available ground-truth parameters. Our experiments show several challenges in such benchmarking tasks, namely, strong stochasticity in the results of some models and susceptibility to DAG misspecification via spurious edges in continuous domains. Across models,…
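The pipeline described in the abstract (decompose the DAG into parent-child sets, elicit one structural equation per node, aggregate, compare with ground truth) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper's code: `parent_sets`, `elicit_scm`, and `mean_abs_error` are illustrative helpers, `stub_elicitor` is a placeholder standing in for an LLM call, the toy DAG and coefficients are made up, and mean absolute error is an assumed comparison metric, since the excerpt does not say which metric the framework uses.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]  # (parent, child)


def parent_sets(nodes: List[str], edges: List[Edge]) -> Dict[str, List[str]]:
    """Decompose a DAG into local parent-child sets, one entry per node."""
    parents: Dict[str, List[str]] = {n: [] for n in nodes}
    for parent, child in edges:
        parents[child].append(parent)
    return parents


def elicit_scm(
    parents: Dict[str, List[str]],
    elicit: Callable[[str, List[str]], Dict[str, float]],
) -> Dict[Edge, float]:
    """Query the elicitor once per non-root node and aggregate edge coefficients."""
    coeffs: Dict[Edge, float] = {}
    for child, pa in parents.items():
        if not pa:
            continue  # root nodes have no incoming edges to parametrize
        local = elicit(child, pa)  # hypothetical LLM call: {parent: coefficient}
        for p in pa:
            coeffs[(p, child)] = local[p]
    return coeffs


def mean_abs_error(estimated: Dict[Edge, float], truth: Dict[Edge, float]) -> float:
    """Assumed metric: mean absolute error over the ground-truth edges."""
    return sum(abs(estimated[e] - truth[e]) for e in truth) / len(truth)


# Toy example (made-up DAG and coefficients, for illustration only).
nodes = ["X", "Z", "Y"]
edges = [("X", "Z"), ("X", "Y"), ("Z", "Y")]
truth = {("X", "Z"): 0.5, ("X", "Y"): 1.0, ("Z", "Y"): -2.0}


def stub_elicitor(child: str, pa: List[str]) -> Dict[str, float]:
    # Placeholder for an LLM: guesses a coefficient of 1.0 for every parent.
    return {p: 1.0 for p in pa}


parents = parent_sets(nodes, edges)
estimated = elicit_scm(parents, stub_elicitor)
mae = mean_abs_error(estimated, truth)
print(parents)
print(estimated)
print(mae)
```

In a real run, `stub_elicitor` would be replaced by a function that builds a prompt from the child and its parents and parses the returned equation. Because the abstract reports strong stochasticity for some models, repeating the elicitation several times and inspecting the spread of `mae` would be a natural extension of this sketch.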
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Advanced Causal Inference Techniques · Explainable Artificial Intelligence (XAI)
