Can Large Language Models Infer Causal Relationships from Real-World Text?
Ryan Saklad, Aman Chadha, Oleg Pavlov, Raha Moraffah

TL;DR
This paper introduces a benchmark built from real-world academic texts to evaluate whether large language models can infer causal relationships, and finds that the task remains challenging for current models.
Contribution
It presents what is, to the best of the authors' knowledge, the first real-world dataset for inferring causal relationships from text, and analyzes LLM performance on texts that vary in length, complexity, and domain.
Findings
LLMs achieve an average F1 score of 0.535 on the benchmark.
Performance varies with explicitness, number of causal relations, text length, and domain.
The benchmark provides targeted insights for improving LLM causal reasoning.
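For readers unfamiliar with how an F1 score applies to causal relations, here is a minimal sketch. It assumes that predictions and references are compared as sets of directed (cause, effect) pairs with exact string matching. That assumption is made for illustration only; the summary above does not describe the paper's matching procedure. The function name `edge_f1` and the example edges are hypothetical.

```python
def edge_f1(predicted, gold):
    """Return (precision, recall, F1) over sets of directed (cause, effect) edges.

    Assumption: an edge counts as correct only if both endpoints and the
    direction match a gold edge exactly. The paper may match edges differently.
    """
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy example, not taken from the benchmark.
gold = [("smoking", "tar"), ("tar", "cancer"), ("stress", "smoking")]
predicted = [("smoking", "tar"), ("tar", "cancer"), ("cancer", "stress")]

precision, recall, f1 = edge_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this exact-match assumption, the reversed edge ("cancer", "stress") counts as an error, so the score penalizes getting the direction of causation wrong as well as missing a relation entirely.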
Abstract
Understanding and inferring causal relationships from texts is a core aspect of human cognition and is essential for advancing large language models (LLMs) towards artificial general intelligence. Existing work evaluating LLM causal reasoning primarily relies on synthetic or simplified texts with explicitly stated causal relationships. These texts typically feature short passages and few causal relations, failing to reflect the complexities of real-world reasoning. In this paper, we investigate whether LLMs are capable of inferring causal relationships from real-world texts. We develop a benchmark drawn from real-world academic literature, which includes diverse texts with respect to length, complexity (different levels of explicitness, number of causal events and relationships), and domain. To the best of our knowledge, our benchmark is the first-ever real-world dataset for this task.…
