Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts
Sydney Anuyah, Sneha Shajee-Mohan, Ankit-Singh Chauhan, Sunandan Chakraborty

TL;DR
This paper benchmarks 13 open-source large language models on pairwise causal discovery tasks in biomedical and multi-domain texts, revealing significant performance gaps especially on complex and implicit causal relations.
Contribution
It introduces a comprehensive evaluation framework and dataset for assessing LLMs' causal reasoning abilities, highlighting current models' limitations and providing resources for future research.
Findings
Best detection model (DeepSeek-R1-Distill-Llama-70B) reached a mean score of only 49.57%
Best extraction model scored 47.12%
Performance drops on complex and implicit causal relations
Abstract
The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires them to be able to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) Causal Detection (identifying whether a text contains a causal link) and 2) Causal Extraction (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero-shot) to more complex strategies like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, only achieved a mean score of 49.57%, while the best for extraction,…
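To make the three prompting strategies for the detection task concrete, here is a minimal Python sketch. It is an illustration only: the instruction wording, the function names, and the Yes/No answer format are assumptions, since the paper's actual prompts and scoring are not shown in this summary.

```python
from typing import List, Tuple

# Hypothetical instruction text; not taken from the paper.
INSTRUCTION = "Does the following text state a causal relation? Answer Yes or No."


def zero_shot(text: str) -> str:
    """Zero-shot: the instruction plus the input text, nothing else."""
    return f"{INSTRUCTION}\nText: {text}\nAnswer:"


def chain_of_thought(text: str) -> str:
    """CoT: ask the model to reason first, then give a final answer."""
    return (
        f"{INSTRUCTION}\nText: {text}\n"
        "Think step by step, then give the final answer after 'Answer:'."
    )


def few_shot(text: str, examples: List[Tuple[str, str]]) -> str:
    """FICL: prepend labeled (text, answer) demonstrations to the query."""
    demos = "\n".join(f"Text: {t}\nAnswer: {a}" for t, a in examples)
    return f"{INSTRUCTION}\n{demos}\nText: {text}\nAnswer:"


def parse_detection(reply: str) -> bool:
    """Map a free-text reply to a binary label.

    Uses whatever follows the last 'Answer:' marker if one is present,
    otherwise the whole reply. Assumes a Yes/No answer format.
    """
    tail = reply.rsplit("Answer:", 1)[-1].strip().lower()
    return tail.startswith("yes")
```

For example, `few_shot("Smoking leads to lung cancer.", [("Rain fell on Monday.", "No")])` builds a prompt with one demonstration, and `parse_detection` turns the model's reply into a label that can be compared against the gold annotation.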
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Artificial Intelligence in Healthcare and Education
