Benchmarking LLMs for Pairwise Causal Discovery in Biomedical and Multi-Domain Contexts

Sydney Anuyah; Sneha Shajee-Mohan; Ankit-Singh Chauhan; Sunandan Chakraborty

arXiv:2601.15479 · cs.CL · March 13, 2026

Open Access

TL;DR

This paper benchmarks 13 open-source large language models on pairwise causal discovery tasks in biomedical and multi-domain texts, revealing significant performance gaps especially on complex and implicit causal relations.

Contribution

It introduces a comprehensive evaluation framework and dataset for assessing LLMs' causal reasoning abilities, highlighting current models' limitations and providing resources for future research.

Findings

1. The best detection model, DeepSeek-R1-Distill-Llama-70B, reached a mean score of only 49.57%.

2. The best extraction model scored 47.12%.

3. Performance drops on complex and implicit causal relations.

Abstract

The safe deployment of large language models (LLMs) in high-stakes fields like biomedicine requires them to reason about cause and effect. We investigate this ability by testing 13 open-source LLMs on a fundamental task: pairwise causal discovery (PCD) from text. Our benchmark, using 12 diverse datasets, evaluates two core skills: 1) Causal Detection (identifying whether a text contains a causal link) and 2) Causal Extraction (pulling out the exact cause and effect phrases). We tested various prompting methods, from simple instructions (zero-shot) to more complex strategies like Chain-of-Thought (CoT) and Few-shot In-Context Learning (FICL). The results show major deficiencies in current models. The best model for detection, DeepSeek-R1-Distill-Llama-70B, achieved a mean score of only 49.57% ($C_{detect}$), while the best for extraction,…
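To make the two skills concrete, here is a minimal Python sketch of how detection and extraction could be framed and scored. It is an illustration only: the abstract gives neither the prompt wording nor the definition of $C_{detect}$, so the prompt strings, the `Example` type, and the two scoring functions (plain accuracy and normalized exact match) are assumptions standing in for the paper's actual setup, and the toy sentences are invented.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Pair = Tuple[str, str]  # (cause phrase, effect phrase)


@dataclass
class Example:
    """One benchmark item (hypothetical schema, not the paper's)."""
    text: str
    is_causal: bool
    pair: Optional[Pair] = None  # gold cause/effect, only for causal items


def detection_prompt(text: str) -> str:
    # Hypothetical zero-shot wording for Causal Detection.
    return (
        "Does the following text state a cause-and-effect relation? "
        "Answer Yes or No.\n"
        f"Text: {text}"
    )


def extraction_prompt(text: str) -> str:
    # Hypothetical zero-shot wording for Causal Extraction.
    return (
        "Copy the exact cause phrase and effect phrase from the text.\n"
        f"Text: {text}\nCause:\nEffect:"
    )


def normalize(phrase: str) -> str:
    # Lowercase and collapse whitespace before comparing phrases.
    return " ".join(phrase.lower().split())


def detection_accuracy(examples: List[Example], preds: List[bool]) -> float:
    # Stand-in metric: fraction of items whose yes/no label is correct.
    if not examples:
        return 0.0
    correct = sum(ex.is_causal == p for ex, p in zip(examples, preds))
    return correct / len(examples)


def extraction_exact_match(
    examples: List[Example], preds: List[Optional[Pair]]
) -> float:
    # Stand-in metric: over causal items only, fraction where both the
    # cause and the effect match the gold phrases after normalization.
    causal = [(ex, p) for ex, p in zip(examples, preds) if ex.pair is not None]
    if not causal:
        return 0.0
    hits = sum(
        p is not None
        and (normalize(p[0]), normalize(p[1]))
        == (normalize(ex.pair[0]), normalize(ex.pair[1]))
        for ex, p in causal
    )
    return hits / len(causal)


# Invented toy data, for illustration only.
examples = [
    Example("Smoking causes lung cancer.", True, ("Smoking", "lung cancer")),
    Example("The patient was admitted on Monday.", False),
    Example("Dehydration led to dizziness.", True, ("Dehydration", "dizziness")),
]
detection_preds = [True, False, False]
extraction_preds = [("smoking", "lung cancer"), None, ("dizziness", "Dehydration")]
```

Note that the third extraction prediction swaps cause and effect, so it counts as a miss under exact match: direction matters in pairwise causal discovery, not just which phrases are found.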

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Artificial Intelligence in Healthcare and Education