REACT-LLM: A Benchmark for Evaluating LLM Integration with Causal Features in Clinical Prognostic Tasks

Linna Wang; Zhixuan You; Qihui Zhang; Jiunan Wen; Ji Shi; Yimin Chen; Yusen Wang; Fanqi Ding; Ziliang Feng; Li Lu

arXiv:2511.07127 · cs.LG · November 14, 2025

TL;DR

REACT-LLM is a comprehensive benchmark designed to evaluate how well large language models can incorporate causal features to improve clinical risk prediction, highlighting current limitations and potential synergies.

Contribution

This work introduces REACT-LLM, the first benchmark assessing LLMs' ability to leverage causal features in clinical prognostics across multiple outcomes and datasets.

Findings

01. LLMs perform reasonably but do not outperform traditional ML models.

02. Integrating causal features yields limited performance improvements.

03. Many causal discovery methods face challenges due to assumptions violated in clinical data.

Abstract

Large Language Models (LLMs) and causal learning each hold strong potential for clinical decision making (CDM). However, their synergy remains poorly understood, largely due to the lack of systematic benchmarks evaluating their integration in clinical risk prediction. In real-world healthcare, identifying features with causal influence on outcomes is crucial for actionable and trustworthy predictions. While recent work highlights LLMs' emerging causal reasoning abilities, comprehensive benchmarks are still lacking to assess their causal learning and their performance when informed by causal features in clinical risk prediction. To address this, we introduce REACT-LLM, a benchmark designed to evaluate whether combining LLMs with causal features can enhance clinical prognostic performance and potentially outperform traditional machine learning (ML) methods. Unlike existing LLM-clinical benchmarks that…

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling