Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases

Dubai Li; Yuxiang He; Yan Hu; Yu Tian; Jingsong Li

arXiv:2603.22767·cs.AI·March 25, 2026

Open Access

TL;DR

This paper evaluates whether large language model agents can generate real-world evidence by executing observational studies on medical databases. Task success is low (39.9% for the best agent), and performance varies considerably across tasks and agent scaffolds.

Contribution

Introduces RWE-bench, a benchmark that assesses how well LLM agents execute observational studies and construct evidence bundles from real-world medical data, accompanied by a comprehensive evaluation and error analysis.

Findings

01 Best agent achieves 39.9% task success rate

02 Agent scaffold choice significantly impacts performance

03 Automated error localization aids in identifying failure modes

Abstract

Observational studies can yield clinically actionable evidence at scale, but executing them on real-world databases is open-ended and requires coherent decisions across cohort construction, analysis, and reporting. Prior evaluations of LLM agents emphasize isolated steps or single answers, missing the integrity and internal structure of the resulting evidence bundle. To address this gap, we introduce RWE-bench, a benchmark grounded in MIMIC-IV and derived from peer-reviewed observational studies. Each task provides the corresponding study protocol as the reference standard, requiring agents to execute experiments in a real database and iteratively generate tree-structured evidence bundles. We evaluate six LLMs (three open-source, three closed-source) under three agent scaffolds using both question-level correctness and end-to-end task metrics. Across 162 tasks, task success is low: the…

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Artificial Intelligence in Healthcare and Education