Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases
Dubai Li, Yuxiang He, Yan Hu, Yu Tian, Jingsong Li

TL;DR
This paper evaluates the ability of large language model agents to generate real-world observational evidence in medical databases, revealing significant limitations and variability in performance across tasks and scaffolds.
Contribution
Introduces RWE-bench, a benchmark that assesses whether LLM agents can execute observational studies on a real-world medical database (MIMIC-IV) and construct the resulting evidence bundles, with comprehensive evaluation and error analysis.
Findings
Best agent achieves 39.9% task success rate
Agent scaffold choice significantly impacts performance
Automated error localization aids in identifying failure modes
Abstract
Observational studies can yield clinically actionable evidence at scale, but executing them on real-world databases is open-ended and requires coherent decisions across cohort construction, analysis, and reporting. Prior evaluations of LLM agents emphasize isolated steps or single answers, missing the integrity and internal structure of the resulting evidence bundle. To address this gap, we introduce RWE-bench, a benchmark grounded in MIMIC-IV and derived from peer-reviewed observational studies. Each task provides the corresponding study protocol as the reference standard, requiring agents to execute experiments in a real database and iteratively generate tree-structured evidence bundles. We evaluate six LLMs (three open-source, three closed-source) under three agent scaffolds using both question-level correctness and end-to-end task metrics. Across 162 tasks, task success is low: the…
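To illustrate the difference between question-level correctness and end-to-end task metrics mentioned in the abstract, here is a minimal sketch in Python. It is not the paper's implementation: the `EvidenceNode` class, the function names, and the rule that a task succeeds only when every leaf question in its bundle is correct are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvidenceNode:
    """Hypothetical node in a tree-structured evidence bundle."""
    name: str
    correct: bool = False  # only meaningful for leaf (question) nodes
    children: List["EvidenceNode"] = field(default_factory=list)


def leaf_questions(node: EvidenceNode) -> List[EvidenceNode]:
    """Collect the leaf nodes, treated here as the graded questions."""
    if not node.children:
        return [node]
    leaves: List[EvidenceNode] = []
    for child in node.children:
        leaves.extend(leaf_questions(child))
    return leaves


def question_accuracy(bundles: List[EvidenceNode]) -> float:
    """Fraction of correct leaf questions pooled over all tasks."""
    leaves = [q for bundle in bundles for q in leaf_questions(bundle)]
    return sum(q.correct for q in leaves) / len(leaves)


def task_success_rate(bundles: List[EvidenceNode]) -> float:
    """Fraction of tasks whose leaf questions are all correct (assumed rule)."""
    successes = sum(
        all(q.correct for q in leaf_questions(bundle)) for bundle in bundles
    )
    return successes / len(bundles)


# Two toy tasks: the first is fully correct, the second fails at analysis.
bundles = [
    EvidenceNode("study A", children=[
        EvidenceNode("cohort", children=[EvidenceNode("cohort size", True)]),
        EvidenceNode("analysis", children=[EvidenceNode("hazard ratio", True)]),
        EvidenceNode("reporting", children=[EvidenceNode("conclusion", True)]),
    ]),
    EvidenceNode("study B", children=[
        EvidenceNode("cohort", children=[EvidenceNode("cohort size", True)]),
        EvidenceNode("analysis", children=[EvidenceNode("hazard ratio", False)]),
    ]),
]

print(question_accuracy(bundles))   # 0.8
print(task_success_rate(bundles))   # 0.5
```

Under this assumed rule, a single wrong question fails the whole task, so the task-level rate can sit well below pooled question accuracy.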
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling · Artificial Intelligence in Healthcare and Education
