ER-Reason: A Benchmark Dataset for LLM Clinical Reasoning in the Emergency Room

Nikita Mehandru; Niloufar Golchini; Namrata Garg; Kathy T. LeSaint; Christopher J. Nash; Anu Ramachandran; Travis Zack; Liam G. McCoy; Adam Rodman; David Bamman; Melanie Molina; Ahmed Alaa

arXiv:2505.22919·cs.CL·May 12, 2026

TL;DR

ER-Reason is a comprehensive benchmark dataset with real-world clinical notes and SCT-style questions designed to evaluate LLM reasoning across the entire emergency medicine workflow, addressing limitations of prior stylized benchmarks.

Contribution

The paper introduces ER-Reason, a large-scale benchmark built from real-world clinical notes that covers the emergency medicine workflow and assesses how LLMs update their reasoning as clinical evidence accumulates.

Findings

01. LLMs show varied performance across different clinical reasoning tasks.

02. Existing benchmarks do not fully capture real-world reasoning challenges.

03. ER-Reason reveals nuanced reasoning failures in LLMs.

Abstract

Existing benchmarks for evaluating the clinical reasoning capabilities of large language models (LLMs) often lack a clear definition of "clinical reasoning" as a construct, fail to capture the full breadth of interdependent tasks within a clinical workflow, and rely on stylized vignettes rather than real-world clinical documentation. As a result, recent studies have found significant discrepancies between LLM performance on stylized benchmarks derived from medical licensing exams and their performance in real-world prospective studies. To address these limitations, we introduce ER-Reason, a benchmark designed to evaluate LLM reasoning as clinical evidence accumulates across decision-making tasks spanning the full workflow of emergency medicine. ER-Reason comprises 25,174 de-identified clinical notes from 3,437 patients, supporting evaluation across all stages of the emergency department…
