ER-Reason: A Benchmark Dataset for LLM Clinical Reasoning in the Emergency Room
Nikita Mehandru, Niloufar Golchini, Namrata Garg, Kathy T. LeSaint, Christopher J. Nash, Anu Ramachandran, Travis Zack, Liam G. McCoy, Adam Rodman, David Bamman, Melanie Molina, Ahmed Alaa

TL;DR
ER-Reason is a benchmark dataset of real-world clinical notes and Script Concordance Test (SCT)-style questions designed to evaluate LLM reasoning across the entire emergency medicine workflow, addressing the limitations of prior benchmarks built on stylized vignettes.
Contribution
The paper introduces ER-Reason, a large-scale benchmark built from real-world clinical notes that covers the stages of the emergency medicine workflow and assesses how LLMs update their reasoning as clinical evidence accumulates.
Findings
LLMs show varied performance across different clinical reasoning tasks.
Existing benchmarks do not fully capture real-world reasoning challenges.
ER-Reason reveals nuanced reasoning failures in LLMs.
Abstract
Existing benchmarks for evaluating the clinical reasoning capabilities of large language models (LLMs) often lack a clear definition of "clinical reasoning" as a construct, fail to capture the full breadth of interdependent tasks within a clinical workflow, and rely on stylized vignettes rather than real-world clinical documentation. As a result, recent studies have found significant discrepancies between LLM performance on stylized benchmarks derived from medical licensing exams and their performance in real-world prospective studies. To address these limitations, we introduce ER-Reason, a benchmark designed to evaluate LLM reasoning as clinical evidence accumulates across decision-making tasks spanning the full workflow of emergency medicine. ER-Reason comprises 25,174 de-identified clinical notes from 3,437 patients, supporting evaluation across all stages of the emergency department…
