AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment

Robin Linzmayer (1, 2), Georgianna Lin (2), Di Coneybeare (3), Jason Chu (3), Trudi Cloyd (3), Manish Garg (3), Miles Gordon (3), Elizabeth Hartofilis (3), Benjamin Hong (3), Ashraf Hussain (3), Eugene Y. Kim (3), Oluchi Iheagwara King (3), Ross McCormack (3), Erica Olsen (3), John K. Riggins Jr (3), Mustafa N. Rasheed (3), Dana L. Sacco (3), Vinay Saggar (3), Osman R. Sayan (3), Amit Shembekar (3), Janice Shin-Kim (3), Wendy W. Sun (3), Bernard P. Chang (3), David Kessler (3), Noémie Elhadad (1, 2) ((1) Department of Computer Science, Columbia University; (2) Department of Biomedical Informatics, Columbia University; (3) Department of Emergency Medicine, Columbia University Irving Medical Center)

arXiv:2605.11398·cs.AI·May 13, 2026

TL;DR

AcuityBench is a 914-case benchmark that evaluates whether language models identify the appropriate urgency of care across user conversations, online forum posts, clinical vignettes, and patient portal messages.

Contribution

It introduces a unified framework and dataset for assessing acuity identification in language models, harmonizing five public datasets under a shared four-level acuity scale, and highlights variability across models in safety-critical health applications.

Findings

01

Models show significant variation in acuity accuracy.

02

Conversational responses tend to under-triage more than QA formats.

03

Models do not match physician judgment in ambiguous cases.

Abstract

We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health interactions, or narrow workflow-specific triage tasks, but they do not offer a unified evaluation of acuity identification across these settings. AcuityBench addresses this gap by harmonizing five public datasets spanning user conversations, online forum posts, clinical vignettes, and patient portal messages under a shared four-level acuity framework ranging from home monitoring to immediate emergency care. The benchmark contains 914 cases, including 697 consensus cases for standard accuracy evaluation and 217 physician-confirmed ambiguous cases for uncertainty-aware evaluation. It supports two complementary task formats: explicit four-way classification in a…
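As a rough illustration of how accuracy and under-triage could be scored on the consensus cases, the sketch below treats the four acuity levels as an ordinal scale. Only the lowest and highest level names come from the abstract; the two intermediate names, the `Acuity` enum, the `triage_summary` function, and the definition of under-triage as "predicted level below the reference level" are assumptions made for this example, not the paper's implementation.

```python
from enum import IntEnum


class Acuity(IntEnum):
    """Four-level ordinal acuity scale (higher value = more urgent)."""
    HOME_MONITORING = 0       # lowest level named in the abstract
    INTERMEDIATE_LOW = 1      # placeholder name, not from the paper
    INTERMEDIATE_HIGH = 2     # placeholder name, not from the paper
    IMMEDIATE_EMERGENCY = 3   # highest level named in the abstract


def triage_summary(reference, predicted):
    """Return accuracy, under-triage rate, and over-triage rate.

    Assumption: under-triage means the predicted level is strictly
    less urgent than the reference level; over-triage is the reverse.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have equal length")
    if not reference:
        raise ValueError("at least one case is required")

    n = len(reference)
    pairs = list(zip(reference, predicted))
    exact = sum(1 for r, p in pairs if p == r)
    under = sum(1 for r, p in pairs if p < r)
    over = sum(1 for r, p in pairs if p > r)
    return {
        "accuracy": exact / n,
        "under_triage": under / n,
        "over_triage": over / n,
    }


# Toy labels for illustration only; these are not benchmark data.
reference = [Acuity.IMMEDIATE_EMERGENCY, Acuity.HOME_MONITORING,
             Acuity.INTERMEDIATE_HIGH, Acuity.INTERMEDIATE_LOW]
predicted = [Acuity.IMMEDIATE_EMERGENCY, Acuity.INTERMEDIATE_LOW,
             Acuity.INTERMEDIATE_LOW, Acuity.INTERMEDIATE_LOW]
summary = triage_summary(reference, predicted)
```

Reporting under-triage and over-triage separately, rather than accuracy alone, keeps the direction of an error visible, which matters when recommending too little care is the more dangerous mistake.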
