AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment
Robin Linzmayer (1, 2), Georgianna Lin (2), Di Coneybeare (3), Jason Chu (3), Trudi Cloyd (3), Manish Garg (3), Miles Gordon (3), Elizabeth Hartofilis (3), Benjamin Hong (3), Ashraf Hussain (3), Eugene Y. Kim (3), Oluchi Iheagwara King (3), Ross McCormack (3), Erica Olsen (3)

TL;DR
AcuityBench is a benchmark that evaluates whether language models correctly identify the urgency of medical care across diverse real-world health scenarios, a safety-critical capability.
Contribution
It introduces a unified framework and dataset for assessing acuity identification in language models, and highlights the variability in model performance and the challenges that remain in safety-critical health applications.
Findings
Models vary significantly in how accurately they identify acuity.
Models under-triage more often in conversational responses than in QA formats.
Models do not match physician judgment in ambiguous cases.
Abstract
We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health interactions, or narrow workflow-specific triage tasks, but they do not offer a unified evaluation of acuity identification across these settings. AcuityBench addresses this gap by harmonizing five public datasets spanning user conversations, online forum posts, clinical vignettes, and patient portal messages under a shared four-level acuity framework ranging from home monitoring to immediate emergency care. The benchmark contains 914 cases, including 697 consensus cases for standard accuracy evaluation and 217 physician-confirmed ambiguous cases for uncertainty-aware evaluation. It supports two complementary task formats: explicit four-way classification in a…
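To make the four-level framework and the notion of under-triage concrete, here is a minimal Python sketch of how predictions on an ordinal acuity scale could be scored. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the abstract names only the two endpoints of the scale (home monitoring and immediate emergency care), so the two intermediate level names, the `Acuity` enum, and the `triage_metrics` function are hypothetical. Under-triage is assumed here to mean predicting a lower urgency than the reference label.

```python
from enum import IntEnum


class Acuity(IntEnum):
    """Four-level ordinal acuity scale; higher value = more urgent (assumed)."""
    HOME_MONITORING = 1      # lowest level, named in the abstract
    INTERMEDIATE_LOW = 2     # placeholder name, not taken from the source
    INTERMEDIATE_HIGH = 3    # placeholder name, not taken from the source
    IMMEDIATE_EMERGENCY = 4  # highest level, named in the abstract


def triage_metrics(gold, pred):
    """Return accuracy plus under- and over-triage rates for paired labels.

    Under-triage: predicted urgency is lower than the reference label.
    Over-triage: predicted urgency is higher than the reference label.
    These definitions are assumptions made for illustration.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equal in length")
    n = len(gold)
    correct = sum(1 for g, p in zip(gold, pred) if p == g)
    under = sum(1 for g, p in zip(gold, pred) if p < g)
    over = sum(1 for g, p in zip(gold, pred) if p > g)
    return {
        "accuracy": correct / n,
        "under_triage_rate": under / n,
        "over_triage_rate": over / n,
    }
```

A metric of this shape would apply to the consensus cases, where a single reference label exists; the physician-confirmed ambiguous cases call for an uncertainty-aware evaluation that this sketch does not attempt to model.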
