TL;DR
RealICU is a benchmark for evaluating large language models' understanding of long, evolving ICU data, highlighting current models' limitations in clinical reasoning and safety.
Contribution
We introduce RealICU, a realistic ICU benchmark with hindsight annotations, use it to reveal LLM shortcomings, and propose structured-memory agents for better reasoning.
Findings
Existing LLMs perform poorly on the RealICU tasks.
Memory-augmented models show some improvement but still have safety issues.
Hindsight annotations enable more accurate assessment of LLM reasoning.
Abstract
Intensive care units (ICUs) generate long, dense, and evolving streams of clinical information, and physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are taken under incomplete information and with limited temporal context about the underlying patient state, and may therefore be suboptimal, which makes it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that…
