RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Chengzhi Shen; Weixiang Shen; Tobias Susetzky; Chen (Cherise) Chen; Jun Li; Yuyuan Liu; Xuepeng Zhang; Zhenyu Gong; Daniel Rueckert; Jiazhen Pan

arXiv:2605.13542 · cs.AI · May 14, 2026

TL;DR

RealICU is a benchmark for evaluating large language models' understanding of long, evolving ICU data, highlighting current models' limitations in clinical reasoning and safety.

Contribution

We introduce RealICU, a realistic ICU benchmark with hindsight annotations. The benchmark exposes shortcomings of current LLMs, and we propose structured-memory agents to improve their reasoning.

Findings

01. Existing LLMs perform poorly on ICU tasks.

02. Memory-augmented models show some improvement but still have safety issues.

03. Hindsight annotations enable more accurate assessment of LLM reasoning.

Abstract

Intensive care units (ICUs) generate long, dense, and evolving streams of clinical information, and physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are taken with incomplete information and limited temporal context about the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that…
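
The four tasks named in the abstract can be pictured as fields of a single label record. The sketch below is purely illustrative and is not the benchmark's actual schema or metric: the class `HindsightAnnotation`, its field names, the helpers `set_f1` and `red_flag_violations`, and the placeholder strings are all assumptions made for this example. It shows one plausible way to score a model's proposed actions against a hindsight label, and to separately count proposals that fall in the red-flag set.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HindsightAnnotation:
    """Hypothetical label record; one field per task named in the abstract."""
    patient_status: str
    acute_problems: set[str] = field(default_factory=set)
    recommended_actions: set[str] = field(default_factory=set)
    red_flag_actions: set[str] = field(default_factory=set)


def set_f1(predicted: set[str], reference: set[str]) -> float:
    """F1 overlap between two sets of strings (an assumed metric, not the paper's)."""
    if not predicted and not reference:
        return 1.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def red_flag_violations(predicted: set[str], label: HindsightAnnotation) -> set[str]:
    """Return the proposed actions that the label marks as red flags."""
    return predicted & label.red_flag_actions


# Placeholder strings only; they carry no clinical meaning.
label = HindsightAnnotation(
    patient_status="status_x",
    acute_problems={"problem_a"},
    recommended_actions={"action_a", "action_b"},
    red_flag_actions={"action_c"},
)
predicted_actions = {"action_a", "action_c"}

action_score = set_f1(predicted_actions, label.recommended_actions)
violations = red_flag_violations(predicted_actions, label)
```

Keeping the red-flag check separate from the overlap score reflects the abstract's distinction between recommending useful actions and avoiding harmful ones: a proposal can partially match the recommended set and still contain an action that should be flagged.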
