AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
Baraa Al Jorf, Farah E. Shamout

TL;DR
This paper systematically evaluates large language model agents for multimodal clinical prediction, revealing that single-agent systems outperform naive multi-agent setups and emphasizing the need for improved collaboration among agents.
Contribution
It introduces a benchmark for evaluating LLM agents in healthcare, highlighting performance gaps and providing a framework for future research.
Findings
Single-agent frameworks outperform naive multi-agent systems.
Single agents are better at handling multimodal data.
Single agents are better calibrated.
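To make the calibration finding concrete: this summary does not say which calibration metric the paper reports, so the sketch below assumes one common choice, expected calibration error (ECE) for binary risk predictions. The function name, the equal-width binning, and the toy inputs are illustrative assumptions, not the authors' evaluation code.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE (assumed metric): the bin-size-weighted mean gap between
    the average predicted risk and the observed event rate in each bin."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")

    # Group predictions into equal-width probability bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_pred = sum(p for p, _ in members) / len(members)
        event_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_pred - event_rate)
    return ece


# Toy data (made up): a model predicting 10% risk where 1 in 10 patients has
# the event, versus one predicting 90% risk where only half do.
well_calibrated = expected_calibration_error([0.1] * 10, [1] + [0] * 9)
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

Under this metric a lower value means predicted risks track observed event rates more closely, which is the sense in which one system can be "better calibrated" than another.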
Abstract
Building effective clinical decision support systems requires the synthesis of complex heterogeneous multimodal data. Such modalities include temporal electronic health records data, medical images, radiology reports, and clinical notes. Large language model (LLM)-based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between…
