AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

Baraa Al Jorf; Farah E. Shamout

arXiv:2605.10286 · cs.AI · May 12, 2026

TL;DR

This paper systematically evaluates large language model agents for multimodal clinical prediction. It finds that single-agent systems outperform naive multi-agent setups, which points to a need for better collaboration among agents.

Contribution

It introduces a benchmark for evaluating LLM agents in healthcare, highlighting performance gaps and providing a framework for future research.

Findings

01

Single-agent frameworks outperform naive multi-agent systems.

02

Single agents handle multimodal data better than naive multi-agent systems.

03

Single agents are better calibrated than naive multi-agent systems.

Abstract

Building effective clinical decision support systems requires the synthesis of complex heterogeneous multimodal data. Such modalities include temporal electronic health records data, medical images, radiology reports, and clinical notes. Large language model (LLM)-based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between…
