MoMA: A Mixture-of-Multimodal-Agents Architecture for Enhancing Clinical Prediction Modelling

Jifan Gao; Mahmudur Rahman; John Caskey; Madeline Oguss; Ann O'Rourke; Randy Brown; Anne Stey; Anoop Mayampurath; Matthew M. Churpek; Guanhua Chen; Majid Afshar

arXiv:2508.05492 · cs.LG · August 8, 2025

TL;DR

MoMA introduces an architecture in which multiple large language model (LLM) agents integrate multimodal EHR data, including medical images and laboratory results, to improve clinical prediction accuracy.

Contribution

The paper presents MoMA, a multimodal architecture in which specialist LLM agents convert non-textual data into textual summaries, an aggregator agent combines these summaries with clinical notes, and a predictor agent uses the combined summary to make clinical predictions.

Findings

1. MoMA outperforms existing methods on three clinical prediction tasks.
2. MoMA effectively integrates non-textual modalities into LLM-based models.
3. MoMA demonstrates flexibility across various multimodal datasets.

Abstract

Multimodal electronic health record (EHR) data provide richer, complementary insights into patient health compared to single-modality data. However, effectively integrating diverse data modalities for clinical prediction modeling remains challenging due to the substantial data requirements. We introduce a novel architecture, Mixture-of-Multimodal-Agents (MoMA), designed to leverage multiple large language model (LLM) agents for clinical prediction tasks using multimodal EHR data. MoMA employs specialized LLM agents ("specialist agents") to convert non-textual modalities, such as medical images and laboratory results, into structured textual summaries. These summaries, together with clinical notes, are combined by another LLM ("aggregator agent") to generate a unified multimodal summary, which is then used by a third LLM ("predictor agent") to produce clinical predictions. Evaluating…
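The three-stage flow described in the abstract (specialist agents, then an aggregator agent, then a predictor agent) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable type, `PatientRecord`, `run_moma`, and all prompt wording are hypothetical, and every modality is assumed to arrive already serialized as a string (a real image specialist would need a vision-capable model). The demo agents are toy functions standing in for actual LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical interface: an "agent" is any function mapping a prompt string
# to a text completion (e.g. a wrapper around an LLM API). Not from the paper.
LLM = Callable[[str], str]


@dataclass
class PatientRecord:
    notes: str  # free-text clinical notes, already textual
    # Non-textual modalities keyed by name; assumed pre-serialized to strings here.
    modalities: Dict[str, str] = field(default_factory=dict)


def run_moma(record: PatientRecord,
             specialists: Dict[str, LLM],
             aggregator: LLM,
             predictor: LLM) -> str:
    # Stage 1: one specialist agent per modality writes a textual summary.
    summaries = {}
    for name, data in record.modalities.items():
        prompt = f"Summarize this {name} data as structured text:\n{data}"
        summaries[name] = specialists[name](prompt)

    # Stage 2: the aggregator agent merges the specialist summaries with the notes.
    sections = [f"[{name} summary]\n{text}" for name, text in summaries.items()]
    sections.append(f"[clinical notes]\n{record.notes}")
    unified = aggregator("Combine into one patient summary:\n\n" + "\n\n".join(sections))

    # Stage 3: the predictor agent sees only the unified summary.
    return predictor(f"Patient summary:\n{unified}\n\nPrediction:")


if __name__ == "__main__":
    # Toy stand-ins for real LLM agents (made-up behavior, for illustration only).
    toy_specialists = {
        "labs": lambda prompt: "lab value X: high",
        "imaging": lambda prompt: "no acute findings",
    }
    passthrough_aggregator = lambda prompt: prompt
    keyword_predictor = lambda prompt: "positive" if "high" in prompt else "negative"

    record = PatientRecord(
        notes="Toy note text.",
        modalities={"labs": "<lab table>", "imaging": "<image placeholder>"},
    )
    print(run_moma(record, toy_specialists, passthrough_aggregator, keyword_predictor))
    # prints: positive
```

Feeding the predictor only the aggregator's output mirrors the abstract's statement that predictions are made from the unified multimodal summary; the sketch says nothing about which models, prompts, or training procedures the paper actually uses.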
