From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation

Zhen Chen; Yihang Fu; Gabriel Madera; Mauro Giuffre; Serina Applebaum; Hyunjae Kim; Hua Xu; Qingyu Chen

arXiv:2511.22232·cs.CV·December 1, 2025

Open Access

TL;DR

This paper introduces M3LLM, a multi-modal large language model trained on compound images from biomedical literature to support multi-image understanding in clinical applications. It is validated with a new benchmark and outperforms existing models on multi-image understanding tasks.

Contribution

The paper presents a novel framework leveraging compound images from biomedical literature to train a multi-image multi-modal LLM, addressing data scarcity and enabling composite reasoning in medical contexts.

Findings

01. M3LLM outperforms existing models in multi-image understanding tasks.

02. The framework effectively generalizes to longitudinal chest X-ray analysis.

03. The approach bridges biomedical literature and clinical multi-image reasoning.

Abstract

Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, medical diagnosis and progression assessment often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound images in biomedical literature as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI