Medical Large Vision Language Models with Multi-Image Visual Ability

Xikai Yang; Juzheng Miao; Yuchen Yuan; Jiaze Wang; Qi Dou; Jinpeng Li; Pheng-Ann Heng

arXiv:2505.19031 · cs.CV · May 27, 2025

TL;DR

This paper introduces a new dataset and models to improve medical vision-language models' ability to understand and analyze multiple medical images, addressing a key gap in multi-image clinical reasoning.

Contribution

The paper presents the Med-MIM dataset and fine-tuned models, advancing multi-image understanding in medical LVLMs beyond single-image capabilities.

Findings

01. Models fine-tuned on Med-MIM outperform existing LVLMs on multi-image tasks.

02. Med-MIM dataset effectively enhances multi-image reasoning in medical LVLMs.

03. Proposed models demonstrate superior performance on the Med-MIM benchmark.

Abstract

Medical large vision-language models (LVLMs) have demonstrated promising performance across various single-image question answering (QA) benchmarks, yet their capability in processing multi-image clinical scenarios remains underexplored. Unlike single-image tasks, medical tasks involving multiple images often demand sophisticated visual understanding capabilities, such as temporal reasoning and cross-modal analysis, which are poorly supported by current medical LVLMs. To bridge this critical gap, we present the Med-MIM instruction dataset, comprising 83.2K medical multi-image QA pairs that span four types of multi-image visual abilities (temporal understanding, reasoning, comparison, and co-reference). Using this dataset, we fine-tune Mantis and LLaVA-Med, resulting in two specialized medical VLMs: MIM-LLaVA-Med and Med-Mantis, both optimized for multi-image analysis. Additionally, we…
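To make the idea of a multi-image QA pair concrete, here is a minimal Python sketch of how one such sample might be represented. The record layout, the names `Ability`, `MultiImageQA`, and `count_by_ability`, and the rule that a sample holds at least two images are all assumptions made for illustration; they are not the actual Med-MIM schema, which the abstract does not describe. Only the four ability categories come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class Ability(Enum):
    """The four multi-image visual abilities named in the abstract."""
    TEMPORAL = "temporal understanding"
    REASONING = "reasoning"
    COMPARISON = "comparison"
    CO_REFERENCE = "co-reference"


@dataclass
class MultiImageQA:
    """Hypothetical record for one multi-image QA pair (not the real schema)."""
    image_paths: List[str]
    question: str
    answer: str
    ability: Ability

    def __post_init__(self) -> None:
        # Assumption: "multi-image" means two or more images per sample.
        if len(self.image_paths) < 2:
            raise ValueError("a multi-image sample needs at least two images")


def count_by_ability(samples: Iterable[MultiImageQA]) -> Dict[Ability, int]:
    """Tally samples per ability, including abilities with zero samples."""
    counts = {ability: 0 for ability in Ability}
    for sample in samples:
        counts[sample.ability] += 1
    return counts
```

A tally like `count_by_ability` is one simple way to check how a collection of samples is distributed across the four abilities before fine-tuning; the abstract does not report the actual distribution.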

Code & Models

Repositories

xikai97/med-mim (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications