Medical Large Vision Language Models with Multi-Image Visual Ability
Xikai Yang, Juzheng Miao, Yuchen Yuan, Jiaze Wang, Qi Dou, Jinpeng Li, Pheng-Ann Heng

TL;DR
This paper introduces a new dataset and models to improve medical vision-language models' ability to understand and analyze multiple medical images, addressing a key gap in multi-image clinical reasoning.
Contribution
The paper presents the Med-MIM dataset and fine-tuned models, advancing multi-image understanding in medical LVLMs beyond single-image capabilities.
Findings
Models fine-tuned on Med-MIM outperform existing LVLMs on multi-image tasks.
Med-MIM dataset effectively enhances multi-image reasoning in medical LVLMs.
Proposed models demonstrate superior performance on the Med-MIM benchmark.
Abstract
Medical large vision-language models (LVLMs) have demonstrated promising performance across various single-image question answering (QA) benchmarks, yet their capability in processing multi-image clinical scenarios remains underexplored. Unlike single-image tasks, medical tasks involving multiple images often demand sophisticated visual understanding capabilities, such as temporal reasoning and cross-modal analysis, which are poorly supported by current medical LVLMs. To bridge this critical gap, we present the Med-MIM instruction dataset, comprising 83.2K medical multi-image QA pairs that span four types of multi-image visual abilities (temporal understanding, reasoning, comparison, and co-reference). Using this dataset, we fine-tune Mantis and LLaVA-Med, resulting in two specialized medical VLMs, MIM-LLaVA-Med and Med-Mantis, both optimized for multi-image analysis. Additionally, we…
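To make the idea of a multi-image QA pair concrete, here is a minimal Python sketch of how one such sample might be represented. The abstract does not describe the actual Med-MIM file format, so the class name `MultiImageQA`, its field names, the file names, and the rule that a sample holds at least two images are all assumptions made for illustration. Only the four ability labels come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import List


class Ability(Enum):
    # The four multi-image ability types named in the abstract.
    TEMPORAL = "temporal understanding"
    REASONING = "reasoning"
    COMPARISON = "comparison"
    CO_REFERENCE = "co-reference"


@dataclass
class MultiImageQA:
    # Hypothetical record layout; not the real Med-MIM schema.
    image_paths: List[str]
    question: str
    answer: str
    ability: Ability

    def __post_init__(self) -> None:
        # Assumption: a multi-image sample references two or more images.
        if len(self.image_paths) < 2:
            raise ValueError("a multi-image sample needs at least two images")


def count_by_ability(samples: List[MultiImageQA]) -> Counter:
    """Count how many samples fall under each ability type."""
    return Counter(sample.ability for sample in samples)


# Placeholder samples; the paths, questions, and answers are invented.
samples = [
    MultiImageQA(
        image_paths=["scan_visit1.png", "scan_visit2.png"],
        question="Has the finding changed between the two visits?",
        answer="placeholder answer",
        ability=Ability.TEMPORAL,
    ),
    MultiImageQA(
        image_paths=["image_a.png", "image_b.png"],
        question="Which image shows the larger region of interest?",
        answer="placeholder answer",
        ability=Ability.COMPARISON,
    ),
]

counts = count_by_ability(samples)
```

Tagging each sample with a single ability label would let a benchmark report accuracy per ability, although whether Med-MIM is organized this way is not stated in the abstract.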
Taxonomy
Topics: Multimodal Machine Learning Applications
