MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images

Ankan Deria; Komal Kumar; Adinath Madhavrao Dukre; Eran Segal; Salman Khan; Imran Razzak

arXiv:2602.06965·cs.CV·March 13, 2026

TL;DR

MedMO is a multimodal large language model specialized for medical images. Trained on domain-specific data, it improves reasoning, spatial grounding, and task performance across medical tasks and imaging modalities.

Contribution

The paper introduces MedMO, a medical multimodal foundation model that combines multi-stage training on domain-specific data with reinforcement learning for grounded reasoning, and that surpasses existing medical baselines.

Findings

01 MedMO-8B-Next improves VQA benchmark results by 6.6% on average.

02 MedMO improves medical report generation by 6.7%.

03 MedMO achieves 56.1 IoU on the Bacteria grounding task.

Abstract

Multimodal large language models have advanced rapidly, but their adoption in medicine is constrained by limited domain coverage, imperfect modality alignment, and insufficient grounded reasoning. We introduce MedMO, a medical multimodal foundation model built on a general MLLM architecture and trained exclusively on large-scale domain-specific data. MedMO uses a multi-stage training recipe that includes cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone, instruction tuning with multi-task supervision spanning captioning, VQA, report generation, retrieval, and bounding-box disease localization, and reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU signal to improve spatial grounding and step-by-step reasoning in challenging clinical settings. Across modalities and tasks, MedMO surpasses strong…
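The abstract mentions a verifiable reward that combines factuality checks with a box-level GIoU signal, but it does not give the formula. The sketch below shows the standard generalized IoU (GIoU) for two axis-aligned boxes, plus one possible way to blend it with a factuality score. The names `giou` and `combined_reward`, the weight `w_box`, the rescaling of GIoU to [0, 1], and the convention of returning 0.0 for degenerate boxes are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Tuple

# Box format assumed here: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_area(b: Box) -> float:
    """Area of an axis-aligned box; zero if the box is degenerate."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def giou(pred: Box, target: Box) -> float:
    """Generalized IoU in [-1, 1]: IoU minus the fraction of the smallest
    enclosing box that is not covered by the union of the two boxes."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(pred) + box_area(target) - inter

    # Smallest axis-aligned box enclosing both inputs.
    cx1, cy1 = min(pred[0], target[0]), min(pred[1], target[1])
    cx2, cy2 = max(pred[2], target[2]), max(pred[3], target[3])
    hull = (cx2 - cx1) * (cy2 - cy1)

    if union <= 0.0 or hull <= 0.0:
        return 0.0  # assumed convention for degenerate boxes
    return inter / union - (hull - union) / hull


def combined_reward(factuality: float, pred: Box, target: Box,
                    w_box: float = 0.5) -> float:
    """Hypothetical blend of a factuality score in [0, 1] with GIoU.
    GIoU is rescaled from [-1, 1] to [0, 1] so both terms share a range."""
    box_score = (giou(pred, target) + 1.0) / 2.0
    return (1.0 - w_box) * factuality + w_box * box_score


if __name__ == "__main__":
    print(giou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
    print(giou((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes, negative value
    print(combined_reward(1.0, (0, 0, 2, 2), (1, 1, 3, 3)))
```

Unlike plain IoU, GIoU stays informative when the predicted and reference boxes do not overlap, since the enclosing-box penalty still shrinks as the boxes move closer. That property is a common reason to use it as a localization training signal.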

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Machine Learning in Healthcare