Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs

Yan Shu; Chi Liu; Robin Chen; Derek Li; Bryan Dai

arXiv:2511.00916 · cs.CV · November 4, 2025

Open Access · 2 Models

TL;DR

Fleming-VL is a unified multimodal medical framework that handles 2D images, 3D volumetric scans, and video. It reports state-of-the-art results on medical VQA, video QA, and 3D image understanding benchmarks, using long-context pretraining together with supervised fine-tuning and group relative policy optimization.

Contribution

The paper introduces Fleming-VL, a novel unified framework for medical visual reasoning across heterogeneous modalities, addressing domain gaps and data format inconsistencies.

Findings

01 Achieves state-of-the-art performance on medical VQA, video QA, and 3D image understanding benchmarks.

02 Effectively integrates long-context data from natural and medical domains for pretraining.

03 Demonstrates the benefits of combining supervised fine-tuning with group relative policy optimization.
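
The third finding refers to group relative policy optimization (GRPO). As a rough illustration of the "group relative" idea only, the sketch below scores each sampled answer against the other answers drawn for the same question. The function name, the reward values, the use of the population standard deviation, and the `eps` term are all assumptions made for this example; this page does not describe the paper's reward design or training objective.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: subtracts the group mean and divides by the
    group standard deviation (population form, chosen for simplicity).
    eps avoids division by zero when all rewards are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four sampled answers to one question
# (1.0 = judged correct, 0.0 = judged incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat the group average receive a positive advantage and the rest a negative one, so no separate learned value model is needed to provide a baseline. When every answer in a group gets the same reward, all advantages are zero and that group contributes no learning signal.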

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable effectiveness in various general-domain scenarios, such as visual question answering and image captioning. Recently, researchers have increasingly focused on empowering MLLMs with medical conversational abilities, which hold significant promise for clinical applications. However, medical data presents unique challenges due to its heterogeneous nature -- encompassing diverse modalities including 2D images, 3D volumetric scans, and temporal video sequences. The substantial domain gap and data format inconsistencies across these modalities have hindered the development of unified medical MLLMs. To address these challenges, we propose Fleming-VL, a unified end-to-end framework for comprehensive medical visual understanding across heterogeneous modalities. Fleming-VL tackles this problem from a data-centric perspective…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling