Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs
Yan Shu, Chi Liu, Robin Chen, Derek Li, Bryan Dai

TL;DR
Fleming-VL is a unified multimodal medical AI framework that handles diverse medical data types (2D images, 3D volumetric scans, and videos) and achieves state-of-the-art results on several medical benchmarks through extensive pretraining and fine-tuning.
Contribution
The paper introduces Fleming-VL, a unified end-to-end framework for medical visual reasoning across heterogeneous modalities that addresses the domain gaps and data format inconsistencies between them.
Findings
Achieves state-of-the-art performance on medical VQA, video QA, and 3D image understanding benchmarks.
Effectively integrates long-context data from natural and medical domains for pretraining.
Demonstrates the benefits of combining supervised fine-tuning (SFT) with group relative policy optimization (GRPO).
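Fleming-VL's exact GRPO setup (reward design, group size, KL term) is not described on this page. As a minimal sketch of what "group relative" means in GRPO as commonly formulated: for each prompt, the policy samples a group of G responses, and each response's advantage is its reward standardized against the group's mean and standard deviation, so no learned value model is needed. The binary correctness reward, the population standard deviation, and the `eps` stabilizer below are assumptions for illustration; implementations differ on these details.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize each sampled response's reward against its own group.

    rewards: scalar rewards for the G responses sampled for one prompt.
    Assumption (hypothetical): 1.0 if a VQA answer matches the reference, else 0.0.
    """
    mean = statistics.fmean(rewards)
    # Population std over the group; eps avoids division by zero
    # when every response in the group receives the same reward.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Example: two correct and two incorrect answers to the same question.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

With rewards [1, 0, 1, 0], correct answers receive an advantage close to +1 and incorrect ones close to -1. If every response in a group gets the same reward, all advantages are 0 and that group contributes no policy-gradient signal.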
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable effectiveness in various general-domain scenarios, such as visual question answering and image captioning. Recently, researchers have increasingly focused on empowering MLLMs with medical conversational abilities, which hold significant promise for clinical applications. However, medical data presents unique challenges due to its heterogeneous nature, encompassing diverse modalities that include 2D images, 3D volumetric scans, and temporal video sequences. The substantial domain gap and data format inconsistencies across these modalities have hindered the development of unified medical MLLMs. To address these challenges, we propose Fleming-VL, a unified end-to-end framework for comprehensive medical visual understanding across heterogeneous modalities. Fleming-VL tackles this problem from a data-centric perspective…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
