Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025

Sujata Gaihre, Amir Thapa Magar, Prasuna Pokharel, and Laxmi Tiwari

arXiv:2507.14544 · cs.CV · July 22, 2025

TL;DR

This paper presents a multimodal AI approach using the Florence model for visual question answering in gastrointestinal endoscopy, demonstrating promising results and establishing a baseline for future clinical applications.

Contribution

It introduces a novel VQA pipeline leveraging the Florence multimodal model with domain-specific augmentations for medical endoscopy images.

Findings

1. Fine-tuning Florence achieves high accuracy on challenge metrics.
2. Domain-specific augmentations improve model generalization.
3. The approach provides a strong baseline for future medical VQA research.

Abstract

This paper describes our approach to Subtask 1 of the ImageCLEFmed MEDVQA 2025 Challenge, which targets visual question answering (VQA) for gastrointestinal endoscopy. We adopt the Florence model, a large-scale multimodal foundation model, as the backbone of our VQA pipeline, pairing a powerful vision encoder with a text encoder to interpret endoscopic images and produce clinically relevant answers. To improve generalization, we apply domain-specific augmentations that preserve medical features while increasing training diversity. Experiments on the Kvasir dataset show that fine-tuning Florence yields accurate responses on the official challenge metrics. Our results highlight the potential of large multimodal models in medical VQA and provide a strong baseline for future work on explainability, robustness, and clinical integration. The code is publicly available at:…
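The abstract does not list which augmentations were used, so the following is only a minimal sketch of what "augmentations that preserve medical features" could look like. It assumes mild brightness and contrast jitter plus low-level sensor noise, and it deliberately avoids flips and hue shifts, since those could invalidate answers about lesion position or tissue color. The function name `augment_endoscopy_image` and all parameter values are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def augment_endoscopy_image(image, rng, max_brightness=0.1,
                            max_contrast=0.1, noise_std=0.01):
    """Apply mild photometric jitter to an RGB image with values in [0, 1].

    Hypothetical helper: the augmentation set and the default ranges are
    illustrative assumptions, not the settings reported by the authors.
    """
    img = np.asarray(image, dtype=np.float32)

    # Sample a contrast factor near 1 and a small additive brightness shift.
    contrast = 1.0 + rng.uniform(-max_contrast, max_contrast)
    brightness = rng.uniform(-max_brightness, max_brightness)

    # Scale contrast around the per-channel mean so overall tone is kept.
    mean = img.mean(axis=(0, 1), keepdims=True)
    out = (img - mean) * contrast + mean + brightness

    # Add low-amplitude Gaussian noise to mimic sensor variation.
    out = out + rng.normal(0.0, noise_std, size=img.shape)

    # No flips or hue changes: spatial layout and color cues stay intact.
    return np.clip(out, 0.0, 1.0).astype(np.float32)


if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)
    frame = rng.uniform(0.0, 1.0, size=(224, 224, 3)).astype(np.float32)
    augmented = augment_endoscopy_image(frame, rng)
    print(augmented.shape, augmented.dtype)
```

Because the transform is purely photometric, the question and answer paired with each image can be reused unchanged for the augmented copy.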
