Querying GI Endoscopy Images: A VQA Approach

Gaurav Parajuli

arXiv:2507.21165 · eess.IV · July 30, 2025

TL;DR

This paper explores adapting a multimodal language model to answer visual questions about gastrointestinal endoscopy images, aiming to improve diagnostic AI tools in medical imaging.

Contribution

It evaluates the Florence2 model's adaptation for GI endoscopy VQA tasks and assesses its performance with standard NLP metrics.

Findings

01 The adapted model shows potential in medical VQA tasks

02 Performance is evaluated with standard metrics (ROUGE, BLEU, METEOR)

03 The study highlights the challenges of domain-specific VQA

Abstract

Visual Question Answering (VQA) combines Natural Language Processing (NLP) with image understanding to answer questions about a given image. It has enormous potential for the development of medical diagnostic AI systems: such a system can help clinicians diagnose gastrointestinal (GI) diseases accurately and efficiently. Although many of the multimodal LLMs available today have excellent VQA capabilities in the general domain, they perform very poorly on VQA tasks in specialized domains such as medical imaging. This study is a submission to ImageCLEFmed-MEDVQA-GI 2025 subtask 1 and explores the adaptation of the Florence2 model to answer medical visual questions on GI endoscopy images. We also evaluate the model's performance using standard metrics such as ROUGE, BLEU, and METEOR.
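To make the evaluation step concrete, the sketch below shows how two of the overlap metrics named in the abstract can be computed for a single generated answer. It is a simplified illustration, not the paper's scoring pipeline, which is not described on this page: it assumes whitespace tokenization, a single reference, ROUGE-L reported as an F1 score, and BLEU restricted to unigrams (BLEU-1). The function names and the example question-answer strings are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real scorers may differ.
    return text.lower().split()


def lcs_length(a, b):
    # Length of the longest common subsequence, by dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    # ROUGE-L as the F1 of LCS-based precision and recall.
    ref, cand = tokenize(reference), tokenize(candidate)
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def bleu1(reference, candidate):
    # Unigram BLEU: clipped unigram precision times the brevity penalty.
    ref, cand = tokenize(reference), tokenize(candidate)
    if not ref or not cand:
        return 0.0
    ref_counts = Counter(ref)
    cand_counts = Counter(cand)
    clipped = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = clipped / len(cand)
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * precision


# Hypothetical ground-truth and model answers, not taken from the dataset.
reference_answer = "one polyp in the colon"
model_answer = "a polyp in the colon"
print(round(rouge_l_f1(reference_answer, model_answer), 3))
print(round(bleu1(reference_answer, model_answer), 3))
```

Both metrics reward surface overlap with the reference answer, so a clinically equivalent answer phrased differently can still score low. METEOR, the third metric mentioned, additionally matches stems and synonyms and is omitted here because it depends on external lexical resources.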
