Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA

Itbaan Safwan; Muhammad Annas Shaikh; Muhammad Haaris; Ramail Khan; Muhammad Atif Tahir

arXiv:2511.04384 · cs.CV · November 7, 2025

TL;DR

This paper introduces a multi-task learning framework using a LoRA-tuned Florence-2 model for medical visual question answering, explanation, and grounding, achieving improved accuracy and interpretability in gastrointestinal VQA.

Contribution

The paper presents a multi-task approach that trains VQA, explanation generation, and visual grounding together on three curated datasets, with the aim of improving both the accuracy and the interpretability of medical VQA.

Findings

01. Significant improvement over single-task baselines in answer accuracy.

02. Enhanced visual grounding and interpretability.

03. Effective multi-task learning for medical VQA applications.

Abstract

We present a multi-task framework for the MediaEval Medico 2025 challenge, leveraging a LoRA-tuned Florence-2 model for simultaneous visual question answering (VQA), explanation generation, and visual grounding. The proposed system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer learning, (2) a synthetically enriched explanation dataset offering structured medical reasoning, and (3) text-to-region pairs linking visual features with segmentation masks. This multi-task setup enables the model to jointly learn visual grounding, reasoning, and interpretation, producing responses that are both accurate and interpretable. Extensive evaluation demonstrates that our approach substantially improves over single-task baselines in both answer accuracy and visual localization, highlighting the effectiveness of grounded multi-task learning for medical VQA applications.
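The three-dataset setup described in the abstract can be pictured as a single training stream in which each example carries a task marker. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the task prefixes (`<VQA>`, `<EXPLAIN>`, `<GROUND>`), the `Example` record, the `mix_tasks` sampler, the sampling weights, and the toy records are all hypothetical and chosen only for illustration. The abstract does not specify the prompt format or the mixing ratio.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    task: str      # "vqa", "explain", or "ground"
    image_id: str  # identifier of the image paired with the text
    prompt: str    # text given to the model alongside the image
    target: str    # text the model is trained to generate


# Hypothetical task prefixes; the paper's real prompt format is not stated here.
TASK_PREFIX = {"vqa": "<VQA>", "explain": "<EXPLAIN>", "ground": "<GROUND>"}


def make_example(task, image_id, text, target):
    """Prepend the task prefix so one model can tell the three tasks apart."""
    return Example(task, image_id, f"{TASK_PREFIX[task]} {text}", target)


def mix_tasks(datasets, weights, n, seed=0):
    """Draw n examples, choosing the task for each draw in proportion to its weight."""
    rng = random.Random(seed)
    tasks = sorted(datasets)
    probs = [weights[t] for t in tasks]
    stream = []
    for _ in range(n):
        task = rng.choices(tasks, weights=probs, k=1)[0]
        stream.append(rng.choice(datasets[task]))
    return stream


# Toy placeholder records (made up for illustration, not taken from any dataset).
datasets = {
    "vqa": [make_example("vqa", "img_001", "Is a polyp visible?", "yes")],
    "explain": [make_example("explain", "img_001", "Explain the answer.",
                             "placeholder explanation text")],
    "ground": [make_example("ground", "img_001", "polyp",
                            "placeholder region description")],
}

# Equal weights are an assumption; any non-negative weights with a positive sum work.
batch = mix_tasks(datasets, {"vqa": 1.0, "explain": 1.0, "ground": 1.0}, n=6)
for ex in batch:
    print(ex.task, "|", ex.prompt, "->", ex.target)
```

Sampling by task weight rather than concatenating the datasets keeps a small dataset (for example, the text-to-region pairs) from being drowned out by a larger one, which is one common motivation for this kind of mixing.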

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning