Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy

Sushant Gautam, Michael A. Riegler, and Pål Halvorsen

arXiv:2506.09958 · cs.CV · June 12, 2025

Open Access · 3 models · 1 dataset

TL;DR

Kvasir-VQA-x1 is a large, clinically relevant multimodal dataset for gastrointestinal endoscopy that enhances MedVQA research by including complex, reasoning-focused questions and visual artifacts to improve model robustness.

Contribution

The paper introduces Kvasir-VQA-x1, a significantly expanded GI endoscopy dataset with reasoning-oriented questions and visual augmentations, advancing MedVQA research.

Findings

01. Includes 159,549 new question-answer pairs

02. Supports evaluation of model robustness to visual artifacts

03. Facilitates development of clinically reliable AI systems
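
The robustness evaluation named in the second finding can be illustrated with a minimal sketch. It is not the authors' protocol. The exact-match metric, the `overexpose` perturbation, the grayscale nested-list image format, and the `predict(image, question)` interface are all assumptions made for illustration. The sketch scores a model on clean images, scores it again on perturbed images, and reports the difference.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical formats: an image is a grid of grayscale values in 0..255,
# and a sample is (image, question, reference_answer).
Image = List[List[int]]
Sample = Tuple[Image, str, str]
Predictor = Callable[[Image, str], str]
Perturbation = Callable[[Image], Image]


def overexpose(image: Image, factor: float = 1.5) -> Image:
    """Scale every pixel by `factor` and clip at 255 (an assumed artifact)."""
    return [[min(255, round(p * factor)) for p in row] for row in image]


def exact_match_accuracy(
    predict: Predictor,
    samples: Sequence[Sample],
    perturb: Optional[Perturbation] = None,
) -> float:
    """Fraction of samples whose prediction equals the reference answer,
    ignoring case and surrounding whitespace."""
    if not samples:
        raise ValueError("samples must not be empty")
    correct = 0
    for image, question, answer in samples:
        if perturb is not None:
            image = perturb(image)
        prediction = predict(image, question)
        correct += prediction.strip().lower() == answer.strip().lower()
    return correct / len(samples)


def robustness_gap(
    predict: Predictor, samples: Sequence[Sample], perturb: Perturbation
) -> float:
    """Clean accuracy minus accuracy on perturbed images."""
    clean = exact_match_accuracy(predict, samples)
    perturbed = exact_match_accuracy(predict, samples, perturb)
    return clean - perturbed
```

A positive gap means the model answers less accurately on perturbed images. Any real evaluation would substitute the dataset's own augmentations and metrics for the placeholders used here.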

Abstract

Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model's inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main…

Code & Models

Models

Datasets

SimulaMet/Kvasir-VQA-x1
dataset · 7.9k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning