Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy
Sushant Gautam, Michael A. Riegler, and Pål Halvorsen

TL;DR
Kvasir-VQA-x1 is a large, clinically relevant multimodal dataset for gastrointestinal endoscopy that enhances MedVQA research by including complex, reasoning-focused questions and visual artifacts to improve model robustness.
Contribution
The paper introduces Kvasir-VQA-x1, which expands the original Kvasir-VQA GI endoscopy dataset with 159,549 reasoning-oriented question-answer pairs, stratified by complexity, and with visual augmentations that mimic common imaging artifacts.
Findings
Includes 159,549 new question-answer pairs
Supports evaluation of model robustness to visual artifacts
Facilitates development of clinically reliable AI systems
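One way to read "robustness to visual artifacts" is as the drop in accuracy when a model answers the same questions on augmented images instead of the originals. The sketch below illustrates that idea only. The `Record` fields, the exact-match scoring, and the function names are assumptions made for illustration, not the paper's evaluation protocol or metrics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """One evaluated question (hypothetical layout, not the dataset schema)."""
    question_id: str
    answer: str          # reference answer
    pred_original: str   # model output on the unmodified image
    pred_augmented: str  # model output on the artifact-augmented image


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences
    # do not count as errors. Exact match is an assumed, simplistic metric.
    return " ".join(text.lower().split())


def accuracy(records: List[Record], field: str) -> float:
    """Exact-match accuracy of the predictions stored in `field`."""
    if not records:
        raise ValueError("records must not be empty")
    hits = sum(
        _normalize(getattr(r, field)) == _normalize(r.answer) for r in records
    )
    return hits / len(records)


def robustness_gap(records: List[Record]) -> float:
    """Accuracy on original images minus accuracy on augmented images."""
    return accuracy(records, "pred_original") - accuracy(records, "pred_augmented")


# Toy, made-up records purely to exercise the functions.
demo = [
    Record("q1", "polyp", "Polyp", "polyp"),
    Record("q2", "yes", "yes", "no"),
    Record("q3", "two", "two", "two"),
    Record("q4", "ulcerative colitis", "polyp", "polyp"),
]
gap = robustness_gap(demo)
```

A positive gap means the model loses accuracy under the augmentations, and a gap near zero suggests its answers are stable. Free-text answers would likely need a softer metric than exact match.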
Abstract
Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model's inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
