GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning

Bo Liu; Xiangyu Zhao; Along He; Yidi Chen; Huazhu Fu; Xiao-Ming Wu

arXiv:2506.17939·cs.CV·October 29, 2025

TL;DR

This paper introduces GEMeX-RMCoT, a medical visual question answering (Med-VQA) dataset in which answers are preceded by region-grounded reasoning steps, together with a verifiable reward mechanism for reinforcement learning post-training, aimed at making clinical decision support models more interpretable and data-efficient.

Contribution

The paper presents a region-aware multimodal chain-of-thought dataset and a reinforcement learning reward mechanism to improve interpretability and data efficiency in medical VQA models.

Findings

01. Achieves comparable performance with only 12.5% of training data
02. Provides fine-grained visual region grounding for explainability
03. Enhances model reliability and interpretability

Abstract

Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multimodal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model outputs. To address these limitations, this work first proposes a Region-Aware Multimodal Chain-of-Thought (RMCoT) dataset, in which each answer is preceded by a sequence of intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between…

Code & Models

Datasets

BoKelvin/GEMeX-ThinkVG
dataset · 60 downloads
