TL;DR
This paper generates reasoning trajectories for six medical VQA benchmarks and trains models in two stages: supervised fine-tuning, then GRPO with a trajectory-aware reward that scores the reasoning process as well as the final answer. It reports higher accuracy with this reward.
Contribution
It proposes a two-stage training framework for medical VQA: supervised fine-tuning on generated reasoning trajectories, followed by Group Relative Policy Optimization (GRPO) with a trajectory-aware reward that supervises the reasoning process rather than only the final answer.
Findings
The trajectory-aware reward improves accuracy from 0.598 to 0.689.
Combining the DTW-based process reward with the exact-match reward improves BERTScore and ROUGE-L.
Generated reasoning datasets and code are publicly available.
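As a rough illustration of how a DTW-based process reward might be combined with an exact-match reward, here is a minimal sketch. The summary does not state how the two terms are weighted or how a DTW distance is turned into a reward, so the exponential mapping, the `scale` and `weight` parameters, and both function names are hypothetical choices made for this example only.

```python
import math


def process_reward(dtw_distance: float, scale: float = 1.0) -> float:
    """Map a non-negative DTW distance to a reward in (0, 1].

    Hypothetical mapping: distance 0 gives reward 1, and the reward decays
    toward 0 as the distance grows. The paper's actual mapping is not given
    in this summary.
    """
    return math.exp(-dtw_distance / scale)


def combined_reward(predicted: str, reference: str,
                    dtw_distance: float, weight: float = 0.5) -> float:
    """Blend an exact-match answer reward with a process reward.

    `weight` is an assumed mixing coefficient, not a value from the paper.
    """
    # Exact match on the final answer, ignoring case and outer whitespace.
    exact = 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0
    return (1.0 - weight) * exact + weight * process_reward(dtw_distance)
```

With these assumed defaults, a correct answer whose reasoning matches the reference exactly scores 1.0, while a wrong answer with matching reasoning still earns the process share of the reward.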
Abstract
Reasoning capabilities are crucial for reliable medical visual question answering (VQA); however, existing datasets rarely include reasoning explanations. We address this by generating reasoning trajectories for six medical VQA benchmarks using the COMCTS algorithm with open-source vision-language models, with an LLM serving as the verification judge. Building on these generated datasets, we propose a two-stage training framework: supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) with a novel process-based reward. While standard approaches rely solely on exact-match rewards for final answers, we introduce a trajectory-aware reward that measures the similarity between generated and ground-truth reasoning processes. Specifically, we embed reasoning steps using sentence transformers and compute the Dynamic Time Warping (DTW) distance between the resulting…
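The trajectory comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract is truncated before it specifies the details, so the cosine local cost, the unnormalized accumulated distance, and the function names are assumptions. The step embeddings are assumed to come from a sentence-transformer model; here they are plain NumPy arrays with one non-zero row per reasoning step.

```python
import numpy as np


def cosine_distance_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T


def dtw_distance(gen_steps: np.ndarray, ref_steps: np.ndarray) -> float:
    """Classic dynamic-programming DTW over two sequences of step embeddings.

    gen_steps has shape (n, d) and ref_steps has shape (m, d). The local
    cost is cosine distance (an assumption made for this sketch).
    """
    cost = cosine_distance_matrix(gen_steps, ref_steps)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three admissible predecessor paths.
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return float(acc[n, m])
```

Because DTW allows one step to align with several, a generated trajectory that repeats or splits a reference step is not penalized for the length mismatch alone; only the content of the aligned steps contributes to the distance.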
