TL;DR
MedVR is a reinforcement learning framework that lets medical vision-language models perform visual reasoning without human annotations for intermediate steps, improving accuracy and robustness on clinical tasks.
Contribution
MedVR introduces a novel annotation-free visual reasoning method for medical VLMs using entropy-guided regrounding and consensus-based credit assignment.
Findings
Achieves state-of-the-art results on diverse public medical VQA benchmarks.
Outperforms existing models while using no human annotations for intermediate reasoning steps.
Enhances robustness and transparency for clinical deployment.
Abstract
Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA…
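The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the entropy threshold, the binary reward, and the use of simple majority voting are all assumptions chosen for illustration. It only shows the general ideas of (a) measuring uncertainty as the entropy of a next-token distribution to decide when to look at the image again, and (b) deriving a pseudo-reward from agreement among several sampled rollouts.

```python
import math
from collections import Counter


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_reground(probs, threshold=1.0):
    """Hypothetical trigger: revisit the image when uncertainty is high.

    The threshold value is an assumption, not a number from the paper.
    """
    return token_entropy(probs) > threshold


def consensus_rewards(answers):
    """Assumed pseudo-supervision: reward rollouts that match the majority answer.

    Returns 1.0 for rollouts agreeing with the most common final answer
    and 0.0 otherwise. No ground-truth labels are used.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


if __name__ == "__main__":
    uncertain = [0.25, 0.25, 0.25, 0.25]   # flat distribution, high entropy
    confident = [0.97, 0.01, 0.01, 0.01]   # peaked distribution, low entropy
    print(should_reground(uncertain), should_reground(confident))
    print(consensus_rewards(["A", "B", "A", "A"]))
```

In this sketch a flat distribution triggers regrounding while a peaked one does not, and the single dissenting rollout receives zero reward. How MedVR actually selects regions, sets thresholds, or assigns credit across reasoning steps is not specified in the text above.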