Med-VRAgent: A Framework for Medical Visual Reasoning-Enhanced Agents
Guangfu Guo, Xiaoqian Lu, Yue Feng

TL;DR
Med-VRAgent enhances medical visual reasoning in VLMs by integrating visual guidance, tree search, and reinforcement learning, leading to improved accuracy and consistency in medical visual question answering tasks.
Contribution
The paper introduces Med-VRAgent, a novel framework combining visual guidance, Monte Carlo Tree Search, and reinforcement learning to improve medical visual reasoning in VLMs.
Findings
Outperforms existing methods on multiple medical VQA benchmarks.
Improves reasoning accuracy and reduces hallucinations.
Enhances VLMs through fine-tuning with feedback from the agent.
Abstract
Visual Language Models (VLMs) achieve promising results in medical reasoning but struggle with hallucinations, vague descriptions, inconsistent logic, and poor localization. To address these issues, we propose an agent framework named Medical Visual Reasoning Agent (Med-VRAgent). The approach is based on the Visual Guidance and Self-Reward paradigms and Monte Carlo Tree Search (MCTS). By combining Visual Guidance with tree search, Med-VRAgent improves the medical visual reasoning capabilities of VLMs. We use the trajectories collected by Med-VRAgent as feedback to further improve performance by fine-tuning the VLMs with the proximal policy optimization (PPO) objective. Experiments on multiple medical VQA benchmarks demonstrate that our method outperforms existing approaches.
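The abstract names the PPO objective without spelling it out. As a rough illustration only, the sketch below computes the standard clipped surrogate objective from the original PPO formulation for a batch of sampled actions. The function name `ppo_clipped_objective`, its arguments, and the default `clip_eps=0.2` are assumptions made for this example; the abstract does not say which PPO variant, reward signal, or hyperparameters Med-VRAgent uses.

```python
import math


def ppo_clipped_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    new_logps / old_logps: log-probabilities of the sampled actions under
    the current policy and under the policy that collected the trajectories.
    advantages: advantage estimates for the same actions.
    All names and the clip_eps default are illustrative assumptions.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the current and the data-collecting policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to the interval [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, the clip caps how much a single update can gain from raising an action's probability; with a negative advantage, the pessimistic minimum keeps the penalty from shrinking when the ratio drops below the clip range.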
Taxonomy
Topics: Multimodal Machine Learning Applications · Machine Learning in Healthcare · Topic Modeling
