Reinforcing VLMs to Use Tools for Detailed Visual Reasoning Under Resource Constraints
Sunil Kumar, Bowen Zhao, Leo Dirac, Paulina Varshavskaya

TL;DR
This paper trains smaller vision-language models (VLMs) with Group Relative Policy Optimization (GRPO) to call external tools such as zoom, improving detailed visual reasoning when compute resources are limited.
Contribution
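The paper proposes a training approach that combines GRPO, a simple reward structure, a simplified tool-calling interface, additional tokens allocated to the tool-call result, and a training data mix that over-represents visually difficult examples, all aimed at improving VLMs' ability to use external tools for visual reasoning.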
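To make the idea of a simplified zoom tool concrete, here is a minimal sketch of what such a tool might look like. The function names (`zoom_box`, `zoom`), the normalized-center-plus-factor signature, and the nested-list image representation are all hypothetical choices made for illustration; the source does not specify the actual interface used in the paper.

```python
def zoom_box(width, height, cx, cy, factor):
    """Compute a crop box (left, top, right, bottom) for a zoom request.

    Hypothetical interface: (cx, cy) is the requested center in normalized
    [0, 1] coordinates and factor >= 1 is the magnification.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    # The crop shrinks as the zoom factor grows, but keeps at least one pixel.
    crop_w = max(1, round(width / factor))
    crop_h = max(1, round(height / factor))
    # Center the crop on the requested point.
    left = round(cx * width - crop_w / 2)
    top = round(cy * height - crop_h / 2)
    # Shift the crop back inside the image if it overhangs an edge.
    left = min(max(left, 0), width - crop_w)
    top = min(max(top, 0), height - crop_h)
    return (left, top, left + crop_w, top + crop_h)


def zoom(image, cx, cy, factor):
    """Crop a row-major nested-list image around (cx, cy)."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = zoom_box(width, height, cx, cy, factor)
    return [row[left:right] for row in image[top:bottom]]


# Usage: zoom 2x into the center of a 4x4 toy image.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
center_crop = zoom(toy_image, 0.5, 0.5, 2)
```

A small argument set like this keeps the call easy for a small model to emit correctly. In a real pipeline the crop would presumably be resized and returned to the model as a new image, which is where allocating extra tokens to the tool result would apply.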
Findings
Improved performance on some visual question-answering (VQA) tasks.
Detailed visual information gathered through an external tool such as zoom drives the improvement.
The method outperforms similarly sized baseline models on those tasks.
Abstract
Despite tremendous recent advances in large model reasoning ability, vision-language models (VLMs) still struggle with detailed visual reasoning, especially when compute resources are limited. To address this challenge, we draw inspiration from methods like Deepseek-r1 for VLMs and train smaller-scale models with Group Relative Policy Optimization (GRPO) to use external tools such as zoom. The greatest benefit is obtained with a combination of GRPO learning, a simple reward structure, a simplified tool-calling interface, allocating additional tokens to the result of the tool call, and a training data mix that over-represents visually difficult examples. Compared to similarly-sized baseline models, our method achieves better performance on some visual question-answering (VQA) tasks, thanks to the detailed visual information gathered from the external tool.
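For readers unfamiliar with GRPO, the following is a minimal sketch of its core step, the group-relative advantage: several responses are sampled for the same prompt, each receives a scalar reward, and each reward is normalized against the group's mean and standard deviation. The function name, the `eps` constant, the use of the population standard deviation, and the binary example rewards are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampling group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (assumed value) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: four sampled answers to one question, rewarded 1.0 if correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers get a positive advantage, incorrect ones a negative one.

# If all answers score the same, no response is preferred over another.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because advantages are computed relative to the group, no separate value network is needed, which is one reason the approach suits limited compute.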
Taxonomy
Topics: Constraint Satisfaction and Optimization · Data Visualization and Analytics
