Reinforcing VLMs to Use Tools for Detailed Visual Reasoning Under Resource Constraints

Sunil Kumar; Bowen Zhao; Leo Dirac; Paulina Varshavskaya

arXiv:2506.14821 · cs.LG · August 6, 2025

TL;DR

This paper trains smaller vision-language models (VLMs) with Group Relative Policy Optimization (GRPO) to use external tools such as zoom, improving detailed visual reasoning when compute resources are limited.

Contribution

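The paper proposes a training approach that combines GRPO with a simple reward structure, a simplified tool-calling interface, additional tokens allocated to the result of the tool call, and a training data mix that over-represents visually difficult examples, so that VLMs learn to use external tools for visual reasoning.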
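To make the idea of a zoom tool concrete, here is a minimal sketch of what such a tool could look like. It is not the paper's interface: the function name `zoom`, the pixel-coordinate `box` argument, and the `factor` parameter are all hypothetical, and the image is a plain list of rows to keep the example dependency-free. Upsampling the crop is one possible way to give the tool result more room in the model's input; the summary above does not say how the extra tokens are allocated.

```python
def zoom(image, box, factor=2):
    """Crop `box` from `image` and enlarge it by nearest-neighbor upsampling.

    Hypothetical illustration only. `image` is a list of rows of pixel
    values, and `box` is (x0, y0, x1, y1) in pixel coordinates with the
    right and bottom edges exclusive.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box

    # Clamp the requested region to the image bounds.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    if x0 >= x1 or y0 >= y1:
        raise ValueError("empty crop region")

    crop = [row[x0:x1] for row in image[y0:y1]]

    # Repeat every pixel `factor` times horizontally and every row
    # `factor` times vertically.
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in crop
        for _ in range(factor)
    ]


# A 4x4 toy "image"; zoom into the central 2x2 region.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
zoomed = zoom(image, (1, 1, 3, 3), factor=2)
```

In a tool-use loop, the model would emit the box, the environment would run a function like this, and the enlarged crop would be fed back as a new image for the next reasoning step.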

Findings

01 Improved performance on some visual question-answering (VQA) tasks.

02 Detailed visual information gathered from an external tool such as zoom supports detailed visual reasoning.

03 The method outperforms similarly sized baseline models on those tasks.

Abstract

Despite tremendous recent advances in large model reasoning ability, vision-language models (VLMs) still struggle with detailed visual reasoning, especially when compute resources are limited. To address this challenge, we draw inspiration from methods like Deepseek-r1 for VLMs and train smaller-scale models with Group Relative Policy Optimization (GRPO) to use external tools such as zoom. The greatest benefit is obtained with a combination of GRPO learning, a simple reward structure, a simplified tool-calling interface, allocating additional tokens to the result of the tool call, and a training data mix that over-represents visually difficult examples. Compared to similarly-sized baseline models, our method achieves better performance on some visual question-answering (VQA) tasks, thanks to the detailed visual information gathered from the external tool.
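The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative normalization step. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are assumptions for illustration; the abstract does not specify the paper's reward values or implementation details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and the eps value are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one question:
# 1.0 for a correct final answer, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive a positive advantage and are reinforced; those below it are discouraged. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal, which is one reason a data mix with harder examples can matter.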

Taxonomy

Topics: Constraint Satisfaction and Optimization · Data Visualization and Analytics