VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection

Zeyi Huang; Yuyang Ji; Anirudh Sundara Rajan; Zefan Cai; Wen Xiao; Haohan Wang; Junjie Hu; Yong Jae Lee

arXiv:2505.20289 · cs.CV · July 22, 2025

Open Access

TL;DR

VisTA introduces a reinforcement learning framework enabling visual agents to dynamically explore and select tools from a diverse library, significantly improving reasoning performance and generalization on visual question-answering benchmarks.

Contribution

It presents a novel RL-based approach for active tool selection in visual reasoning, overcoming limitations of prior prompting and fine-tuning methods.

Findings

01. Achieves performance gains over baselines on multiple benchmarks.
02. Enhances generalization to out-of-distribution examples.
03. Demonstrates effective tool utilization and adaptive reasoning.

Abstract

We introduce VisTA, a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. Existing methods for tool-augmented reasoning either rely on training-free prompting or large-scale fine-tuning; both lack active tool exploration and typically assume limited tool diversity, and fine-tuning methods additionally demand extensive human supervision. In contrast, VisTA leverages end-to-end reinforcement learning to iteratively refine sophisticated, query-specific tool selection strategies, using task outcomes as feedback signals. Through Group Relative Policy Optimization (GRPO), our framework enables an agent to autonomously discover effective tool-selection pathways without requiring explicit reasoning supervision. Experiments on the ChartQA, Geometry3K, and BlindTest benchmarks…
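The abstract names Group Relative Policy Optimization (GRPO) and says task outcomes serve as the feedback signal. As a rough illustration only, the sketch below shows the group-relative normalization step commonly associated with GRPO: several candidate tool selections are sampled for one query, each receives an outcome reward, and each reward is standardized against its own group. The function name, the binary rewards, and the use of the population standard deviation are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    Hypothetical helper: the name and the choice of population standard
    deviation are assumptions for illustration, not the paper's code.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed outcome rewards for four sampled tool selections on one query:
# 1.0 if the final answer was correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Selections that beat the group average receive a positive advantage and the rest a negative one, so no separate value model is needed to form a baseline. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.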

Taxonomy

Topics: Reinforcement Learning in Robotics · Data Stream Mining Techniques · Anomaly Detection Techniques and Applications

MethodsLib