Visual Agentic Reinforcement Fine-Tuning
Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang

TL;DR
This paper introduces Visual-ARFT, a reinforcement fine-tuning method that gives large vision-language models agentic abilities such as browsing the web and writing code to manipulate images, and evaluates it on new benchmarks.
Contribution
The work presents Visual-ARFT, a fine-tuning approach that enables LVLMs to run real-time web searches and write code that manipulates input images, advancing multi-modal agentic capabilities.
Findings
Outperforms baseline by +18.6% F1 / +13.0% EM on MAT-Coding
Achieves +29.3% F1 / +25.9% EM gains on multi-hop QA benchmarks
Surpasses GPT-4o in agentic multi-modal tasks
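The F1 and EM figures above are answer-level question-answering scores. As a rough illustration only, the sketch below computes exact match (EM) and token-level F1 in the common SQuAD style (lowercasing, stripping punctuation and English articles). The function names and the normalization rules are assumptions made for this example; this summary does not state how the paper itself scores answers.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumed SQuAD-style normalization; not confirmed as the paper's own.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A benchmark score is then the mean of these per-question values over the evaluation set, usually reported as a percentage.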
Abstract
A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools such as web browsers for searching and writing/executing code for image manipulation to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, the development of multi-modal agentic capabilities that involve truly thinking with images, and their corresponding benchmarks, are still less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through…
Taxonomy
Topics: Visual Attention and Saliency Detection · Computational Geometry and Mesh Generation
