Visual Agentic Reinforcement Fine-Tuning

Ziyu Liu; Yuhang Zang; Yushan Zou; Zijian Liang; Xiaoyi Dong; Yuhang Cao; Haodong Duan; Dahua Lin; Jiaqi Wang

arXiv:2505.14246·cs.CV·May 21, 2025

TL;DR

This paper introduces Visual-ARFT, a reinforcement fine-tuning method that gives large vision-language models agentic abilities such as browsing the web and writing code to manipulate images, and evaluates it on new benchmarks.

Contribution

The work presents Visual-ARFT, a fine-tuning approach that enables LVLMs to search the web in real time and write code to manipulate and analyze images, advancing multi-modal agentic capabilities.

Findings

01. Outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding

02. Achieves gains of +29.3% F1 / +25.9% EM on multi-hop QA benchmarks

03. Surpasses GPT-4o on agentic multi-modal tasks
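
The findings above are reported as F1 and EM (exact match). As a rough illustration of what these two scores usually measure in question answering, here is a minimal sketch that assumes the common SQuAD-style definitions: answers are lowercased, stripped of punctuation and English articles, then compared either as whole strings (EM) or as bags of tokens (F1). The function names and the normalization rules are assumptions made for this example; the page does not state the exact scoring procedure the paper uses.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed SQuAD-style normalization; the paper's rules may differ.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions EM is all-or-nothing, while F1 gives partial credit when a prediction contains the reference answer plus extra words, which is why the two gains in each finding differ.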

Abstract

A key trend in Large Reasoning Models (e.g., OpenAI's o3) is the native agentic ability to use external tools such as web browsers for searching and writing/executing code for image manipulation to think with images. In the open-source research community, while significant progress has been made in language-only agentic abilities such as function calling and tool integration, the development of multi-modal agentic capabilities that involve truly thinking with images, and their corresponding benchmarks, are still less explored. This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through…

Code & Models

Repositories

liuziyu77/visual-rft (PyTorch, official)

Taxonomy

Topics: Visual Attention and Saliency Detection · Computational Geometry and Mesh Generation