Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Jiaqi Liu; Kaiwen Xiong; Peng Xia; Yiyang Zhou; Haonian Ji; Lu Feng; Siwei Han; Mingyu Ding; Huaxiu Yao

arXiv:2511.19900·cs.CV·November 27, 2025

TL;DR

Agent0-VL is a self-evolving vision-language agent that uses tools both to reason and to verify its own reasoning, allowing it to improve continually without human supervision; it reports a 12.5% improvement on geometric problem solving.

Contribution

The paper introduces a self-evolving framework that combines tool-based reasoning with tool-grounded verification, enabling continual self-improvement without external rewards or human annotations.

Findings

01. Achieves 12.5% improvement on geometric problem solving.

02. Utilizes tool-grounded critique for self-reward and verification.

03. Operates without human annotations or external reward models.

Abstract

Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It…
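
The abstract's idea of using tools for self-evaluation and self-repair, not only for reasoning, can be illustrated with a toy sketch. This is not the paper's implementation: the `Step` format, the `calculator_tool` stand-in, and the binary reward are all assumptions made for illustration, and a real system would have a vision-language model produce and critique the steps instead of a hard-coded trajectory.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One reasoning step (hypothetical format, not from the paper)."""
    claim: str        # natural-language description of the step
    expression: str   # arithmetic the step relies on
    value: float      # value the reasoner asserted for that arithmetic


def calculator_tool(expression: str) -> float:
    """Stand-in for an external tool: evaluates plain arithmetic only."""
    allowed = set("0123456789.+-*/() ")
    if not set(expression) <= allowed:
        raise ValueError(f"unsupported expression: {expression!r}")
    return float(eval(expression, {"__builtins__": {}}, {}))


def verify(steps: List[Step], tool: Callable[[str], float],
           tol: float = 1e-6) -> List[float]:
    """Tool-grounded critique: reward 1.0 if the tool agrees with the step, else 0.0."""
    return [1.0 if abs(tool(s.expression) - s.value) <= tol else 0.0
            for s in steps]


def repair(steps: List[Step], rewards: List[float],
           tool: Callable[[str], float]) -> List[Step]:
    """Self-repair: replace the asserted value of each failed step with the tool's result."""
    return [s if r == 1.0 else Step(s.claim, s.expression, tool(s.expression))
            for s, r in zip(steps, rewards)]


# A hard-coded trajectory with one deliberate arithmetic error.
trajectory = [
    Step("area of a 3 x 4 rectangle", "3 * 4", 12.0),
    Step("half of that area for the triangle", "12 / 2", 5.0),
]

rewards = verify(trajectory, calculator_tool)
repaired = repair(trajectory, rewards, calculator_tool)
```

In this sketch the per-step rewards come from checking each claim against tool output, not from the reasoner's own text-only judgment, which is the contrast the abstract draws with purely text-based self-evaluation.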

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Topic Modeling