Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Shijue Huang, Hangyu Guo, Chenxin Li, Junting Lu, Xinyu Geng, Zhaochen Su, Zhenyu Li, Shuang Chen, Hongru Wang, Yi R. Fung

arXiv:2605.10832·cs.CL·May 12, 2026

TL;DR

This paper introduces a visual-native agent harness built around an image bank reference protocol, together with an on-policy data evolution (ODE) framework, improving multimodal deep search performance across multiple benchmarks (for example, Qwen3-VL-8B rises from 24.9% to 39.0% accuracy).

Contribution

It proposes an image bank reference protocol that makes intermediate visual evidence reusable by later tools, and a closed-loop data generation method (ODE) that keeps training data matched to the target agent's evolving capability.

Findings

01 ODE improves Qwen3-VL-8B from 24.9% to 39.0% accuracy.

02 ODE surpasses Gemini-2.5 Pro in the standard setting (37.9%).

03 Image-bank reuse benefits complex iterative visual tasks.

Abstract

Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent's evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from…
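The image bank reference protocol described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: the class `ImageBank`, the helper `run_tool`, the `img_N` reference format, and the two stand-in tools are all hypothetical names chosen for illustration, and images are modeled as raw bytes. The sketch only shows the core idea that every tool-returned image gets an addressable reference that a later tool can resolve.

```python
from typing import Callable, Dict, List


class ImageBank:
    """Hypothetical registry: maps reference IDs to image payloads."""

    def __init__(self) -> None:
        self._images: Dict[str, bytes] = {}
        self._sources: Dict[str, str] = {}

    def register(self, data: bytes, source: str) -> str:
        # Assumed reference format "img_N"; the paper's format may differ.
        ref = f"img_{len(self._images) + 1}"
        self._images[ref] = data
        self._sources[ref] = source
        return ref

    def resolve(self, ref: str) -> bytes:
        # Later tools look images up by reference instead of losing them.
        return self._images[ref]

    def source_of(self, ref: str) -> str:
        return self._sources[ref]


def run_tool(bank: ImageBank, tool: Callable[..., List[bytes]], *args) -> List[str]:
    """Run a tool and register every image it returns; return the new refs."""
    return [bank.register(img, source=tool.__name__) for img in tool(*args)]


def fake_search(query: str) -> List[bytes]:
    # Stand-in for an image search tool.
    return [f"image for {query}".encode()]


def fake_crop(bank: ImageBank, ref: str, start: int, end: int) -> List[bytes]:
    # Stand-in for a transformation tool that consumes an earlier reference.
    return [bank.resolve(ref)[start:end]]


bank = ImageBank()
(search_ref,) = run_tool(bank, fake_search, "eiffel tower")
(crop_ref,) = run_tool(bank, fake_crop, bank, search_ref, 0, 5)
```

Here the crop tool consumes the search result through its reference, and the cropped output is itself registered, so a third tool could pick it up in turn.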

Code & Models

Repositories

joeying1019/ODE (GitHub)
