DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong; Chenxiao Zhao; ChengLin Zhu; Weiheng Lu; Guohai Xu; Xing Yu

arXiv:2511.05271·cs.CV·March 12, 2026

TL;DR

DeepEyesV2 is a multimodal model capable of understanding and actively invoking external tools like code execution and web search, with a novel two-stage training process and a new benchmark for real-world reasoning tasks.

Contribution

The paper introduces DeepEyesV2, a two-stage training pipeline for agentic multimodal models, and the RealX-Bench benchmark for evaluating complex reasoning involving tool use.

Findings

01

DeepEyesV2 effectively integrates perception, search, and reasoning capabilities.

02

Reinforcement learning refines tool invocation and enables complex tool combinations.

03

DeepEyesV2 demonstrates strong performance on real-world reasoning benchmarks.

Abstract

Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which…
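The loop the abstract describes, in which a model interleaves tool calls with reasoning, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the action format, the names `run_agent`, `toy_code_tool`, `toy_search_tool`, and `scripted_policy`, and the turn budget are all hypothetical, and a scripted function stands in for the model.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical action format (not from the paper):
#   ("tool", tool_name, argument)  or  ("answer", text)
Action = Tuple[str, ...]


@dataclass
class Trajectory:
    """Question plus the tool observations gathered so far."""
    question: str
    observations: List[str] = field(default_factory=list)


def run_agent(
    question: str,
    policy: Callable[[Trajectory], Action],
    tools: Dict[str, Callable[[str], str]],
    max_turns: int = 4,
) -> Tuple[str, Trajectory]:
    """Alternate between policy decisions and tool calls until an answer."""
    traj = Trajectory(question)
    for _ in range(max_turns):
        action = policy(traj)
        if action[0] == "answer":
            return action[1], traj
        _, name, arg = action
        tool = tools.get(name)
        # An unknown tool yields an error observation rather than a crash,
        # so the policy can react to it on the next turn.
        result = tool(arg) if tool else f"error: unknown tool {name!r}"
        traj.observations.append(f"{name}: {result}")
    return "no answer within turn budget", traj


def toy_code_tool(expr: str) -> str:
    """Toy stand-in for code execution: evaluates 'a op b' on integers."""
    ops = {"+": operator.add, "-": operator.sub, "*": operator.mul}
    a, op, b = expr.split()
    return str(ops[op](int(a), int(b)))


def toy_search_tool(query: str) -> str:
    """Toy stand-in for web search: looks up a fixed, made-up fact table."""
    facts = {"height of tower A": "330"}
    return facts.get(query, "no result")


def scripted_policy(traj: Trajectory) -> Action:
    """Scripted stand-in for the model: search, then compute, then answer."""
    if len(traj.observations) == 0:
        return ("tool", "search", "height of tower A")
    if len(traj.observations) == 1:
        height = traj.observations[0].split(": ", 1)[1]
        return ("tool", "code", f"{height} * 2")
    return ("answer", traj.observations[-1].split(": ", 1)[1])


TOOLS = {"code": toy_code_tool, "search": toy_search_tool}
```

Calling `run_agent("What is twice the height of tower A?", scripted_policy, TOOLS)` chains a search call into a computation before answering. In a trained system, the policy would be the model itself, and the two-stage pipeline described above would shape when and how it emits tool actions.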

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- The work clearly articulates the challenge of building agentic multimodal models and distinguishes itself from prior “thinking-with-image” paradigms by integrating multiple heterogeneous tools (code + search) in a unified reasoning loop.
- The proposed two-stage pipeline (cold-start + RL) is well-motivated and empirically justified, addressing the instability of direct RL for tool learning.
- DeepEyesV2 consistently outperforms both general-purpose and tool-augmented baselines, often matching

Weaknesses

- The success of the approach relies heavily on the curated cold-start dataset. Details about data sources, annotation quality, and scalability are somewhat underexplored.
- Most benchmarks are vision-centric. It remains unclear how well the approach generalizes to other modalities such as audio, video, or 3D tasks.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. Good presentation. The paper demonstrates task-adaptive tool invocation (image ops for perception, computation for reasoning) and a practical two-stage training approach (cold-start SFT + RL refinement).
2. Task-Adaptive Behavior: The finding that models learn to selectively invoke different tools based on task requirements (image operations for perception, computation for reasoning) is interesting and suggests genuine understanding rather than blind tool use.

Weaknesses

1. Insufficient analysis of what RL learns: The paper lacks detailed examination of how tool-use patterns evolve during training, what new behaviors emerge, and when the model makes mistakes in tool invocation. Learning curves and failure analysis would strengthen the claims.
2. Inadequate efficiency analysis: Missing quantitative data on tool call frequency, success rates, and computational overhead.
3. Unclear generalization to novel tools: Evaluation limited to a fixed tool set. Whether the

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- This paper is well written and easy to follow.
- The paper goes far beyond reporting final scores by providing deep insights into the learning dynamics.
- The discovery of "Adaptive Thinking": the SFT model over-relies on tools, and the RL stage teaches it when not to use them.
- Performance is strong: the authors validate DeepEyesV2 across a wide and diverse range of benchmarks, covering real-world understanding, mathematical reasoning, and search-intensive tasks.

Weaknesses

- A substantive weakness of the paper is its limited methodological novelty, as its core SFT + RL two-stage training paradigm and reasoning CoT data curation are well-established approaches for reasoning-based models, making the work feel more like a high-quality technical report than a novel research contribution.
- The paper correctly observes that the SFT model "over-relies on tools" but fails to investigate if this is merely a statistical artifact of the cold-start data's high tool-calling d

Code & Models

Datasets

glowol/RealXBench
dataset· 60 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning