DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu

TL;DR
DeepEyes is a model trained with reinforcement learning to integrate visual information into its reasoning process, mirroring human cognition. It improves multimodal understanding and reasoning without pre-collected reasoning data for cold-start supervised fine-tuning.
Contribution
We introduce DeepEyes, a reinforcement learning framework in which the model learns visual reasoning natively, without pre-collected reasoning data for supervised fine-tuning and without external specialized models or APIs.
Findings
Significant performance improvements on perception and reasoning benchmarks.
Enhanced grounding, reduced hallucination, and better mathematical reasoning.
Emergence of human-like visual reasoning patterns.
Abstract
Large Vision-Language Models excel at multimodal understanding but struggle to deeply integrate visual information into their predominantly text-based reasoning processes, a key challenge in mirroring human cognition. To address this, we introduce DeepEyes, a model that learns to "think with images", trained end-to-end with reinforcement learning without requiring pre-collected reasoning data for cold-start supervised fine-tuning (SFT). Notably, this ability emerges natively, leveraging the model's own grounding capability as an intrinsic function rather than relying on external specialized models or APIs. We enable this capability through active perception, where the model learns to strategically ground its reasoning in visual information, guided by a tailored data selection and reward strategy. DeepEyes achieves significant performance gains on general perception and reasoning…
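The reviews on this page describe the active-perception step as a zoom-in tool that works on image crops selected by bounding boxes. A minimal sketch of such a crop step follows, assuming the model emits a pixel-space box `(x1, y1, x2, y2)` with exclusive lower-right corner. The names `clamp_bbox` and `zoom_in` are hypothetical, and the image is a plain nested list so the example needs no imaging library. This is not the authors' implementation.

```python
from typing import List, Tuple

# Assumed box convention: (x1, y1, x2, y2) in pixels, x2/y2 exclusive.
BBox = Tuple[int, int, int, int]


def clamp_bbox(bbox: BBox, width: int, height: int) -> BBox:
    """Reorder and clip a predicted box so it is a non-empty region inside the image."""
    x1, y1, x2, y2 = bbox
    x1, x2 = sorted((x1, x2))
    y1, y2 = sorted((y1, y2))
    x1 = max(0, min(x1, width - 1))
    y1 = max(0, min(y1, height - 1))
    x2 = max(x1 + 1, min(x2, width))   # keep at least one column
    y2 = max(y1 + 1, min(y2, height))  # keep at least one row
    return x1, y1, x2, y2


def zoom_in(image: List[List[int]], bbox: BBox) -> List[List[int]]:
    """Return the crop selected by bbox; image is a list of pixel rows."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = clamp_bbox(bbox, width, height)
    return [row[x1:x2] for row in image[y1:y2]]


# Toy 4x4 "image" whose pixel value encodes its position (row * 4 + column).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
crop = zoom_in(image, (1, 1, 3, 3))
```

In a reasoning loop, the crop would be encoded and appended to the context as a new observation before the model continues. How crops are tokenized is not specified on this page, so that part is left out.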
Peer Reviews
Decision·ICLR 2026 Poster
1. The paper is technically solid. The results (Tables 1–9) demonstrate consistent and notable improvements over relevant baselines in high-resolution perception (e.g., HR-Bench-8K), general reasoning, visual grounding, hallucination mitigation, and multimodal mathematical reasoning. The empirical rigor is further supported by comprehensive ablation studies and scaling experiments. 2. The paper conducts an insightful analysis of the training dynamics and emergent reasoning behaviors of **DeepEye
1. Within the broader context of Agentic Reinforcement Learning (RL), this work can be regarded primarily as a **multimodal agent** equipped with a zoom-in tool, contributing limited methodological novelty. 2. The current implementation falls short of fully realizing the concept of **“Thinking with Images.”** This paradigm should encompass not only zoom-in operations but also more diverse capabilities, such as image editing, spatial manipulation, or compositional visual reasoning. 3. **Limited t
1. Combining textual and visual cues in the MLLM's intermediate reasoning trajectories is an interesting and important exploration, although OpenAI o3 has applied very similar methods to give MLLMs the ability to think with images. 2. The fully released pipeline, including code and data, benefits the community and is a good starting point for developing o3-like frameworks. 3. The overall pipeline is not complicated; it is simple but effective
1. Regarding the iterative MCoT steps, does this paper explore grounding multiple objects at the same time? 2. If the model needs more than two MCoT steps, it also needs more than two rounds of visual embedding extraction, which adds computational overhead. Can the authors also provide an inference latency analysis? 3. What if the model predicts all possible grounding objects' bounding boxes and extracts all visual embeddings, then concatenates all these new local visual tokens together,
- Originality: - Proposes an interleaved multimodal CoT (iMCoT) that natively integrates active perception into reasoning with end-to-end RL, avoiding pre-collected reasoning SFT and external specialized models/APIs (Abstract; Sections 1, 3.1). - Conditional reward that only bonuses correct tool-using trajectories effectively incentivizes perception-aware reasoning while discouraging gratuitous tool calls (Section 3.2; Table 5; Figure 4). - A targeted data curation pipeline that filters f
- Technical detail gaps: - Reward magnitudes/normalization, exact coefficients for accuracy/format/tool rewards, and sensitivity analyses are not fully specified (Section 3.2; Table 5 references but no hyperparameters), limiting reproducibility and insight into stability. - Limited description of how image crops are tokenized/encoded, how many visual tokens per crop, and how observation tokens are interleaved with text for different tools (Section 3.2 mentions loss masks but not encoder spec
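The conditional reward described in the reviews above, where the tool bonus is granted only to trajectories that are both correct and tool-using, can be sketched as follows. The reviews note that the exact coefficients are not fully specified, so the weights `w_acc`, `w_fmt`, and `w_tool`, the additive form, and the function name `trajectory_reward` are all illustrative assumptions, not the paper's values.

```python
def trajectory_reward(
    correct: bool,
    well_formatted: bool,
    used_tool: bool,
    w_acc: float = 1.0,   # hypothetical weight, not from the paper
    w_fmt: float = 0.5,   # hypothetical weight, not from the paper
    w_tool: float = 0.5,  # hypothetical weight, not from the paper
) -> float:
    """Accuracy and format terms plus a tool bonus gated on correctness."""
    reward = 0.0
    if correct:
        reward += w_acc
    if well_formatted:
        reward += w_fmt
    # Conditional bonus: tool use is rewarded only when the answer is correct,
    # so calling the tool on a wrong trajectory earns nothing extra.
    if correct and used_tool:
        reward += w_tool
    return reward


# Compare a correct tool-using trajectory with an incorrect one.
r_correct_tool = trajectory_reward(correct=True, well_formatted=True, used_tool=True)
r_wrong_tool = trajectory_reward(correct=False, well_formatted=True, used_tool=True)
```

Gating the bonus on correctness is what discourages gratuitous tool calls: an unconditional bonus could be collected by calling the tool regardless of whether it helps the answer.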
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Explainable Artificial Intelligence (XAI)
Methods: Shrink and Fine-Tune
