Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images

Wenhao Yang; Yu Xia; Jinlong Huang; Shiyin Lu; Qing-Guo Chen; Zhao Xu; Weihua Luo; Kaifu Zhang; Yuanyu Wan; Lijun Zhang

arXiv:2512.17306 · cs.CV · January 8, 2026

TL;DR

This paper introduces DRIM, a multimodal reasoning model that enhances multi-turn visual reasoning by incorporating self-reflection and correction mechanisms, leading to improved performance on visual understanding tasks.

Contribution

The paper presents a three-stage training pipeline for deep but reliable multi-turn reasoning with images: data construction, cold-start supervised fine-tuning (SFT), and reinforcement learning (RL) with redundancy penalties.

Findings

01

DRIM outperforms existing models on visual reasoning benchmarks.

02

The model effectively self-reflects and corrects reasoning errors.

03

Redundancy-penalized policy optimization improves reasoning reliability.

Abstract

Recent large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves after pursuing incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT, and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling