V-Thinker: Interactive Thinking with Images

Runqi Qiao; Qiuna Tan; Minghan Yang; Guanting Dong; Peiqing Yang; Shiqiang Lang; Enhui Wan; Xiaowan Wang; Yida Xu; Lan Yang; Chong Sun; Chen Li; Jing Lyu; Honggang Zhang

arXiv:2511.04460 · cs.CV · December 19, 2025

TL;DR

V-Thinker is a multimodal reasoning system trained with end-to-end reinforcement learning to enable interactive, vision-centric thinking with images in large multimodal models; it is evaluated on a new benchmark, VTBench.

Contribution

The paper introduces V-Thinker, a general-purpose multimodal reasoning assistant, together with a novel data synthesis method and a progressive training curriculum for interactive reasoning.

Findings

01. V-Thinker outperforms baseline models in reasoning tasks.

02. The Data Evolution Flywheel enhances dataset diversity and quality.

03. V-Thinker demonstrates strong performance on the VTBench benchmark.

Abstract

Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions: diversity,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Social Robot Interaction and HRI