Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

Shijian Wang; Jiarui Jin; Xingjian Wang; Linxin Song; Runhao Fu; Hecheng Wang; Zongyuan Ge; Yuan Lu; Xuelian Cheng

arXiv:2510.23473·cs.CV·October 28, 2025

TL;DR

Video-Thinker introduces a novel approach enabling multimodal large language models to perform autonomous video reasoning by leveraging intrinsic grounding and captioning capabilities, leading to significant performance improvements.

Contribution

The paper presents Video-Thinker, a new framework with a dedicated dataset and training strategy that advances video reasoning in large language models without external tools.

Findings

01 Achieves state-of-the-art results on multiple video reasoning benchmarks.

02 Demonstrates effective autonomous grounding and captioning in video reasoning.

03 Outperforms existing models such as Video-R1 at the 7B-parameter scale.

Abstract

Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously…
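The abstract names Group Relative Policy Optimization (GRPO) as the second training stage. As a rough illustration of the "group relative" idea only, and not the authors' implementation, the sketch below normalizes the rewards of several sampled answers to the same question against their own group mean and standard deviation. The function name, the reward values, the choice of population standard deviation, and the `eps` term are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples in its group.

    Hypothetical helper: `eps` and the use of the population standard
    deviation are assumptions of this sketch, not details from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Samples above the group mean get a positive advantage, those below
    # get a negative one; eps avoids division by zero for uniform groups.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four sampled answers to one video question
# (1.0 = correct, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

In this setup a group whose answers all receive the same reward yields zero advantages, so it contributes no learning signal for that question.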

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 3

Strengths

The paper is clearly written, and the specific prompt design for the dataset construction process is also well-explained. The chain-of-thought (CoT) data annotation for video reasoning represents a notable contribution. The phenomena observed during the chain-of-thought training process provide valuable insights.

Weaknesses

The paper's technical contribution is limited. The CoT annotations for video labeling primarily rely on the capabilities of the DeepSeek and Gemini models. The training process of Video-Thinker-7B lacks contribution, as it mainly adopts the conventional approach of SFT+GRPO.

Reviewer 02·Rating 4·Confidence 4

Strengths

1. The paper provides an extensive comparison with several contemporaneous approaches such as Video-R1, Temporal-R1, and Time-R1. This helps readers clearly understand the distinctions and advantages of the proposed method under a similar technical framework (GRPO), enhancing the paper’s contextual clarity. 2. The authors propose a new dataset tailored for video understanding and reasoning tasks, which effectively improves the efficiency and stability of reinforcement learning (RL) training. Th

Weaknesses

1. Although the paper compares with several GRPO-based methods, the baselines are relatively narrow in scope. Including more competitive and diverse video understanding models would strengthen the claim of GRPO’s effectiveness in video reasoning tasks. 2. The approach supplements reasoning traces using large language models, which may introduce hallucinations or inaccurate information. It remains unclear whether the textual reasoning genuinely contributes to more accurate or meaningful reasonin

Reviewer 03·Rating 4·Confidence 3

Strengths

The paper introduces a carefully created dataset of annotations yielding strong performance results on several video understanding benchmarks. The code is (or will be) made publicly available.

Weaknesses

The paper describes a data engineering approach to improving performance on a variety of video understanding benchmarks. While the performance appears to be strong overall, I do not find the paper particularly scientifically insightful or revealing. Specifically, I am not surprised that for the given set of video benchmark tasks (Video-Holmes, CG-Bench-Reasoning and VRBench), a careful selection of DeepSeek-R1-assisted and Gemini-assisted annotations on a careful selection of existing video data

Code & Models

Models

🤗
ShijianW01/Video-Thinker-7B
model·568 dl·♡ 3
