TL;DR
Video-Thinker enables multimodal large language models to reason over videos autonomously by using their intrinsic grounding and captioning capabilities to generate reasoning clues during inference, which improves performance on video reasoning benchmarks.
Contribution
The paper presents Video-Thinker, a new framework with a dedicated dataset and training strategy that advances video reasoning in large language models without external tools.
Findings
Achieves state-of-the-art results on multiple video reasoning benchmarks.
Demonstrates effective autonomous grounding and captioning in video reasoning.
At 7B parameters, outperforms existing models such as Video-R1.
Abstract
Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously…
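The GRPO stage mentioned in the abstract scores each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative normalization follows. It is not the authors' implementation: the function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population std is used; some implementations use sample std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled reasoning traces on one video question
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Edge case: if every response in the group earns the same reward,
# all advantages are zero and the group gives no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses that score above their group's mean receive a positive advantage and are reinforced; those below the mean are discouraged.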
Peer Reviews
Decision·Submitted to ICLR 2026
The paper is clearly written, and the specific prompt design for the dataset construction process is also well-explained. The chain-of-thought (CoT) data annotation for video reasoning represents a notable contribution. The phenomena observed during the chain-of-thought training process provide valuable insights.
The paper's technical contribution is limited. The CoT annotations for video labeling primarily rely on the capabilities of the DeepSeek and Gemini models. The training process of Video-Thinker-7B lacks contribution, as it mainly adopts the conventional approach of SFT+GRPO.
1. The paper provides an extensive comparison with several contemporaneous approaches such as Video-R1, Temporal-R1, and Time-R1. This helps readers clearly understand the distinctions and advantages of the proposed method under a similar technical framework (GRPO), enhancing the paper’s contextual clarity.
2. The authors propose a new dataset tailored for video understanding and reasoning tasks, which effectively improves the efficiency and stability of reinforcement learning (RL) training. Th
1. Although the paper compares with several GRPO-based methods, the baselines are relatively narrow in scope. Including more competitive and diverse video understanding models would strengthen the claim of GRPO’s effectiveness in video reasoning tasks.
2. The approach supplements reasoning traces using large language models, which may introduce hallucinations or inaccurate information. It remains unclear whether the textual reasoning genuinely contributes to more accurate or meaningful reasonin
The paper introduces a carefully created dataset of annotations yielding strong performance results on several video understanding benchmarks. The code is (or will be) made publicly available.
The paper describes a data engineering approach to improving performance on a variety of video understanding benchmarks. While the performance appears to be strong overall, I do not find the paper particularly scientifically insightful or revealing. Specifically, I am not surprised that for the given set of video benchmark tasks (Video-Holmes, CG-Bench-Reasoning and VRBench), a careful selection of DeepSeek-R1-assisted and Gemini-assisted annotations on a careful selection of existing video data
