Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Jingqi Tong; Yurong Mou; Hangcheng Li; Mingzhe Li; Yongzhuo Yang; Ming Zhang; Qiguang Chen; Tianyi Liang; Xiaomeng Hu; Yining Zheng; Xinchi Chen; Jun Zhao; Xuanjing Huang; Xipeng Qiu

arXiv:2511.04570 · cs.CV · April 8, 2026

TL;DR

This paper introduces "Thinking with Video", a paradigm that uses video generation models such as Sora-2 as a unified medium for multimodal reasoning, and evaluates it with the VideoThinkBench benchmark on both vision-centric and text-centric tasks.

Contribution

Proposes 'Thinking with Video' as a novel paradigm for multimodal reasoning using video models, and develops VideoThinkBench to evaluate this approach.

Findings

01 Sora-2 achieves SOTA performance on vision-centric tasks, surpassing GPT-5 on eyeballing puzzles.

02 Sora-2 attains 92% accuracy on MATH and 69.2% on MMMU.

03 Self-consistency and in-context learning enhance Sora-2's reasoning abilities.
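
Finding 03 names self-consistency without describing the procedure. In its usual form, self-consistency samples several independent generations for the same problem and keeps the most frequent final answer. The sketch below illustrates that majority vote under stated assumptions: the function name `majority_vote` is hypothetical, the answers are assumed to have already been extracted from each generated video as strings, and the paper's own extraction and aggregation steps are not shown on this page and may differ.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def majority_vote(answers: Sequence[Optional[str]]) -> Tuple[Optional[str], float]:
    """Return the most frequent answer and its share of the valid votes.

    `answers` holds one extracted final answer per sampled generation
    (an assumption of this sketch); None marks a sample where no answer
    could be extracted.
    """
    # Normalize so that "B" and "b " count as the same vote.
    valid = [a.strip().lower() for a in answers if a is not None and a.strip()]
    if not valid:
        return None, 0.0
    # Counter.most_common resolves ties in favor of the answer seen first.
    winner, count = Counter(valid).most_common(1)[0]
    return winner, count / len(valid)


if __name__ == "__main__":
    # Five hypothetical samples for one multiple-choice question.
    samples = ["B", "b ", None, "C", "B"]
    print(majority_vote(samples))  # ('b', 0.75)
```

The vote share returned alongside the winner gives a rough confidence signal: a low share means the samples disagreed.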

Abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning abilities of large language models (LLMs) and vision-language models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision into distinct modalities hinders unified multimodal understanding and generation. Therefore, we propose "Thinking with Video", a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench), which covers both vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., GSM8K and MMMU). Our evaluation on VideoThinkBench establishes Sora-2 as a…

Code & Models

Repositories

tongjingqi/Thinking-with-Video
GitHub

Datasets

OpenMOSS-Team/VideoThinkBench
dataset · 1.2k downloads
