VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

Shuming Liu; Mingchen Zhuge; Changsheng Zhao; Jun Chen; Lemeng Wu; Zechun Liu; Chenchen Zhu; Zhipeng Cai; Chong Zhou; Haozhe Liu; Ernie Chang; Saksham Suri; Hongyu Xu; Qi Qian; Wei Wen; Balakrishnan Varadarajan; Zhuang Liu; Hu Xu; Florian Bordes; Raghuraman Krishnamoorthi; Bernard Ghanem; Vikas Chandra; Yunyang Xiong

arXiv:2601.05175·cs.CV·March 24, 2026

VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi

PDF

Open Access

TL;DR

VideoAuto-R1 introduces a selective reasoning approach for video understanding that improves efficiency and accuracy by reasoning only when necessary, reducing computational costs compared to traditional chain-of-thought methods.

Contribution

The paper proposes a novel reason-when-necessary framework that adaptively applies reasoning, achieving state-of-the-art results with fewer tokens and demonstrating when explicit reasoning is beneficial.

Findings

01

State-of-the-art accuracy on video QA and grounding benchmarks.

02

Reduces response length by approximately 3.3 times.

03

Lower reasoning activation on perception tasks, higher on reasoning tasks.

Abstract

Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis