Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

Yiran Guan; Liang Yin; Dingkang Liang; Jianzhong Ju; Zhenbo Luo; Jian Luan; Yuliang Liu; Xiang Bai

arXiv:2603.12262·cs.CV·March 13, 2026

Open Access

TL;DR

This paper introduces Video Streaming Thinking (VST), a paradigm that lets VideoLLMs reason while a video is still streaming. A dedicated post-training pipeline and a thinking-while-watching mechanism balance real-time responsiveness against coherent understanding.

Contribution

The paper proposes VST, a paradigm in which a VideoLLM reasons over incoming video clips as they arrive rather than only after a query, together with a post-training pipeline (VST-SFT and VST-RL) aimed at improving streaming video understanding.

Findings

1. VST-7B achieves 79.5% on StreamingBench.

2. VST responds 15.7 times faster than Video-R1.

3. VST improves VideoHolmes accuracy by 5.4%.

Abstract

Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking-while-watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which…
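The phrase "amortizing LLM reasoning latency over video playback" can be made concrete with a toy timing model. The sketch below is not the paper's implementation and does not reproduce its measurements. The function names, the per-clip timing parameters, and the assumption that the query arrives right after the last clip are all hypothetical, chosen only to show why reasoning that overlaps playback shortens the wait after a query.

```python
def think_after_query_latency(num_clips, think_s, answer_s):
    """Wait time when all reasoning starts only after the query arrives.

    Hypothetical model: every clip costs think_s seconds of reasoning,
    followed by answer_s seconds to produce the final answer.
    """
    return num_clips * think_s + answer_s


def think_while_watching_latency(num_clips, clip_s, think_s, answer_s):
    """Wait time when reasoning on each clip overlaps later playback.

    Hypothetical model: reasoning on clip i can start once the clip has
    finished playing and the previous reasoning step is done. The query
    is assumed to arrive immediately after the last clip.
    """
    finish = 0.0  # time at which the latest reasoning step completes
    for i in range(1, num_clips + 1):
        clip_end = i * clip_s
        finish = max(finish, clip_end) + think_s
    query_time = num_clips * clip_s
    # Only reasoning still pending at query time is felt by the user.
    return (finish - query_time) + answer_s


# Illustrative numbers only: ten 4-second clips, 2 s of reasoning per clip,
# 1 s to produce the answer.
offline = think_after_query_latency(10, 2.0, 1.0)
streaming = think_while_watching_latency(10, 4.0, 2.0, 1.0)
print(offline, streaming)
```

With these made-up numbers the think-after-query wait is 21.0 seconds and the thinking-while-watching wait is 3.0 seconds, because only the reasoning on the final clip remains once the query arrives. If per-clip reasoning takes longer than a clip lasts, a backlog builds up and the advantage shrinks, which is the trade-off the abstract alludes to.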

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning