Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
Yiran Guan, Liang Yin, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai

TL;DR
This paper introduces Video Streaming Thinking (VST), a new paradigm enabling VideoLLMs to reason during streaming video, balancing real-time responsiveness with coherent understanding through a novel training pipeline and reasoning mechanism.
Contribution
The paper proposes VST, a paradigm in which a VideoLLM reasons over incoming clips while the video is still playing, together with a post-training pipeline (VST-SFT and VST-RL) that adapts an offline VideoLLM to causal streaming reasoning.
Findings
VST-7B achieves 79.5% on StreamingBench.
VST responds 15.7 times faster than Video-R1.
VST improves VideoHolmes accuracy by 5.4%.
Abstract
Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking while watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which…
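The idea of amortizing reasoning latency over video playback can be illustrated with a toy latency model. The sketch below is not the authors' implementation. The function name `response_latency`, its parameters, and all timing values are hypothetical, and it assumes a single serial reasoner, fixed-length clips, and a query that arrives right after the last clip.

```python
def response_latency(n_clips, clip_s, think_s, answer_s, streaming):
    """Seconds from query arrival (end of the last clip) to a finished answer.

    All arguments are illustrative assumptions, not values from the paper:
      n_clips  - number of clips seen before the query
      clip_s   - playback duration of one clip, in seconds
      think_s  - reasoning time spent per clip, in seconds
      answer_s - time to produce the final answer, in seconds
    """
    if not streaming:
        # Deferred reasoning: every clip is reasoned over after the query.
        return n_clips * think_s + answer_s

    # Streaming reasoning: clip i is fully available at time i * clip_s,
    # and the (serial) reasoner starts on it as soon as it is free.
    free_at = 0.0
    for i in range(1, n_clips + 1):
        arrival = i * clip_s
        free_at = max(free_at, arrival) + think_s

    query_time = n_clips * clip_s
    backlog = max(free_at - query_time, 0.0)  # reasoning still pending at query time
    return backlog + answer_s


deferred = response_latency(10, 2.0, 1.5, 0.5, streaming=False)
overlapped = response_latency(10, 2.0, 1.5, 0.5, streaming=True)
print(deferred, overlapped)
```

In this toy setting, when per-clip reasoning fits within one clip's playback time, only the last clip's reasoning remains pending at query time, so latency stays roughly constant as the video grows. With deferred reasoning it grows linearly with the number of clips.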
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
