WAT: Online Video Understanding Needs Watching Before Thinking
Zifan Han, Hongbo Sun, Jinglin Xu, Canhui Tang, Yulong Lei, Xuchong Zhang, Hongbin Sun, Zhongjiang He, Hao Sun

TL;DR
WAT is a two-stage framework for online video reasoning that separates query-independent watching from query-triggered thinking. It processes streaming video using a hierarchical memory and a context-aware retrieval mechanism, and is accompanied by a new dataset, WAT-85K.
Contribution
The paper introduces WAT, a novel two-stage online video reasoning framework with hierarchical memory and retrieval, and presents WAT-85K, a dataset for training and evaluating online video understanding models.
Findings
WAT achieves 77.7% accuracy on StreamingBench.
WAT outperforms existing open-source online Video LLMs.
WAT operates at real-time frame rates.
Abstract
Multimodal Large Language Models (MLLMs) have shown strong capabilities in image understanding, motivating recent efforts to extend them to video reasoning. However, existing Video LLMs struggle in online streaming scenarios, where long temporal context must be preserved under strict memory constraints. We propose WAT (Watching Before Thinking), a two-stage framework for online video reasoning. WAT separates processing into a query-independent watching stage and a query-triggered thinking stage. The watching stage builds a hierarchical memory system with a Short-Term Memory (STM) that buffers recent frames and a fixed-capacity Long-Term Memory (LTM) that maintains a diverse summary of historical content using a redundancy-aware eviction policy. In the thinking stage, a context-aware retrieval mechanism combines the query with the current STM context to retrieve relevant historical…
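The watching/thinking split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `HierarchicalMemory`, the "drop the newer frame of the most similar pair" eviction rule, and the averaging of the query with the mean STM feature are all assumptions chosen to make the idea concrete, since the summary above does not specify these details. Features are plain lists of floats standing in for frame embeddings.

```python
from collections import deque
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class HierarchicalMemory:
    """Hypothetical STM + fixed-capacity LTM; names and rules are illustrative."""

    def __init__(self, stm_size, ltm_capacity):
        if stm_size < 1 or ltm_capacity < 1:
            raise ValueError("stm_size and ltm_capacity must be at least 1")
        self.stm = deque()          # recent frames: (frame_id, feature)
        self.stm_size = stm_size
        self.ltm = []               # historical frames, bounded by ltm_capacity
        self.ltm_capacity = ltm_capacity

    def watch(self, frame_id, feature):
        """Query-independent stage: buffer the frame, spill old frames to LTM."""
        self.stm.append((frame_id, feature))
        if len(self.stm) > self.stm_size:
            self.ltm.append(self.stm.popleft())
            if len(self.ltm) > self.ltm_capacity:
                self._evict_most_redundant()

    def _evict_most_redundant(self):
        """Assumed policy: find the most similar LTM pair, drop its newer frame."""
        best_pair, best_sim = None, -2.0
        for i in range(len(self.ltm)):
            for j in range(i + 1, len(self.ltm)):
                sim = cosine(self.ltm[i][1], self.ltm[j][1])
                if sim > best_sim:
                    best_pair, best_sim = (i, j), sim
        del self.ltm[best_pair[1]]

    def think(self, query_feature, top_k):
        """Query-triggered stage: retrieve LTM frames using query + STM context."""
        if self.stm:
            dim = len(query_feature)
            context = [sum(f[d] for _, f in self.stm) / len(self.stm)
                       for d in range(dim)]
            # Assumed fusion: simple average of query and STM context.
            probe = [(q + c) / 2 for q, c in zip(query_feature, context)]
        else:
            probe = list(query_feature)
        ranked = sorted(self.ltm, key=lambda e: cosine(probe, e[1]), reverse=True)
        return [frame_id for frame_id, _ in ranked[:top_k]]
```

In this sketch, `watch` runs on every incoming frame regardless of any query, so the memory footprint stays bounded by `stm_size + ltm_capacity`, while `think` runs only when a question arrives. The pairwise eviction search is quadratic in the LTM size, which is acceptable for illustration but would likely need an incremental similarity structure in a real-time system.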
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
