Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models

Lu Wang (1), Zhuoran Jin (1), Yupu Hao (1), Yubo Chen (1), Kang Liu (1), Yulong Ao (2), Jun Zhao (1)

(1) The Key Laboratory of Cognition, Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
(2) Beijing Academy of Artificial Intelligence (BAAI), Beijing, China

arXiv:2603.11896 · cs.CV · March 13, 2026

Open Access

TL;DR

This paper introduces Think While Watching, a streaming video reasoning framework that maintains segment-level memory across turns, improving multi-turn reasoning over continuously arriving video in multimodal large language models.

Contribution

The paper proposes a memory-anchored streaming reasoning method, trained with a three-stage, stage-matched strategy, that enforces strict causality through a segment-level streaming causal mask and streaming positional encoding.

Findings

01 Improves single-round accuracy by 2.6% on StreamingBench.

02 Achieves 3.79% higher accuracy on OVO-Bench.

03 Reduces output tokens by 56% in multi-round settings.

Abstract

Multimodal large language models (MLLMs) have shown strong performance on offline video understanding, but most are limited to offline inference or have weak online reasoning, making multi-turn interaction over continuously arriving video streams difficult. Existing streaming methods typically use an interleaved perception-generation paradigm, which prevents concurrent perception and generation and leads to early memory decay as streams grow, hurting long-range dependency modeling. We propose Think While Watching, a memory-anchored streaming video reasoning framework that preserves continuous segment-level memory during multi-turn interaction. We build a three-stage, multi-round chain-of-thought dataset and adopt a stage-matched training strategy, while enforcing strict causality through a segment-level streaming causal mask and streaming positional encoding. During inference, we…
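To make the idea of a segment-level causal mask concrete, here is a minimal Python sketch. It is not taken from the paper: the function name `segment_causal_mask`, the `segment_ids` input, and the attention rule itself (a token may attend to every token in its own segment and in any earlier segment, never to a later one) are assumptions chosen for illustration. The authors' actual mask may differ, for example by also enforcing token-level causality inside a segment.

```python
from typing import List


def segment_causal_mask(segment_ids: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed rule (illustrative, not taken from the paper): a token sees
    every token in its own segment and in any earlier segment, and never
    sees a token from a later segment.
    """
    # Segments arrive in stream order, so ids must never decrease.
    if any(b < a for a, b in zip(segment_ids, segment_ids[1:])):
        raise ValueError("segment_ids must be non-decreasing in stream order")
    return [[key_seg <= query_seg for key_seg in segment_ids]
            for query_seg in segment_ids]


# Hypothetical stream: two tokens in segment 0, two in segment 1, one in segment 2.
mask = segment_causal_mask([0, 0, 1, 1, 2])
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

Each printed row is one query token and each column one key token. Under the assumed rule, the allowed region grows in blocks as new segments arrive, so earlier segments stay visible to later queries while no token can look ahead in the stream.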

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning