WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs
Yulin Zhang, Cheng Shi, Sibei Yang

TL;DR
WeaveTime enhances Video-LLMs for streaming video analysis by teaching models to understand temporal order and focus dynamically on relevant past information, improving accuracy and efficiency in online scenarios.
Contribution
WeaveTime introduces a model-agnostic framework for streaming Video-LLMs that pairs a lightweight training objective with a dynamic focus mechanism, addressing the core limitation of Time-Agnosticism.
Findings
Improves accuracy on streaming video benchmarks
Reduces latency in video processing
Enhances temporal reasoning in Video-LLMs
Abstract
Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness, in which it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction…
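As a toy illustration of the Time-Agnosticism diagnosis in the abstract (and not of the WeaveTime method itself, whose details are truncated here), the sketch below shows how an order-insensitive summary cannot tell a video from a shuffled copy of it, and what the streaming constraint "future observations are inaccessible" looks like as data access. The function names `mean_pool` and `causal_prefixes` and the two-dimensional frame features are hypothetical, chosen only for this example.

```python
def mean_pool(frames):
    """Order-insensitive summary: average each feature dimension over all frames."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def causal_prefixes(frames):
    """Yield, for each step t, only the frames observed up to and including t."""
    for t in range(len(frames)):
        yield frames[: t + 1]


# Three made-up frame feature vectors in chronological order (assumed values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# The same frames in a different temporal order.
shuffled = [frames[2], frames[0], frames[1]]

# A "bag of evidence" summary is identical for both orderings,
# so any question about what happened first cannot be answered from it.
same_summary = mean_pool(frames) == mean_pool(shuffled)

# In a stream, the model at step t may only condition on frames[0..t].
prefix_lengths = [len(prefix) for prefix in causal_prefixes(frames)]
```

Here `same_summary` is true even though `frames` and `shuffled` differ as sequences, which is the sense in which a bag-style representation is ambiguous about temporal order.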
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
