ReWind: Understanding Long Videos with Instructed Learnable Memory
Anxhelo Diko, Tinghuai Wang, Wassim Swaileh, Shiyan Sun, Ioannis Patras

TL;DR
ReWind is a memory-based vision-language model that efficiently understands long videos by dynamically storing relevant information and selecting key frames for accurate question answering and temporal grounding.
Contribution
ReWind introduces a novel read-perceive-write memory cycle and adaptive frame selection, enabling efficient long video understanding with improved accuracy over prior methods.
Findings
+13% VQA score on MovieChat-1K
+12% accuracy in temporal grounding
Superior performance on long video benchmarks
Abstract
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information. However, existing VLMs struggle with long videos due to computational inefficiency, memory limitations, and difficulties in maintaining coherent understanding across extended sequences. To address these challenges, we introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. ReWind operates in a two-stage framework. In the first stage, ReWind maintains a dynamic learnable memory module with a novel read-perceive-write cycle that stores and updates instruction-relevant visual information as the video unfolds. This module utilizes learnable queries and cross-attentions between memory contents and the input stream, ensuring low memory requirements by scaling linearly with the number of…
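The read-perceive-write cycle described in the abstract can be loosely illustrated with a short NumPy sketch. This is not the authors' implementation: the function names (cross_attention, read_perceive_write), the residual read step, the append-only write, and all tensor sizes are assumptions made for illustration, and learned projections and instruction conditioning are omitted entirely.

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention with no learned projections
    (a simplification assumed for this sketch)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

def read_perceive_write(memory, queries, frame_tokens):
    """One hypothetical memory update for a new chunk of frame tokens."""
    # Read: the queries gather context from what is already stored.
    if len(memory) > 0:
        context = queries + cross_attention(queries, memory)
    else:
        context = queries
    # Perceive: the context-aware queries attend to the incoming tokens.
    perceived = cross_attention(context, frame_tokens)
    # Write: append the perceived summary to the memory (assumed policy).
    return np.concatenate([memory, perceived], axis=0)

# Toy run: 5 chunks of 32 tokens each, summarised by 4 queries of width 16.
rng = np.random.default_rng(0)
dim, num_queries = 16, 4
queries = rng.normal(size=(num_queries, dim))  # stands in for learnable queries
memory = np.zeros((0, dim))
for _ in range(5):
    frame_tokens = rng.normal(size=(32, dim))
    memory = read_perceive_write(memory, queries, frame_tokens)
print(memory.shape)
```

In this sketch the memory grows by a fixed number of rows (one per query) at each step, regardless of how many tokens each chunk contains.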
Taxonomy
Topics: Video Analysis and Summarization · Multimedia Communication and Technology · Online Learning and Analytics
