ReWind: Understanding Long Videos with Instructed Learnable Memory

Anxhelo Diko; Tinghuai Wang; Wassim Swaileh; Shiyan Sun; Ioannis Patras

arXiv:2411.15556 · cs.CV · March 31, 2025

Open Access

TL;DR

ReWind is a memory-based vision-language model that efficiently understands long videos by dynamically storing relevant information and selecting key frames for accurate question answering and temporal grounding.

Contribution

ReWind introduces a novel read-perceive-write memory cycle and adaptive frame selection, enabling efficient long video understanding with improved accuracy over prior methods.

Findings

01 · +13% VQA score on MovieChat-1K
02 · +12% accuracy in temporal grounding
03 · Superior performance on long video benchmarks

Abstract

Vision-Language Models (VLMs) are crucial for applications that require an integrated understanding of textual and visual information. However, existing VLMs struggle with long videos because of computational inefficiency, memory limitations, and difficulty maintaining a coherent understanding across extended sequences. To address these challenges, we introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. ReWind operates in a two-stage framework. In the first stage, ReWind maintains a dynamic learnable memory module with a novel read-perceive-write cycle that stores and updates instruction-relevant visual information as the video unfolds. This module uses learnable queries and cross-attention between memory contents and the input stream, ensuring low memory requirements by scaling linearly with the number of…
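
The read-perceive-write cycle described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: it assumes single-head dot-product cross-attention with no learned projections, omits instruction conditioning, and uses hypothetical names (`cross_attend`, `read_perceive_write`). The choice to append a fixed number of vectors per chunk is an assumption of this sketch, made so that memory size grows linearly with the number of chunks processed.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Each query returns an attention-weighted average of keys_values.

    Assumption: keys and values are the same vectors (no projections).
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim)
            for kv in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values))
             for d in range(dim)]
        )
    return out


def read_perceive_write(memory, frame_tokens, queries):
    """One hypothetical memory update for a single chunk of frame tokens."""
    # Read: the queries gather context from what is already stored.
    if memory:
        read = cross_attend(queries, memory)
        state = [[q + r for q, r in zip(qv, rv)]
                 for qv, rv in zip(queries, read)]
    else:
        state = queries
    # Perceive: the context-aware queries attend to the incoming frames.
    perceived = cross_attend(state, frame_tokens)
    # Write: append a fixed number of vectors (one per query) to memory.
    return memory + perceived


if __name__ == "__main__":
    random.seed(0)
    dim, num_queries, tokens_per_chunk = 4, 2, 5
    queries = [[random.gauss(0, 1) for _ in range(dim)]
               for _ in range(num_queries)]
    memory = []
    for _ in range(3):  # three chunks of a toy "video"
        chunk = [[random.gauss(0, 1) for _ in range(dim)]
                 for _ in range(tokens_per_chunk)]
        memory = read_perceive_write(memory, chunk, queries)
    print(len(memory))  # 3 chunks x 2 queries = 6 stored vectors
```

In this toy version each chunk contributes `num_queries` vectors regardless of how many frame tokens it contains, which is what keeps the stored state small relative to the raw input stream.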

Taxonomy

Topics: Video Analysis and Summarization · Multimedia Communication and Technology · Online Learning and Analytics