$\infty$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation
Saul Santos, Ant\'onio Farinhas, Daniel C. McNamee, Andr\'e F. T. Martins

TL;DR
$\infty$-Video introduces a training-free, continuous-time memory system that scales to arbitrarily long videos and improves long-video comprehension.
Contribution
The paper presents a novel continuous-time long-term memory mechanism that allows processing of unbounded videos efficiently without extra training, enhancing long-video understanding.
Findings
Improved performance on video question-answering tasks.
Efficient processing of arbitrarily long videos.
No additional training required for long-video comprehension.
Abstract
Current video-language models struggle with long-video understanding due to limited context lengths and reliance on sparse frame subsampling, often leading to information loss. This paper introduces $\infty$-Video, which can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers by allowing them to process unbounded video contexts efficiently and without requiring additional training. Through continuous attention, our approach dynamically allocates higher granularity to the most relevant video segments, forming "sticky" memories that evolve over time. Experiments with Video-LLaMA and VideoChat2 demonstrate improved performance in video question-answering tasks, showcasing the potential of continuous-time LTM mechanisms to enable scalable and training-free comprehension of long videos.
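To make the idea of a continuous-time memory with continuous attention more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that frame features are fit to a smooth signal over t in [0, 1] by ridge regression onto Gaussian radial basis functions, and that attention is a Gaussian density over time. The abstract does not specify the basis, the regression, or the density family, so these choices, along with every function and parameter name below, are hypothetical.

```python
import numpy as np

def rbf_basis(t, centers, width):
    """Evaluate N Gaussian radial basis functions at times t; returns (len(t), N)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))

def fit_continuous_memory(frames, centers, width, ridge=1e-3):
    """Ridge-regress frame features (L, d) onto the basis; returns coefficients (N, d).

    Assumption: frames are placed at evenly spaced times in [0, 1].
    """
    num_frames = frames.shape[0]
    times = np.linspace(0.0, 1.0, num_frames)
    F = rbf_basis(times, centers, width)               # (L, N)
    gram = F.T @ F + ridge * np.eye(len(centers))      # (N, N)
    return np.linalg.solve(gram, F.T @ frames)         # (N, d)

def gaussian_readout(coeffs, centers, width, mu, var):
    """Expected value of the memory signal under a Gaussian density N(mu, var).

    Uses the closed-form integral of a Gaussian density against a Gaussian
    basis function, taken over the whole real line (truncation to [0, 1]
    is ignored in this sketch).
    """
    total = var + width ** 2
    expected_basis = np.sqrt(width ** 2 / total) * np.exp(
        -((centers - mu) ** 2) / (2.0 * total)
    )
    return expected_basis @ coeffs                     # (d,)

# Hypothetical usage: 64 frames with 8-dimensional features, 16 basis functions.
rng = np.random.default_rng(0)
frames = rng.normal(size=(64, 8))
centers = np.linspace(0.0, 1.0, 16)
width = 0.05
coeffs = fit_continuous_memory(frames, centers, width)
context = gaussian_readout(coeffs, centers, width, mu=0.5, var=0.01)
```

In this sketch the memory has a fixed size (16 coefficient vectors) regardless of how many frames are fed in, which is the property that lets a continuous representation absorb an unbounded stream. A narrower attention density (smaller `var`) reads out a finer-grained slice of the signal.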
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Advanced Image Processing Techniques
