VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management
Hongbo Jin, Qingyuan Wang, Wenhao Zhang, Yang Liu, Sijie Cheng

TL;DR
VideoMem introduces an adaptive memory management framework for ultra-long video understanding, enabling models to effectively retain critical information over long sequences and outperform existing methods on various benchmarks.
Contribution
The paper presents VideoMem, a novel approach that models long video understanding as a sequential generation task with dynamic memory updates, trained with a new algorithm, PRPO.
Findings
Outperforms existing models on multiple ultra-long video benchmarks
Efficiently retains critical long-term information with adaptive memory
Accelerates training convergence through novel reward strategies
Abstract
Ultra-long video understanding remains an open challenge, as existing vision-language models (VLMs) falter on such content due to limited context length and inefficient long-term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval-augmented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that pioneers modeling long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-term tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped…
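The memory idea described in the abstract (a bounded global buffer that is updated clip by clip, keeping important content and dropping redundant content) can be illustrated with a small sketch. This is not the authors' implementation: the names `MemoryEntry`, `GlobalMemory`, and `summarize_video`, the scalar `importance` score, and the exact-match redundancy check are all assumptions made for illustration. In VideoMem the memory update is produced by the VLM itself rather than by a hand-written rule.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MemoryEntry:
    clip_index: int     # position of the clip on the video timeline
    summary: str        # short description of the clip (hypothetical format)
    importance: float   # hypothetical relevance score, higher means keep


@dataclass
class GlobalMemory:
    capacity: int                                   # maximum number of entries
    entries: List[MemoryEntry] = field(default_factory=list)

    def update(self, new_entry: MemoryEntry) -> None:
        # Crude stand-in for redundancy detection: skip exact repeats.
        if any(e.summary == new_entry.summary for e in self.entries):
            return
        self.entries.append(new_entry)
        if len(self.entries) > self.capacity:
            # Over budget: evict the least important entry.
            weakest = min(self.entries, key=lambda e: e.importance)
            self.entries.remove(weakest)
        # Keep the surviving entries in timeline order.
        self.entries.sort(key=lambda e: e.clip_index)


def summarize_video(clips: List[Tuple[str, float]], capacity: int) -> GlobalMemory:
    """Process clips sequentially, updating the memory after each one."""
    memory = GlobalMemory(capacity=capacity)
    for index, (summary, importance) in enumerate(clips):
        memory.update(MemoryEntry(index, summary, importance))
    return memory


if __name__ == "__main__":
    clips = [
        ("intro", 0.2),
        ("goal scored", 0.9),
        ("intro", 0.2),       # repeated content, skipped
        ("crowd", 0.1),
        ("red card", 0.8),
    ]
    memory = summarize_video(clips, capacity=2)
    print([e.summary for e in memory.entries])
```

The point of the sketch is the control flow: the video is consumed one clip at a time, and the memory never grows beyond a fixed budget regardless of video length, which is what avoids the storage cost of an external knowledge base.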
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
