VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management

Hongbo Jin; Qingyuan Wang; Wenhao Zhang; Yang Liu; Sijie Cheng

arXiv:2512.04540·cs.CV·December 17, 2025

Open Access

TL;DR

VideoMem introduces an adaptive memory management framework for ultra-long video understanding, enabling models to effectively retain critical information over long sequences and outperform existing methods on various benchmarks.

Contribution

The paper presents VideoMem, a novel approach that models long video understanding as a sequential generation task with dynamic memory updates and a new training algorithm, PRPO.

Findings

01. Outperforms existing models on multiple ultra-long video benchmarks

02. Efficiently retains critical long-term information with adaptive memory

03. Accelerates training convergence through novel reward strategies

Abstract

Ultra-long video understanding remains an open challenge, as existing vision-language models (VLMs) falter on such content due to limited context length and inefficient long-term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval-augmented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that is the first to model long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-term tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped…
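
The memory update described in the abstract (retain critical information, discard redundant content as the video is processed in order) can be illustrated with a small toy sketch. This is not the paper's implementation: the abstract does not say how importance or redundancy is decided, so the `importance` score, the cosine-similarity redundancy test, the fixed `capacity`, and every name below (`MemoryEntry`, `MemoryBuffer`, `run_demo`) are hypothetical stand-ins chosen only for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    """One remembered clip. All fields are hypothetical, not from the paper."""
    clip_index: int
    summary: str
    embedding: list[float]
    importance: float  # assumed to be supplied by some upstream scorer


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class MemoryBuffer:
    """Fixed-capacity buffer updated once per incoming clip (toy heuristic)."""
    capacity: int = 3
    redundancy_threshold: float = 0.95
    entries: list[MemoryEntry] = field(default_factory=list)

    def update(self, entry: MemoryEntry) -> None:
        # Redundant content: a near-duplicate of a stored entry is either
        # dropped or replaces the stored entry if it scores as more important.
        for i, old in enumerate(self.entries):
            if cosine(old.embedding, entry.embedding) >= self.redundancy_threshold:
                if entry.importance > old.importance:
                    self.entries[i] = entry
                    self.entries.sort(key=lambda e: e.clip_index)
                return
        # Novel content: store it, then evict the least important entry
        # if the buffer has grown past its capacity.
        self.entries.append(entry)
        if len(self.entries) > self.capacity:
            weakest = min(self.entries, key=lambda e: e.importance)
            self.entries.remove(weakest)
        # Keep the buffer in video-timeline order.
        self.entries.sort(key=lambda e: e.clip_index)


def run_demo() -> list[int]:
    """Feed four made-up clips through a 2-slot buffer; return kept indices."""
    buffer = MemoryBuffer(capacity=2)
    clips = [
        MemoryEntry(0, "person enters kitchen", [1.0, 0.0], 0.9),
        MemoryEntry(1, "person still in kitchen", [0.99, 0.05], 0.2),
        MemoryEntry(2, "empty hallway", [0.0, 1.0], 0.1),
        MemoryEntry(3, "person picks up keys", [0.7, 0.7], 0.8),
    ]
    for clip in clips:
        buffer.update(clip)
    return [e.clip_index for e in buffer.entries]
```

In this toy run, clip 1 is dropped as a near-duplicate of clip 0, and clip 2 is evicted once clip 3 arrives because it has the lowest importance, so the buffer ends up holding clips 0 and 3. In VideoMem itself the retain-or-discard decision is presumably learned rather than set by a fixed threshold; the sketch only shows the general shape of a bounded, sequentially updated memory.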

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition