Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
Tao Chen, Kun Zhang, Qiong Wu, Xiao Chen, Chao Chang, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji

TL;DR
This paper introduces FlexMem, a training-free visual memory mechanism that enables multimodal large language models to understand arbitrarily long videos by mimicking human memory recall, significantly improving efficiency and performance.
Contribution
The paper proposes FlexMem, a novel visual memory mechanism that allows MLLMs to process infinite-length videos without training, enhancing understanding and performance on long video tasks.
Findings
FlexMem enables processing over 1,000 frames on a single GPU.
It improves performance of existing MLLMs on long video benchmarks.
FlexMem achieves results comparable to or better than those of SOTA models like GPT-4o and Gemini-1.5 Pro.
Abstract
Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of the visual memory mechanism and propose a novel, training-free approach termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic the human behavior of video watching, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs achieve video understanding of infinite lengths, unlike previous methods that process all video information at once and have an upper limit on input length. Concretely, FlexMem first considers the visual KV caches as the memory sources, and realizes effective memory transfer and writing via a dual-pathway compression design. Afterwards, FlexMem also explores different…
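The watch-then-recall idea described in the abstract can be illustrated with a small toy sketch. This is not FlexMem's implementation: the names `ToyMemory`, `MemoryFragment`, `write`, and `recall` are hypothetical, mean pooling stands in for the dual-pathway compression (whose details are not given here), and plain feature vectors stand in for visual KV caches. It only shows the general pattern of writing one compressed fragment per video chunk and later recalling the fragments most similar to a question.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryFragment:
    index: int      # temporal position of the chunk in the video
    summary: list   # hypothetical compressed representation of the chunk


def mean_pool(vectors):
    """Average equal-length vectors (an assumed stand-in for real compression)."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ToyMemory:
    fragments: list = field(default_factory=list)

    def write(self, chunk_features):
        """Compress one chunk of per-frame features into a single fragment."""
        summary = mean_pool(chunk_features)
        self.fragments.append(MemoryFragment(len(self.fragments), summary))

    def recall(self, query, k):
        """Return the k fragments most similar to the query, in temporal order."""
        ranked = sorted(self.fragments,
                        key=lambda f: cosine(query, f.summary),
                        reverse=True)
        return sorted(ranked[:k], key=lambda f: f.index)


# Watch the video chunk by chunk, writing one fragment per chunk.
memory = ToyMemory()
memory.write([[1.0, 0.0], [0.9, 0.1]])
memory.write([[0.0, 1.0], [0.1, 0.9]])
memory.write([[0.7, 0.7], [0.6, 0.8]])

# At question time, recall only the most relevant fragments.
recalled = [f.index for f in memory.recall([0.0, 1.0], k=2)]
```

Because only a fixed number of fragments is recalled per question, the amount of context handed to the model in this toy setup does not grow with video length, which is the property the abstract attributes to a memory-based design.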
