Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

Tao Chen; Kun Zhang; Qiong Wu; Xiao Chen; Chao Chang; Xiaoshuai Sun; Yiyi Zhou; Rongrong Ji

arXiv:2603.29252·cs.CV·April 1, 2026

TL;DR

This paper introduces FlexMem, a training-free visual memory mechanism that enables multimodal large language models to understand arbitrarily long videos by mimicking human memory recall, significantly improving efficiency and performance.

Contribution

The paper proposes FlexMem, a novel visual memory mechanism that allows MLLMs to process infinite-length videos without training, enhancing understanding and performance on long video tasks.

Findings

01

FlexMem enables processing over 1,000 frames on a single GPU.

02

It improves performance of existing MLLMs on long video benchmarks.

03

FlexMem achieves results comparable to or better than those of SOTA models such as GPT-4o and Gemini-1.5 Pro.

Abstract

Long video understanding is a key challenge that hinders the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of the visual memory mechanism and propose a novel, training-free approach termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic how humans watch videos, i.e., continually watching the video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs understand videos of unbounded length, unlike previous methods that process all video information at once and are therefore bound by an input-length limit. Concretely, FlexMem first treats the visual KV caches as the memory sources, and achieves effective memory transfer and writing via a dual-pathway compression design. Afterwards, FlexMem also explores different…
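The watch-then-recall idea described in the abstract can be illustrated with a small toy sketch. This is not the FlexMem algorithm: the abstract does not specify the dual-pathway compression or the recall strategy, so the sketch assumes plain mean pooling as the compression step, cosine similarity as the relevance score, and short feature vectors standing in for visual KV caches. All names (`ToyMemory`, `watch`, `recall`) are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(chunk: Sequence[Vector]) -> Vector:
    """Compress a chunk of frame features into one summary vector.

    Assumption: mean pooling is only a stand-in for whatever
    compression the paper actually uses.
    """
    dim = len(chunk[0])
    return [sum(frame[i] for frame in chunk) / len(chunk) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyMemory:
    """Hypothetical memory bank: one compressed fragment per video chunk."""

    def __init__(self) -> None:
        self.fragments: List[Vector] = []

    def write(self, chunk: Sequence[Vector]) -> None:
        """Compress a chunk and append it to memory."""
        self.fragments.append(mean_pool(chunk))

    def recall(self, query: Vector, k: int) -> List[int]:
        """Return indices of the k most query-relevant fragments, in temporal order."""
        ranked = sorted(
            range(len(self.fragments)),
            key=lambda i: cosine(query, self.fragments[i]),
            reverse=True,
        )
        return sorted(ranked[:k])


def watch(frames: Sequence[Vector], chunk_size: int, memory: ToyMemory) -> None:
    """Stream the video chunk by chunk so only one chunk is handled at a time."""
    for start in range(0, len(frames), chunk_size):
        memory.write(frames[start:start + chunk_size])


# Usage: four toy frames, two per chunk, then recall for a question vector.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
memory = ToyMemory()
watch(frames, chunk_size=2, memory=memory)
recalled = memory.recall([0.0, 1.0], k=1)
```

Because the memory grows by one fragment per chunk rather than one entry per frame, the per-step cost in this toy version depends on the chunk size, not the video length, which is the property the abstract attributes to a memory-based design.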
