MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Xiyu Ren; Zhaowei Wang; Yiming Du; Zhongwei Xie; Chi Liu; Xinlin Yang; Haoyue Feng; Wenjun Pan; Tianshi Zheng; Baixuan Xu; Zhengnan Li; Yangqiu Song; Ginny Wong; Simon See

arXiv:2605.14906·cs.CV·May 15, 2026

TL;DR

MemLens is a benchmark for multimodal long-term memory in large vision-language models. It compares long-context LVLMs with memory-augmented agents over multi-session conversations, and the complementary weaknesses of the two approaches point toward hybrid solutions.

Contribution

We present MEMLENS, a new benchmark for multimodal memory in multi-session conversations, and evaluate 27 LVLMs and 7 memory-augmented agents on it.

Findings

01. Long-context LVLMs excel in short-term accuracy but decline with longer conversations.

02. Memory-augmented agents maintain length stability but lose visual fidelity over time.

03. Most systems struggle with multi-session reasoning, achieving below 30% accuracy.

Abstract

Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose…
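The abstract organizes the benchmark along two axes: memory ability and context length. Results on a benchmark shaped like this are typically reported per (ability, context length) cell. The snippet below is a minimal sketch of that aggregation, not the authors' evaluation code. The record fields (`ability`, `context_tokens`, `correct`), the ability identifiers, and the function name `accuracy_by_cell` are all hypothetical, and the token values in the usage example are illustrative only.

```python
from collections import defaultdict

# Hypothetical identifiers for the five memory abilities named in the abstract.
ABILITIES = (
    "information_extraction",
    "multi_session_reasoning",
    "temporal_reasoning",
    "knowledge_update",
    "answer_refusal",
)


def accuracy_by_cell(records):
    """Return accuracy per (ability, context_tokens) cell.

    `records` is an iterable of dicts with the assumed keys
    'ability', 'context_tokens', and 'correct' (a bool).
    """
    # Each cell maps to [number correct, number seen].
    totals = defaultdict(lambda: [0, 0])
    for record in records:
        ability = record["ability"]
        if ability not in ABILITIES:
            raise ValueError(f"unknown ability: {ability!r}")
        cell = totals[(ability, record["context_tokens"])]
        cell[0] += int(record["correct"])
        cell[1] += 1
    return {key: correct / seen for key, (correct, seen) in totals.items()}


if __name__ == "__main__":
    # Illustrative records only; these are not results from the paper.
    sample = [
        {"ability": "temporal_reasoning", "context_tokens": 32_000, "correct": True},
        {"ability": "temporal_reasoning", "context_tokens": 32_000, "correct": False},
        {"ability": "answer_refusal", "context_tokens": 256_000, "correct": True},
    ]
    for cell, acc in sorted(accuracy_by_cell(sample).items()):
        print(cell, acc)
```

Keeping the two axes separate in the key makes it straightforward to read off both kinds of trend described in the findings: how a system changes as context length grows, and which abilities remain weak at every length.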

Code & Models

Repositories

xrenaf/MEMLENS (GitHub)