TL;DR
MEMLENS is a benchmark for evaluating multimodal long-term memory in large vision-language models (LVLMs). It compares long-context LVLMs with memory-augmented agents, highlights the strengths and limitations of each, and suggests hybrid solutions.
Contribution
We present MEMLENS, a new benchmark for multimodal memory in multi-session conversations, and evaluate 27 LVLMs and 7 memory-augmented agents on it.
Findings
Long-context LVLMs are accurate on short conversations, but their accuracy declines as conversations grow longer.
Memory-augmented agents remain stable as conversation length grows but lose visual fidelity over time.
Most systems struggle with multi-session reasoning, achieving below 30% accuracy.
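Trends like these are typically read off per-group accuracy tables. The following is a minimal sketch of that aggregation, assuming each evaluated question is stored as a record with a context length, a memory ability, and a correctness flag. The record layout, the field names, and the helper `accuracy_by` are hypothetical and are not taken from the MEMLENS release; the toy records are made up for illustration and are not benchmark results.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Group records by `key` and return {group: fraction answered correctly}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical toy records (illustrative only, not MEMLENS data).
records = [
    {"context_len": "32K", "ability": "information extraction", "correct": True},
    {"context_len": "32K", "ability": "multi-session reasoning", "correct": True},
    {"context_len": "256K", "ability": "information extraction", "correct": True},
    {"context_len": "256K", "ability": "multi-session reasoning", "correct": False},
]

by_length = accuracy_by(records, "context_len")
by_ability = accuracy_by(records, "ability")
```

Grouping by context length shows whether a system degrades on longer conversations, while grouping by ability isolates weak spots such as multi-session reasoning.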
Abstract
Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose…
