Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents

Yuanchen Bei; Tianxin Wei; Xuying Ning; Yanjun Zhao; Zhining Liu; Xiao Lin; Yada Zhu; Hendrik Hamann; Jingrui He; Hanghang Tong

arXiv:2601.03515 · cs.CL · January 8, 2026

TL;DR

Mem-Gallery introduces a comprehensive benchmark for evaluating multimodal long-term conversational memory in MLLM agents, addressing gaps in existing assessments by focusing on long-term, multi-session multimodal interactions.

Contribution

The paper presents Mem-Gallery, a new benchmark dataset and evaluation framework for assessing long-term multimodal memory in MLLM agents, highlighting key challenges and limitations.

Findings

01. Explicit multimodal memory retention is essential.
02. Current models struggle with memory reasoning.
03. Memory organization impacts long-term performance.

Abstract

Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional…

Code & Models

Datasets

Ethan-Bei/Mem-Gallery
dataset · 1.1k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems