MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation

Haochen Xue; Feilong Tang; Ming Hu; Yexin Liu; Qidong Huang; Yulong Li; Chengzhi Liu; Zhongxing Xu; Chong Zhang; Chun-Mei Feng; Yutong Xie; Imran Razzak; Zongyuan Ge; Jionglong Su; Junjun He; Yu Qiao

arXiv:2502.11903·cs.CL·March 11, 2025

TL;DR

This paper introduces MMRC, a comprehensive benchmark for evaluating multimodal large language models in real-world conversations, highlighting their limitations and proposing a note-taking strategy to improve performance.

Contribution

The paper presents MMRC, a large-scale real-world conversation benchmark, and proposes a note-taking method to enhance MLLMs' conversational abilities.

Findings

01. Evaluations of 20 MLLMs on MMRC show an accuracy drop during open-ended interactions.

02. Identified failure patterns include long-term memory degradation and error propagation.

03. The proposed note-taking strategy improves model performance.

Abstract

Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their abilities to memorize, recall, and reason in sustained interactions within real-world scenarios remain underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC indicate an accuracy drop during open-ended interactions. We identify four common failure patterns: long-term memory degradation,…

Taxonomy

Topics: Speech and dialogue systems