MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation
Junqing He, Liang Zhu, Rui Wang, Xi Wang, Reza Haffari, Jiaxing Zhang

TL;DR
This paper introduces MADial-Bench, a comprehensive benchmark for evaluating memory-augmented dialogue systems, emphasizing diverse memory recall and human-like response qualities beyond traditional metrics.
Contribution
It creates a novel benchmark based on cognitive science, incorporating new evaluation criteria for memory recall, emotion support, and intimacy in dialogue systems.
Findings
Embedding models show potential for improvement.
Memory injection correlates with emotion support.
Large language models perform well on the benchmark.
Abstract
Long-term memory is important for chatbots and dialogue systems (DS) to create consistent and human-like conversations, evidenced by numerous developed memory-augmented DS (MADS). To evaluate the effectiveness of such MADS, existing commonly used evaluation metrics, like retrieval accuracy and perplexity (PPL), mainly focus on query-oriented factualness and language quality assessment. However, these metrics often lack practical value. Moreover, the evaluation dimensions are insufficient for human-like assessment in DS. Regarding memory-recalling paradigms, current evaluation schemes only consider passive memory retrieval while ignoring diverse memory recall with rich triggering factors, e.g., emotions and surroundings, which can be essential in emotional support scenarios. To bridge the gap, we construct a novel Memory-Augmented Dialogue Benchmark (MADial-Bench) covering various…
Taxonomy
Topics: Speech and dialogue systems · Context-Aware Activity Recognition Systems · Intelligent Tutoring Systems and Adaptive Learning
Methods: Focus
