Rethinking Evaluation in Retrieval-Augmented Personalized Dialogue: A Cognitive and Linguistic Perspective
Tianyi Zhang, David Traum

TL;DR
This paper critiques current evaluation metrics for retrieval-augmented personalized dialogue, highlighting their shortcomings and proposing a cognitively grounded approach aligned with human conversational principles.
Contribution
It re-examines the LAPDOG framework, identifying evaluation limitations and advocating for assessment methods rooted in cognitive and linguistic theories of dialogue.
Findings
Human and LLM judgments align closely
Lexical similarity metrics often diverge from human judgments
Current metrics fail to capture coherence and shared understanding
Abstract
In cognitive science and linguistic theory, dialogue is not seen as a chain of independent utterances but rather as a joint activity sustained by coherence, consistency, and shared understanding. However, many systems for open-domain and personalized dialogue use surface-level similarity metrics (e.g., BLEU, ROUGE, F1) as one of their main reporting measures, which fail to capture these deeper aspects of conversational quality. We re-examine a notable retrieval-augmented framework for personalized dialogue, LAPDOG, as a case study for evaluation methodology. Using both human and LLM-based judges, we identify limitations in current evaluation practices, including corrupted dialogue histories, contradictions between retrieved stories and persona, and incoherent response generation. Our results show that human and LLM judgments align closely but diverge from lexical similarity metrics,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Speech and dialogue systems · Intelligent Tutoring Systems and Adaptive Learning
