How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation
Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, Joelle Pineau

TL;DR
This paper critically examines the effectiveness of unsupervised automatic evaluation metrics for dialogue response systems, revealing their weak correlation with human judgments across different domains and highlighting the need for improved metrics.
Contribution
The study provides a comprehensive analysis of existing metrics' weaknesses and offers recommendations for developing more reliable automatic evaluation methods for dialogue systems.
Findings
Metrics correlate only very weakly with human judgments in the non-technical Twitter domain
Metrics do not correlate at all with human judgments in the technical Ubuntu domain
Quantitative and qualitative results identify specific weaknesses in current evaluation metrics
Abstract
We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
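The kind of analysis the abstract describes can be illustrated with a minimal sketch: score each generated response against a single target response using a word-overlap metric, then check how well those scores track human ratings. Everything below is an assumption made for illustration and is not the authors' code: the clipped unigram precision stands in for the machine-translation-style metrics, the function names are hypothetical, and the three example responses and their human ratings are invented toy data, not results from the paper.

```python
from collections import Counter
from math import sqrt


def unigram_precision(candidate, reference):
    """Clipped unigram precision of a candidate against one reference.

    A simplified stand-in for word-overlap metrics; no brevity penalty.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate word is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return overlap / len(cand_tokens)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    # Invented toy data: (generated response, single target response, human rating 1-5).
    toy_examples = [
        ("try restarting the network service", "restart the network manager", 4),
        ("i do not know", "run sudo apt-get update first", 1),
        ("update your package list and retry", "run sudo apt-get update first", 5),
    ]
    metric_scores = [unigram_precision(gen, ref) for gen, ref, _ in toy_examples]
    human_scores = [rating for _, _, rating in toy_examples]
    print("metric scores:", metric_scores)
    print("Spearman:", spearman(metric_scores, human_scores))
    print("Pearson:", pearson(metric_scores, human_scores))
```

The third toy example shows why a single target response is limiting: a response can be rated highly by a human while sharing almost no words with the reference, so the overlap score stays low.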
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques
