TL;DR
This paper systematically compares 23 dialog evaluation metrics across 10 datasets, analyzing how well they work at the turn level and the dialog level and in different settings, to guide future research in dialog system assessment.
Contribution
It provides the first comprehensive, multi-faceted comparison of recent dialog evaluation metrics, highlighting their strengths, weaknesses, and best practices for assessment.
Findings
Metrics vary significantly in performance across datasets.
Combining metrics can improve evaluation reliability.
Some metrics perform better at the turn level, others at the dialog level.
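These findings all rest on how closely a metric's scores track human judgments. The sketch below shows one common way to measure that, and how combining metrics can change the result. It is illustrative only: the choice of Spearman rank correlation, the z-score averaging used to combine metrics, and all names and numbers (`human`, `metric_a`, `metric_b`) are assumptions made for this example, not the paper's data or procedure.

```python
from statistics import mean, pstdev


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def combine(*metrics):
    """Average the z-normalised scores of several metrics, item by item."""
    normed = []
    for scores in metrics:
        m, s = mean(scores), pstdev(scores)
        normed.append([(v - m) / s for v in scores])
    return [mean(column) for column in zip(*normed)]


# Hypothetical toy scores for five responses (not taken from the paper).
human = [1, 2, 3, 4, 5]
metric_a = [0.2, 0.1, 0.5, 0.4, 0.9]
metric_b = [0.3, 0.4, 0.2, 0.8, 0.7]

print("metric_a:", round(spearman(metric_a, human), 3))
print("metric_b:", round(spearman(metric_b, human), 3))
print("combined:", round(spearman(combine(metric_a, metric_b), human), 3))
```

On this toy data the combined score ranks the responses in the same order as the human ratings, while each metric alone does not. The same functions apply at either granularity: pass per-turn scores for a turn-level correlation, or per-dialog scores for a dialog-level one.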
Abstract
Automatic evaluation metrics are a crucial component of dialog systems research. Standard language evaluation metrics are known to be ineffective for evaluating dialog. As such, recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgements. Due to the fast pace of research, many of these metrics have been assessed on different datasets and there has as yet been no time for a systematic comparison between them. To this end, this paper provides a comprehensive assessment of recently proposed dialog evaluation metrics on a number of datasets. In this paper, 23 different automatic evaluation metrics are evaluated on 10 different datasets. Furthermore, the metrics are assessed in different settings, to better qualify their respective strengths and weaknesses. Metrics are assessed (1) on both the turn level and the dialog level, (2) for…
