ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons
Margaret Li, Jason Weston, Stephen Roller

TL;DR
This paper introduces ACUTE-EVAL, a dialogue evaluation method that uses optimized questions and multi-turn pairwise comparisons to make human judgments of dialogue systems more reliable and efficient.
Contribution
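It proposes a human evaluation procedure in which a judge compares two full dialogues, attends to one speaker in each, and answers a focused pairwise question. The question wording is optimized for robustness across annotators, and the procedure can also be applied in self-play setups.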
Findings
Better correlation with human judgments
More robust and consistent evaluation results
Faster and cheaper testing process
Abstract
While dialogue remains an important end-goal of natural language research, the difficulty of evaluation is an oft-quoted reason why it remains troublesome to make real progress towards its solution. Evaluation difficulties are actually two-fold: not only do automatic metrics not correlate well with human judgments, but also human judgments themselves are in fact difficult to measure. The two most used human judgment tests, single-turn pairwise evaluation and multi-turn Likert scores, both have serious flaws as we discuss in this work. We instead provide a novel procedure involving comparing two full dialogues, where a human judge is asked to pay attention to only one speaker within each, and make a pairwise judgment. The questions themselves are optimized to maximize the robustness of judgments across different annotators, resulting in better tests. We also show how these tests work…
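The pairwise judgments described in the abstract can be aggregated in many ways. The following is a minimal sketch, assuming each judge's answer is recorded as the label of the preferred model; the function names and the choice of an exact two-sided binomial test against chance are illustrative assumptions, not details confirmed by this summary.

```python
from math import comb


def win_rate(votes, model):
    """Fraction of pairwise judgments that preferred `model`.

    `votes` is a hypothetical list of labels, one per judged dialogue pair.
    """
    return sum(v == model for v in votes) / len(votes)


def binomial_two_sided_p(wins, n):
    """Exact two-sided binomial test against a 50/50 null.

    Under the null the distribution is symmetric about n / 2, so the
    two-sided p-value is twice the upper tail starting at the larger
    of the two win counts, capped at 1.
    """
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumed toy data: 10 judged pairs, model "A" preferred 8 times.
votes = ["A"] * 8 + ["B"] * 2
rate = win_rate(votes, "A")
p_value = binomial_two_sided_p(votes.count("A"), len(votes))
print(f"win rate for A: {rate:.2f}, two-sided p = {p_value:.4f}")
```

With so few judgments an 8 to 2 split is not significant at the 0.05 level, which illustrates why the number of annotations needed per comparison matters for the cost of a test.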
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques
