ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons

Margaret Li; Jason Weston; Stephen Roller

arXiv:1909.03087 · cs.CL · September 10, 2019 · 79 cites

Open Access

TL;DR

This paper introduces ACUTE-EVAL, a dialogue evaluation method that uses optimized questions and multi-turn comparisons to make human judgments of dialogue systems more reliable and efficient.

Contribution

It proposes a human evaluation procedure in which judges compare two full dialogues pairwise and answer a focused question about one speaker in each. The procedure is designed to be robust across annotators and can also be applied in self-play setups.

Findings

01. Better correlation with human judgments

02. More robust and consistent evaluation results

03. Faster and cheaper testing process

Abstract

While dialogue remains an important end-goal of natural language research, the difficulty of evaluation is an oft-quoted reason why it remains troublesome to make real progress towards its solution. Evaluation difficulties are actually two-fold: not only do automatic metrics not correlate well with human judgments, but also human judgments themselves are in fact difficult to measure. The two most used human judgment tests, single-turn pairwise evaluation and multi-turn Likert scores, both have serious flaws as we discuss in this work. We instead provide a novel procedure involving comparing two full dialogues, where a human judge is asked to pay attention to only one speaker within each, and make a pairwise judgment. The questions themselves are optimized to maximize the robustness of judgments across different annotators, resulting in better tests. We also show how these tests work…
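The pairwise procedure described in the abstract yields, for each pair of dialogues, a single choice between two systems. A minimal sketch of how such choices could be aggregated is shown below, assuming each judgment is recorded as the label of the preferred system. The function names (`win_rate`, `binomial_two_sided_p`) and the sample data are hypothetical, and the exact binomial test is one common choice for this kind of data rather than a procedure confirmed by the text above.

```python
from math import comb


def win_rate(judgments, model="A"):
    """Fraction of pairwise judgments that preferred `model`."""
    wins = sum(1 for choice in judgments if choice == model)
    return wins / len(judgments)


def binomial_two_sided_p(wins, trials, p=0.5):
    """Exact two-sided binomial test against the null 'no preference'.

    Sums the probabilities of every outcome that is no more likely
    than the observed number of wins.
    """
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[wins]
    # Small relative tolerance so ties are not lost to rounding.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


# Invented data for illustration only: 10 judges compared a dialogue
# from system "A" with one from system "B" and picked the better speaker.
judgments = ["A"] * 8 + ["B"] * 2

rate = win_rate(judgments, model="A")
p_value = binomial_two_sided_p(judgments.count("A"), len(judgments))
print(f"win rate for A: {rate:.2f}, two-sided p-value: {p_value:.3f}")
```

With only ten judgments, even an 8-to-2 split does not reach the conventional 0.05 threshold, which illustrates why the number of annotations per comparison matters for pairwise tests of this kind.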

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques