TL;DR
This study investigates how different experimental designs affect the consistency of human judgments in evaluating dialogue system outputs, highlighting continuous scales and task factors that improve reliability.
Contribution
It systematically compares several evaluation designs, including a novel application of Best-Worst scaling to dialogue evaluation, to identify which designs yield the most consistent human judgments of dialogue system output.
Findings
Continuous scales yield more consistent ratings than Likert scales or ranking-based designs.
Task completion time and prior experience improve rating consistency.
Best-Worst scaling shows promise for dialogue evaluation.
Abstract
To overcome the limitations of automated metrics (e.g. BLEU, METEOR) for evaluating dialogue systems, researchers typically use human judgments to provide convergent evidence. While it has been demonstrated that human judgments can suffer from the inconsistency of ratings, extant research has also found that the design of the evaluation task affects the consistency and quality of human judgments. We conduct a between-subjects study to understand the impact of four experiment conditions on human ratings of dialogue system output. In addition to discrete and continuous scale ratings, we also experiment with a novel application of Best-Worst scaling to dialogue evaluation. Through our systematic study with 40 crowdsourced workers in each task, we find that using continuous scales achieves more consistent ratings than Likert scale or ranking-based experiment design. Additionally, we find…
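For readers unfamiliar with Best-Worst scaling, the following is a minimal sketch of the common counting-based way to turn best/worst picks into per-item scores. The abstract does not say how the authors score their Best-Worst annotations, so the function name `best_worst_scores`, the input layout, and the scoring rule (times picked best minus times picked worst, divided by times shown) are assumptions made for illustration only.

```python
from collections import defaultdict


def best_worst_scores(annotations):
    """Score items from Best-Worst annotations (illustrative, not the paper's code).

    annotations: iterable of (items, best, worst) tuples, where `items` is the
    set of responses shown to an annotator and `best` / `worst` are the two
    responses the annotator picked from that set.
    Returns a dict mapping each item to a score in [-1, 1].
    """
    best_count = defaultdict(int)
    worst_count = defaultdict(int)
    shown_count = defaultdict(int)

    for items, best, worst in annotations:
        if best not in items or worst not in items:
            raise ValueError("best and worst must be among the shown items")
        if best == worst:
            raise ValueError("best and worst must be different items")
        for item in items:
            shown_count[item] += 1
        best_count[best] += 1
        worst_count[worst] += 1

    # Counting-based score: (#best - #worst) / #times shown.
    return {
        item: (best_count[item] - worst_count[item]) / shown_count[item]
        for item in shown_count
    }


if __name__ == "__main__":
    # Hypothetical judgments over four candidate responses A-D.
    responses = ("A", "B", "C", "D")
    judgments = [
        (responses, "A", "D"),
        (responses, "A", "C"),
        (responses, "B", "D"),
    ]
    for item, score in sorted(best_worst_scores(judgments).items()):
        print(f"{item}: {score:+.2f}")
```

Unlike a Likert rating, each judgment here only asks the annotator to compare responses against one another, which is why the resulting scores are relative to the set of items shown rather than absolute.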
