Towards Best Experiment Design for Evaluating Dialogue System Output

Sashank Santhanam; Samira Shaikh

arXiv:1909.10122·cs.CL·September 24, 2019

TL;DR

This study examines how experiment design affects the consistency of human judgments of dialogue system output. Continuous rating scales produced more consistent ratings than Likert scales or ranking-based designs, and task-level factors such as completion time also influenced consistency.

Contribution

It systematically compares four experiment conditions, including discrete and continuous rating scales and a novel application of Best-Worst scaling to dialogue evaluation, to identify which designs yield the most consistent human judgments.

Findings

01

Continuous scales yield more consistent ratings than Likert-scale or ranking-based designs.

02

Time taken to complete the task and raters' prior experience with similar studies affect rating consistency.

03

Best-Worst scaling shows promise for dialogue evaluation.

Abstract

To overcome the limitations of automated metrics (e.g. BLEU, METEOR) for evaluating dialogue systems, researchers typically use human judgments to provide convergent evidence. While it has been demonstrated that human judgments can suffer from the inconsistency of ratings, extant research has also found that the design of the evaluation task affects the consistency and quality of human judgments. We conduct a between-subjects study to understand the impact of four experiment conditions on human ratings of dialogue system output. In addition to discrete and continuous scale ratings, we also experiment with a novel application of Best-Worst scaling to dialogue evaluation. Through our systematic study with 40 crowdsourced workers in each task, we find that using continuous scales achieves more consistent ratings than Likert scale or ranking-based experiment design. Additionally, we find…
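As a rough illustration of how Best-Worst scaling judgments can be turned into per-item scores, the sketch below uses the common counting approach: an item's score is the number of times it was chosen as best, minus the number of times it was chosen as worst, divided by the number of times it was shown. This is an assumption for illustration only; the abstract does not state which scoring procedure the authors used. The function name `best_worst_scores`, the tuple layout, and the response labels `r1` to `r4` are all hypothetical.

```python
from collections import defaultdict


def best_worst_scores(annotations):
    """Counting-based Best-Worst scaling scores (illustrative sketch).

    annotations: iterable of (items, best, worst) tuples, where `items`
    is the set of candidates shown to a rater and `best` / `worst` are
    the rater's picks from that set.
    Returns {item: (#best - #worst) / #appearances}, a value in [-1, 1].
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)
    for items, b, w in annotations:
        if b not in items or w not in items or b == w:
            raise ValueError("best and worst must be distinct members of items")
        for item in items:
            seen[item] += 1
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Hypothetical toy data: three raters judge the same four responses.
annotations = [
    (("r1", "r2", "r3", "r4"), "r1", "r4"),
    (("r1", "r2", "r3", "r4"), "r2", "r4"),
    (("r1", "r2", "r3", "r4"), "r1", "r3"),
]
scores = best_worst_scores(annotations)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting items by score gives a ranking that can be compared with rankings derived from Likert or continuous ratings of the same items.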

Code & Models

Repositories

sashank06/INLG_eval
Official
