Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement

Jan Deriu; Mark Cieliebak

arXiv:1909.12066·cs.AI·June 26, 2020

TL;DR

AutoJudge is an automated evaluation method for conversational dialogue systems that uses self-talk generated dialogues and human ratings to train a judgement model, showing good correlation with human assessments and potential for system improvement.

Contribution

The paper introduces AutoJudge, a novel automated evaluation metric for dialogue systems that leverages self-talk and human ratings, and explores its application for system re-ranking and reinforcement learning.

Findings

01

AutoJudge correlates well with human ratings.

02

AutoJudge effectively re-ranks candidate utterances.

03

AutoJudge cannot currently be used as a reward in reinforcement learning.
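
As a rough illustration of the first finding, agreement between an automated metric and human ratings can be quantified with a correlation coefficient. The sketch below computes the Pearson correlation in plain Python. This is an assumption: this summary does not say which correlation measure the authors use, and the ratings and scores are made-up placeholder values, not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder data (hypothetical, for illustration only):
# one human rating and one automated judge score per dialogue.
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
judge_scores = [0.1, 0.3, 0.2, 0.7, 0.9]

r = pearson(human_ratings, judge_scores)
print(f"Pearson r = {r:.3f}")
```

A value near 1 would indicate that the automated scores rise and fall with the human ratings; a rank-based measure such as Spearman correlation would be an equally plausible choice.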

Abstract

We present "AutoJudge", an automated evaluation method for conversational dialogue systems. The method works by first generating dialogues based on self-talk, i.e. a dialogue system talking to itself. Then, it uses human ratings of these dialogues to train an automated judgement model. Our experiments show that AutoJudge correlates well with the human ratings and can be used to automatically evaluate dialogue systems, even in deployed systems. In a second part, we attempt to apply AutoJudge to improve existing systems. This works well for re-ranking a set of candidate utterances. However, our experiments show that AutoJudge cannot be applied as a reward for reinforcement learning, although the metric can distinguish good from bad dialogues. We discuss potential reasons, but note already that this remains an open question for further research.
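
The re-ranking use case mentioned in the abstract can be sketched as follows: given a dialogue context and several candidate utterances, score each candidate with a judgement function and order the candidates from best to worst. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `rerank` and `toy_judge` are hypothetical, and `toy_judge` is a hand-written placeholder heuristic standing in for a trained judgement model.

```python
from typing import Callable, List, Sequence

# A judge maps (dialogue context, candidate utterance) to a quality score.
Judge = Callable[[Sequence[str], str], float]


def rerank(context: Sequence[str], candidates: Sequence[str], judge: Judge) -> List[str]:
    """Return the candidates sorted from highest to lowest judge score."""
    if not candidates:
        raise ValueError("need at least one candidate utterance")
    return sorted(candidates, key=lambda c: judge(context, c), reverse=True)


def toy_judge(context: Sequence[str], candidate: str) -> float:
    """Placeholder scorer (assumption): NOT a trained model.

    Penalises verbatim repetition of an earlier turn and otherwise
    counts the words shared with the most recent turn.
    """
    if candidate in context:
        return -1.0
    last_words = set(context[-1].lower().split()) if context else set()
    cand_words = set(candidate.lower().split())
    return float(len(last_words & cand_words))


context = ["hi there", "do you like music"]
candidates = ["do you like music", "i like jazz music", "ok"]

ranked = rerank(context, candidates, toy_judge)
print(ranked[0])  # the candidate the placeholder judge prefers
```

In practice the placeholder would be replaced by a model trained on human ratings of self-talk dialogues, with the rest of the re-ranking loop unchanged.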
