Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Ryan Lowe; Michael Noseworthy; Iulian V. Serban; Nicolas Angelard-Gontier; Yoshua Bengio; Joelle Pineau

arXiv:1708.07149·cs.CL·January 18, 2018

TL;DR

This paper introduces ADEM, an evaluation model trained to predict human-like scores for dialogue responses. Its predictions correlate with human judgements far better than word-overlap metrics such as BLEU, and it generalizes to dialogue models it did not see during training.

Contribution

The paper formulates automatic dialogue evaluation as a learning problem and presents ADEM, an evaluation model trained on a new dataset of human response scores.

Findings

01 ADEM's predictions correlate significantly with human judgements at both the utterance and system level.

02 ADEM correlates with human judgements at a much higher level than word-overlap metrics such as BLEU.

03 ADEM generalizes to unseen dialogue models.

Abstract

Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality. Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores to input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system-level. We also show that ADEM can…
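The abstract distinguishes correlation with human judgements at the utterance level from correlation at the system level. The sketch below illustrates one way that distinction could be computed. It is not taken from the paper: the Pearson coefficient is an assumed choice (the abstract does not name the coefficient used), the function names are hypothetical, and the scores are made-up toy values rather than results.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def utterance_level(records):
    """Correlate metric and human scores over every individual response."""
    return pearson([r[1] for r in records], [r[2] for r in records])


def system_level(records):
    """Average scores per dialogue system first, then correlate the averages."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for system, metric_score, human_score in records:
        totals[system][0] += metric_score
        totals[system][1] += human_score
        totals[system][2] += 1
    systems = sorted(totals)
    metric_means = [totals[s][0] / totals[s][2] for s in systems]
    human_means = [totals[s][1] / totals[s][2] for s in systems]
    return pearson(metric_means, human_means)


# Toy data (hypothetical): (system name, automatic metric score, human score).
records = [
    ("model_a", 2.1, 2), ("model_a", 3.0, 3), ("model_a", 1.5, 1),
    ("model_b", 3.8, 4), ("model_b", 4.2, 5), ("model_b", 3.1, 3),
    ("model_c", 2.9, 3), ("model_c", 2.2, 2), ("model_c", 3.5, 4),
]

if __name__ == "__main__":
    print("utterance-level r:", round(utterance_level(records), 3))
    print("system-level r:", round(system_level(records), 3))
```

Utterance-level correlation asks whether the metric ranks individual responses as humans do, while system-level correlation asks only whether it ranks whole models the same way. A metric can do well on the second while doing poorly on the first, which is why reporting both is informative.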

Code & Models

Repositories

mike-n-7/ADEM
Official
