TL;DR
This paper introduces ADEM, a learned evaluation model that predicts human-like scores for dialogue responses. Its predictions correlate with human judgments at a much higher level than word-overlap metrics such as BLEU, and it generalizes to unseen dialogue models.
Contribution
The paper presents ADEM, a novel learned evaluation model trained on human scores, improving correlation with human judgments and generalizing across dialogue models.
Findings
ADEM's predictions correlate significantly with human judgments at both the utterance and system level.
ADEM outperforms word-overlap metrics like BLEU.
ADEM generalizes to unseen dialogue models.
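To make the utterance-level versus system-level distinction in the findings above concrete, here is a minimal sketch of how the two correlations could be computed for any automatic metric. It is not the authors' evaluation code. The choice of Pearson correlation, the record layout (system name, metric score, human score), the function names, and the toy numbers are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def utterance_level(records):
    """Correlate metric and human scores over every individual response."""
    metric = [m for _, m, _ in records]
    human = [h for _, _, h in records]
    return pearson(metric, human)


def system_level(records):
    """Average scores per dialogue system first, then correlate the averages."""
    by_system = defaultdict(list)
    for system, m, h in records:
        by_system[system].append((m, h))
    metric_means = [mean(m for m, _ in v) for v in by_system.values()]
    human_means = [mean(h for _, h in v) for v in by_system.values()]
    return pearson(metric_means, human_means)


# Hypothetical toy data: (system, metric score, human score) per response.
records = [
    ("model_a", 1.0, 2.0), ("model_a", 2.0, 1.0),
    ("model_b", 3.0, 3.0), ("model_b", 4.0, 5.0),
    ("model_c", 4.0, 4.0), ("model_c", 5.0, 5.0),
]

print("utterance-level r:", utterance_level(records))
print("system-level r:", system_level(records))
```

Averaging per system before correlating smooths out per-response noise, so the two numbers can differ substantially for the same metric. That is one reason to look at both levels.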
Abstract
Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality. Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores to input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system-level. We also show that ADEM can…
