Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation

Shikhar Sharma; Layla El Asri; Hannes Schulz; Jeremie Zumer

arXiv:1706.09799·cs.CL·June 30, 2017·183 cites

TL;DR

This paper investigates whether unsupervised metrics such as BLEU are more reliable for evaluating task-oriented dialogue responses than for non-task-oriented ones. It finds that they correlate more strongly with human judgment in this narrower, lower-diversity setting.

Contribution

The study shows empirically that automated metrics correlate more strongly with human judgments in task-oriented dialogue than in non-task-oriented dialogue, especially when multiple references are available, and it highlights the need for more challenging datasets.

Findings

01. Automated metrics correlate more strongly with human judgment in task-oriented dialogue than has been observed in non-task-oriented dialogue.

02. Metrics correlate better with human judgment when multiple ground-truth references are available.

03. Simple models can solve some existing task-oriented datasets, indicating a need for more challenging benchmarks.
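To make the first two findings concrete, here is a minimal sketch of how a word-overlap score can be computed against several references and then compared with human ratings through a rank correlation. It is not the authors' evaluation code: the clipped unigram precision below is a simplified stand-in for BLEU (no higher-order n-grams, no brevity penalty), the function names are hypothetical, and the sentences and ratings in the usage example are invented purely for illustration.

```python
from collections import Counter


def unigram_precision(candidate, references):
    """Clipped unigram precision of a candidate against one or more references.

    Simplified stand-in for BLEU: unigrams only, no brevity penalty.
    """
    cand_counts = Counter(candidate.lower().split())
    if not cand_counts:
        return 0.0
    # A token may be credited up to its maximum count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for token, count in Counter(ref.lower().split()).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    matched = sum(min(count, max_ref_counts[token])
                  for token, count in cand_counts.items())
    return matched / sum(cand_counts.values())


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant input
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Invented toy data: system outputs, reference sets, and human ratings.
    outputs = ["a cheap hotel", "the hotel is pricey", "there is no match"]
    refs = [["the hotel is cheap", "a cheap hotel is available"],
            ["the hotel is cheap", "it is a cheap hotel"],
            ["the hotel is cheap", "there is a cheap hotel"]]
    human = [5, 3, 1]
    scores = [unigram_precision(o, r) for o, r in zip(outputs, refs)]
    print(scores, spearman(scores, human))
```

Adding a reference can only raise or preserve the clipped match count, which is one intuition for why a valid paraphrase is less likely to be penalized when several references are available.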

Abstract

Automated metrics such as BLEU are widely used in the machine translation literature. They have also been used recently in the dialogue community for evaluating dialogue response generation. However, previous work in dialogue response generation has shown that these metrics do not correlate strongly with human judgment in the non-task-oriented dialogue setting. Task-oriented dialogue responses are expressed on narrower domains and exhibit lower diversity. It is thus reasonable to think that these automated metrics would correlate well with human judgment in the task-oriented setting, where the generation task consists of translating dialogue acts into a sentence. We conduct an empirical study to confirm whether this is the case. Our findings indicate that these automated metrics have stronger correlation with human judgments in the task-oriented setting compared to what has been observed…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems