How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Chia-Wei Liu; Ryan Lowe; Iulian V. Serban; Michael Noseworthy; Laurent Charlin; Joelle Pineau

arXiv:1603.08023 · cs.CL · January 4, 2017 · 614 cites

TL;DR

This paper examines unsupervised automatic evaluation metrics for dialogue response generation systems and shows that they correlate very weakly with human judgments in the non-technical Twitter domain and not at all in the technical Ubuntu domain, highlighting the need for better metrics.

Contribution

The study provides quantitative and qualitative analysis of the weaknesses of existing metrics and offers recommendations for developing more reliable automatic evaluation metrics for dialogue systems.

Findings

01 Metrics correlate very weakly with human judgments in the Twitter domain

02 Metrics do not correlate at all with human judgments in the Ubuntu domain

03 Quantitative and qualitative results identify specific weaknesses in current evaluation metrics

Abstract

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
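The kind of analysis the abstract describes can be sketched in a few lines: score each generated response against its single target response with a word-overlap metric, then check how well those scores track human ratings. The sketch below is illustrative only and is not the authors' code. It assumes clipped unigram precision as a stand-in for the machine-translation metrics, and Spearman rank correlation as the correlation measure; the abstract names neither. The function names and the toy data are hypothetical.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Clipped unigram precision of a candidate against ONE reference.

    Assumption: a simplified stand-in for word-overlap MT metrics,
    using whitespace tokenization and no brevity penalty.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate word is credited at most as often as it occurs in the reference.
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return overlap / len(cand_tokens)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical toy data, invented for illustration (not from the paper):
    # (generated response, single target response, human rating on a 1-5 scale).
    samples = [
        ("try sudo apt-get update", "run sudo apt-get update first", 4),
        ("i do not know", "check the log file in /var/log", 1),
        ("reboot and it should work", "have you tried restarting", 4),
    ]
    metric_scores = [unigram_precision(gen, ref) for gen, ref, _ in samples]
    human_scores = [rating for _, _, rating in samples]
    print("metric scores:", metric_scores)
    print("Spearman vs. human:", spearman(metric_scores, human_scores))
```

The third toy sample shows why comparing against a single target response can mislead: a reply that a human would rate as reasonable shares no words with the reference, so an overlap metric scores it zero.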

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques