Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining

Ananya B. Sai; Akash Kumar Mohankumar; Siddhartha Arora; Mitesh M. Khapra

arXiv:2009.11321·cs.CL·September 25, 2020

TL;DR

This paper introduces DailyDialog++, a dataset that pairs each dialog context with multiple relevant responses and multiple adversarial irrelevant responses, and proposes DEB, a BERT-based evaluation metric that uses large-scale pretraining to improve robustness and correlation with human judgments.

Contribution

The paper provides a new dataset with multiple references and adversarial responses, and develops a pretrained BERT-based evaluation metric that outperforms existing metrics but still faces challenges with adversarial examples.

Findings

1. Existing metrics struggle to distinguish relevant responses from negatives.
2. Large-scale pretraining improves correlation with human judgments.
3. Adversarial responses significantly reduce metric performance.

Abstract

There is an increasing focus on model-based dialog evaluation metrics such as ADEM, RUBER, and the more recent BERT-based metrics. These models aim to assign a high score to all relevant responses and a low score to all irrelevant responses. Ideally, such models should be trained using multiple relevant and irrelevant responses for any given context. However, no such data is publicly available, and hence existing models are usually trained using a single relevant response and multiple randomly selected responses from other contexts (random negatives). To allow for better training and robust evaluation of model-based metrics, we introduce the DailyDialog++ dataset, consisting of (i) five relevant responses for each context and (ii) five adversarially crafted irrelevant responses for each context. Using this dataset, we first show that even in the presence of multiple correct references,…
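The "random negatives" setup that the abstract attributes to existing model-based metrics can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_random_negative_examples`, the `(context, response, label)` layout, and the toy dialogs are hypothetical and are not taken from the paper or its repository.

```python
import random


def build_random_negative_examples(dialogs, num_negatives=2, seed=0):
    """Pair each context with its one relevant response (label 1) and with
    responses sampled from other contexts (label 0, "random negatives").

    `dialogs` is a list of (context, relevant_response) tuples. The data
    layout is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    examples = []
    for i, (context, response) in enumerate(dialogs):
        # The single relevant response for this context.
        examples.append((context, response, 1))
        # Candidate negatives: responses that belong to every other context.
        others = [r for j, (_, r) in enumerate(dialogs) if j != i]
        for negative in rng.sample(others, min(num_negatives, len(others))):
            examples.append((context, negative, 0))
    return examples


# Toy data, invented for illustration only.
dialogs = [
    ("How was your weekend?", "Great, I went hiking."),
    ("What time does the train leave?", "At half past nine."),
    ("Do you like jazz?", "Yes, especially live shows."),
]
examples = build_random_negative_examples(dialogs, num_negatives=2)
for context, response, label in examples:
    print(label, "|", context, "->", response)
```

Negatives sampled this way are usually off-topic and therefore easy to reject, which is the gap that the five adversarially crafted irrelevant responses per context in DailyDialog++ are meant to address.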

Code & Models

Repositories

iitmnlp/Dialogue-Evaluation-with-BERT (TensorFlow, official)
