TL;DR
This paper introduces DailyDialog++, a dataset with multiple relevant and adversarially crafted irrelevant responses per context for dialog evaluation, and proposes DEB, a BERT-based metric pretrained on a large corpus of Reddit conversations, to improve robustness and correlation with human judgments.
Contribution
The paper provides a new dataset with five relevant and five adversarially crafted irrelevant responses for each context, and develops a pretrained BERT-based evaluation metric that outperforms existing metrics but still struggles with adversarial examples.
Findings
Even with multiple correct references, existing metrics struggle to separate relevant responses from irrelevant ones.
Large-scale pretraining improves correlation with human judgments.
Adversarial responses significantly reduce metric performance.
Abstract
There is an increasing focus on model-based dialog evaluation metrics such as ADEM, RUBER, and the more recent BERT-based metrics. These models aim to assign a high score to all relevant responses and a low score to all irrelevant responses. Ideally, such models should be trained using multiple relevant and irrelevant responses for any given context. However, no such data is publicly available, and hence existing models are usually trained using a single relevant response and multiple randomly selected responses from other contexts (random negatives). To allow for better training and robust evaluation of model-based metrics, we introduce the DailyDialog++ dataset, consisting of (i) five relevant responses for each context and (ii) five adversarially crafted irrelevant responses for each context. Using this dataset, we first show that even in the presence of multiple correct references,…
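The random-negative setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `build_random_negative_examples`, the `num_negatives` parameter, and the toy dialogs are all hypothetical, and the sketch assumes each context comes with exactly one relevant response, with negatives drawn uniformly from the responses of other contexts.

```python
import random


def build_random_negative_examples(dialogs, num_negatives=2, seed=0):
    """Pair each context with its own response (label 1) and with
    responses sampled from other contexts (label 0).

    dialogs: list of (context, response) tuples.
    Returns a list of (context, response, label) tuples.
    """
    rng = random.Random(seed)
    examples = []
    for i, (context, response) in enumerate(dialogs):
        # The single relevant response is the only positive example.
        examples.append((context, response, 1))
        # Random negatives: responses that belong to other contexts.
        other_indices = [j for j in range(len(dialogs)) if j != i]
        k = min(num_negatives, len(other_indices))
        for j in rng.sample(other_indices, k):
            examples.append((context, dialogs[j][1], 0))
    return examples


# Toy data, invented for illustration only.
toy_dialogs = [
    ("How was your weekend?", "Great, I went hiking."),
    ("What time is the meeting?", "It starts at three."),
    ("Do you like coffee?", "Yes, I drink it every morning."),
    ("Where is the station?", "Two blocks down on the left."),
]

examples = build_random_negative_examples(toy_dialogs, num_negatives=2)
```

Negatives sampled this way are usually off-topic and easy to reject, which is the limitation the adversarially crafted irrelevant responses in DailyDialog++ are meant to address.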
