Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark

Nouha Dziri; Hannah Rashkin; Tal Linzen; David Reitter

arXiv:2105.00071 · cs.CL · June 29, 2022 · 5 cites

TL;DR

This paper introduces the BEGIN benchmark to evaluate how well dialogue system responses are grounded in relevant background knowledge, revealing current metrics' limitations and the need for improved evaluation methods.

Contribution

The paper presents the BEGIN benchmark with human-annotated data for assessing attribution in knowledge-grounded dialogue systems, and analyzes the shortcomings of existing evaluation metrics.

Findings

01. Current metrics rely on spurious correlations.

02. Metrics do not reliably distinguish attributable abstractive responses from unattributable ones.

03. Performance degrades with longer knowledge sources.

Abstract

Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN), comprised of 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models' responses can be attributed to the given background information. We then use BEGIN to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is…
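
To make the concern about surface-level metrics concrete, here is a minimal sketch of how a token-overlap score can rank a contradictory response above a faithful paraphrase. It is an illustration only: `overlap_precision` is a hypothetical toy metric, not necessarily one of the eight metrics analyzed in the paper, and the example sentences are invented rather than drawn from BEGIN.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_precision(response, knowledge):
    """Fraction of response tokens that also occur in the knowledge text.

    Hypothetical toy metric, used here only to illustrate the failure mode.
    """
    resp = tokens(response)
    if not resp:
        return 0.0
    known = set(tokens(knowledge))
    return sum(t in known for t in resp) / len(resp)


# Invented example, not taken from the BEGIN data.
knowledge = "The Eiffel Tower was completed in 1889 and is located in Paris."

# Faithful but abstractive: reworded, so it shares few tokens with the source.
abstractive = "Paris is home to it, and construction finished in 1889."

# Contradicts the source while copying most of its wording.
contradictory = "The Eiffel Tower was never completed and is not located in Paris."

score_abstractive = overlap_precision(abstractive, knowledge)
score_contradictory = overlap_precision(contradictory, knowledge)

print(f"abstractive:   {score_abstractive:.2f}")
print(f"contradictory: {score_contradictory:.2f}")
```

In this toy case the contradictory response scores higher than the faithful paraphrase, because the score rewards copied wording rather than supported content. Human attribution labels are what make it possible to detect this kind of mismatch.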

Code & Models

Repositories

google/BEGIN-dataset (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems