$Q^{2}$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering

Or Honovich; Leshem Choshen; Roee Aharoni; Ella Neeman; Idan Szpektor; Omri Abend

arXiv:2104.08202·cs.CL·September 10, 2021·1 cites

TL;DR

This paper introduces $Q^2$, an automatic metric for evaluating factual consistency in knowledge-grounded dialogues, using question generation and answering, which outperforms previous token-based methods in correlation with human judgments.

Contribution

The paper proposes $Q^2$, a metric that combines question generation and question answering with NLI-based comparison of answer spans, along with a new manually annotated dataset of dialogue system outputs for evaluating factual consistency.

Findings

01

$Q^2$ shows higher correlation with human judgments than previous metrics.

02

The curated dataset enables better evaluation of factual consistency in dialogue models.

03

$Q^2$ effectively captures factual correctness through question-answering comparisons.

Abstract

Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the knowledge they rely on, making them unreliable and limiting their applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization, we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue using automatic question generation and question answering. Our metric, denoted $Q^{2}$, compares answer spans using natural language inference (NLI), instead of token-based matching as done in previous work. To foster proper evaluation, we curate a novel dataset of dialogue system outputs for the Wizard-of-Wikipedia dataset, manually annotated for factual consistency. We perform a thorough meta-evaluation of $Q^{2}$ against other metrics using this dataset and two others, where it consistently shows higher…
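
The pipeline the abstract describes (generate questions from the response, answer them against the grounding knowledge, then compare the two answer spans with NLI) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every function name is hypothetical, the three components are passed in as plain callables, the toy stand-ins use hard-coded lookups instead of real models, and averaging a binary entailment decision over questions is an assumption made here for simplicity.

```python
from typing import Callable, List, Tuple

# Hypothetical component signatures, chosen for illustration only.
QuestionGenerator = Callable[[str], List[Tuple[str, str]]]  # response -> [(question, answer span)]
QuestionAnswerer = Callable[[str, str], str]                # (question, knowledge) -> answer span
EntailmentCheck = Callable[[str, str], bool]                # (premise, hypothesis) -> entailed?


def consistency_score(
    response: str,
    knowledge: str,
    generate_questions: QuestionGenerator,
    answer_question: QuestionAnswerer,
    entails: EntailmentCheck,
) -> float:
    """Fraction of generated questions whose knowledge-based answer
    entails the answer span taken from the response."""
    pairs = generate_questions(response)
    if not pairs:
        return 0.0  # assumption: no questions means nothing could be verified
    consistent = 0
    for question, response_span in pairs:
        knowledge_span = answer_question(question, knowledge)
        if entails(knowledge_span, response_span):
            consistent += 1
    return consistent / len(pairs)


# Toy stand-ins so the sketch runs without any models.
def toy_generate_questions(response: str) -> List[Tuple[str, str]]:
    # A real system would extract answer spans and generate questions with a model.
    return [
        ("What color is the sky?", "blue"),
        ("How many moons does Earth have?", "two"),
    ]


def toy_answer_question(question: str, knowledge: str) -> str:
    # A real system would run extractive QA over the knowledge text.
    answers = {
        "What color is the sky?": "Blue",
        "How many moons does Earth have?": "one",
    }
    return answers.get(question, "")


def toy_entails(premise: str, hypothesis: str) -> bool:
    # Placeholder for an NLI model; case-insensitive equality only.
    return premise.strip().lower() == hypothesis.strip().lower()


score = consistency_score(
    "The sky is blue and Earth has two moons.",
    "The sky appears blue. Earth has one moon.",
    toy_generate_questions,
    toy_answer_question,
    toy_entails,
)
```

In this toy run one of the two answer spans matches the knowledge, so `score` is 0.5. Replacing `toy_entails` with an actual NLI model is where such a metric would depart from token-based matching, since paraphrased but equivalent spans could then count as consistent.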

Code & Models

Repositories

orhonovich/q-squared
pytorch

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems