Ditch the Gold Standard: Re-evaluating Conversational Question Answering

Huihan Li; Tianyu Gao; Manan Goenka; Danqi Chen

arXiv:2112.08812·cs.CL·March 23, 2022

TL;DR

This paper assesses state-of-the-art conversational QA systems through a large-scale human evaluation. It shows that human-machine conversations differ substantially from the human-human conversations used in existing benchmarks, and it proposes a question-rewriting approach to improve automatic evaluation.

Contribution

The paper presents the first large-scale human evaluation of conversational QA models, shows that human evaluation and gold-history evaluation disagree on model ranking, and introduces a question-rewriting approach that brings automatic metrics closer to human judgments.

Findings

01 Human-machine conversations differ drastically in distribution from human-human conversations.

02 Static evaluation against gold conversation history may not reflect real-world performance: it disagrees with human evaluation on model ranking.

03 Question rewriting improves the correlation between automatic and human judgments.

Abstract

Conversational question answering aims to provide natural-language answers to users in information-seeking conversations. Existing conversational QA benchmarks compare models with pre-collected human-human conversations, using ground-truth answers provided in the conversational history. It remains unclear whether we can rely on this static evaluation for model development and whether current systems generalize well to real-world human-machine conversations. In this work, we conduct the first large-scale human evaluation of state-of-the-art conversational QA systems, where human evaluators converse with models and judge the correctness of their answers. We find that the distribution of human-machine conversations differs drastically from that of human-human conversations, and that human and gold-history evaluation disagree in terms of model ranking. We further…
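One way to quantify the ranking disagreement the abstract describes is a rank correlation between the scores each evaluation protocol assigns to the same set of models. The sketch below is illustrative only and is not taken from the paper: the function `kendall_tau`, the model names, and every score are hypothetical placeholders, and Kendall's tau is an assumed choice of agreement measure.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {model_name: score} dicts.

    Returns 1.0 when both protocols rank every pair of models the same
    way and -1.0 when they rank every pair in opposite order.
    """
    models = sorted(scores_a)
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        # Positive product: both protocols order this pair the same way.
        sign = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical scores, NOT results from the paper.
gold_history = {"model_a": 70.0, "model_b": 68.0, "model_c": 65.0, "model_d": 60.0}
human_eval = {"model_a": 80.0, "model_b": 84.0, "model_c": 78.0, "model_d": 75.0}

tau = kendall_tau(gold_history, human_eval)
print(f"Kendall tau between protocols: {tau:.3f}")
```

A value well below 1.0 would indicate that the two protocols order the models differently, which is the kind of disagreement that makes an automatic metric unreliable for model selection.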

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques