Evaluating Open-Domain Question Answering in the Era of Large Language Models

Ehsan Kamalloo; Nouha Dziri; Charles L. A. Clarke; Davood Rafiei

arXiv:2305.06984·cs.CL·July 10, 2023·5 cites

TL;DR

This paper critically examines the limitations of lexical matching for open-domain QA evaluation, shows that it substantially underestimates LLMs such as InstructGPT, and assesses alternative evaluation methods, including regex matching and automated evaluation models, concluding that human judgment remains essential.

Contribution

It provides a comprehensive analysis of QA evaluation challenges, demonstrates the inadequacy of lexical matching for LLM answers, and evaluates alternative automated and regex-based methods.

Findings

01. When answers are judged by humans rather than by lexical matching, the measured performance of InstructGPT (zero-shot) increases by nearly 60%.

02. Over 50% of lexical matching failures are due to semantically equivalent answers.

03. Regex matching aligns better with human judgments than lexical matching.
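
To make the contrast between lexical matching and regex matching concrete, here is a minimal sketch. It assumes a commonly used answer normalization (lowercasing, stripping punctuation and English articles); this is not confirmed to be the paper's exact implementation. The function names, the example answers, and the regex pattern are all hypothetical and chosen only for illustration.

```python
import re
import string
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate: str, gold_answers: Iterable[str]) -> bool:
    """Lexical matching: the normalized candidate must equal some normalized gold answer."""
    cand = normalize(candidate)
    return any(cand == normalize(gold) for gold in gold_answers)


def regex_match(candidate: str, gold_patterns: Iterable[str]) -> bool:
    """Regex matching: some gold pattern must be found anywhere in the candidate."""
    return any(re.search(p, candidate, flags=re.IGNORECASE) for p in gold_patterns)


# Hypothetical example: a short extractive-style answer and a longer generative-style answer.
gold_answers = ["William Shakespeare"]
gold_patterns = [r"\b(William\s+)?Shakespeare\b"]

short_answer = "William Shakespeare."
long_answer = "Hamlet was written by William Shakespeare."

short_em = exact_match(short_answer, gold_answers)   # matches after normalization
long_em = exact_match(long_answer, gold_answers)     # fails: extra words break equality
long_rx = regex_match(long_answer, gold_patterns)    # matches: pattern found inside the answer
```

In this sketch the longer answer is correct but fails exact match, which is the kind of failure the findings attribute to lexical matching. The regex variant accepts it, but it still depends on how the gold patterns are written, so it can remain stricter than a human judge.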

Abstract

Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the…

Code & Models

Repositories

ehsk/openqa-eval (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications