Evaluating Open-Domain Question Answering in the Era of Large Language Models
Ehsan Kamalloo, Nouha Dziri, Charles L. A. Clarke, Davood Rafiei

TL;DR
This paper critically examines the limitations of lexical matching for open-domain QA evaluation. It shows that manual evaluation credits LLMs such as InstructGPT with substantially higher performance than lexical matching does, and it assesses alternative evaluation methods, including regex matching and automated evaluation models, while finding that human judgment remains essential.
Contribution
It provides a comprehensive analysis of QA evaluation challenges, demonstrates the inadequacy of lexical matching for LLM answers, and evaluates alternative automated and regex-based methods.
Findings
Under manual evaluation, the measured performance of InstructGPT (zero-shot) increases by nearly 60%.
Over 50% of lexical matching failures are due to semantically equivalent answers.
Regex matching aligns better with human judgments than lexical matching.
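The gap between lexical matching and regex matching can be illustrated with a minimal Python sketch. It assumes a SQuAD-style normalization (lowercasing, stripping punctuation and articles), which is a common convention and not necessarily the exact procedure used in the paper. The function names, the gold answer, and the candidate answers are hypothetical, chosen only for illustration.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(candidate: str, gold_answers: list[str]) -> bool:
    """Lexical matching: the normalized candidate must equal a gold answer."""
    target = normalize(candidate)
    return any(target == normalize(gold) for gold in gold_answers)


def regex_match(candidate: str, patterns: list[str]) -> bool:
    """Regex matching: any gold pattern may occur anywhere in the candidate."""
    return any(re.search(p, candidate, flags=re.IGNORECASE) for p in patterns)


# Hypothetical example data (not taken from NQ-open).
gold_answers = ["Ottawa"]
gold_patterns = [r"\bOttawa\b"]

short_answer = "Ottawa"
long_answer = "The capital of Canada is Ottawa, Ontario."

print(exact_match(short_answer, gold_answers))   # True
print(exact_match(long_answer, gold_answers))    # False: correct but too long
print(regex_match(long_answer, gold_patterns))   # True
```

The long answer is correct but fails exact match because it contains more than the gold string, which is the failure mode that longer LLM answers aggravate. The regex check accepts it, although a pattern written this way still rejects paraphrases that never mention the gold string.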
Abstract
Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-open, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
