TL;DR
NoiseQA is a challenge set for evaluating question answering systems on the kinds of input real users produce through interfaces such as speech, typing, and translation. It shows that noise from these upstream sources degrades system performance and argues for evaluation that accounts for real-world use.
Contribution
The paper presents NoiseQA, a new benchmark for assessing QA systems under realistic, noisy user scenarios, revealing significant performance degradation from upstream errors.
Findings
Noise introduced by components upstream of the answering engine can substantially degrade QA performance, even for powerful pre-trained models.
Current benchmarks overlook real-world user input variability.
There is a need for evaluation methods that consider practical deployment conditions.
Abstract
When Question-Answering (QA) systems are deployed in the real world, users query them through a variety of interfaces, such as speaking to voice assistants, typing questions into a search engine, or even translating questions to languages supported by the QA system. While there has been significant community attention devoted to identifying correct answers in passages assuming a perfectly formed question, we show that components in the pipeline that precede an answering engine can introduce varied and considerable sources of error, and performance can degrade substantially based on these upstream noise sources even for powerful pre-trained QA models. We conclude that there is substantial room for progress before QA systems can be effectively deployed, highlight the need for QA evaluation to expand to consider real-world use, and hope that our findings will spur greater community…
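The idea of upstream noise can be made concrete with a small sketch. The code below is an illustration only, not the paper's method or the NoiseQA construction procedure: it perturbs typed questions with a hypothetical keyboard-neighbor map and compares exact-match scores on clean and noisy inputs. The names `QWERTY_NEIGHBORS`, `add_keyboard_noise`, `evaluate`, and `toy_qa` are all assumptions introduced for this example, and `toy_qa` is a stand-in for a real QA model.

```python
import random

# Hypothetical adjacency map for a few QWERTY keys (illustrative, not from the paper).
QWERTY_NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "r": "edft", "s": "awedxz", "t": "rfgy",
}

def add_keyboard_noise(question, rate, rng):
    """Replace each mapped character with a neighboring key with probability `rate`."""
    out = []
    for ch in question:
        neighbors = QWERTY_NEIGHBORS.get(ch.lower())
        if neighbors and rng.random() < rate:
            out.append(rng.choice(neighbors))
        else:
            out.append(ch)
    return "".join(out)

def exact_match(prediction, gold):
    """Return 1.0 if the prediction equals the gold answer (case-insensitive), else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def evaluate(qa_system, examples, perturb=None):
    """Mean exact match over (question, context, answer) triples.

    If `perturb` is given, it is applied to each question before answering,
    simulating an error introduced before the answering engine.
    """
    scores = []
    for question, context, answer in examples:
        if perturb is not None:
            question = perturb(question)
        scores.append(exact_match(qa_system(question, context), answer))
    return sum(scores) / len(scores)

def toy_qa(question, context):
    """Stand-in for a real model: answers only when the question is recognized verbatim."""
    known = {"who wrote hamlet?": "Shakespeare"}
    return known.get(question.lower(), "")

examples = [("Who wrote Hamlet?", "Hamlet is a play by Shakespeare.", "Shakespeare")]
rng = random.Random(0)

clean_score = evaluate(toy_qa, examples)
noisy_score = evaluate(
    toy_qa, examples, perturb=lambda q: add_keyboard_noise(q, rate=1.0, rng=rng)
)
print(clean_score, noisy_score)
```

Passing the perturbation as a function keeps the evaluation loop unchanged across noise types, so a speech-recognition or machine-translation simulator could be swapped in for the keyboard one under the same interface.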
