TL;DR
This paper introduces systematic probes, built automatically from expert knowledge, to evaluate what definitional and taxonomic knowledge state-of-the-art QA models actually possess, and shows where they succeed and fail at lexical and hierarchical reasoning.
Contribution
It presents a methodology for automatically building controlled knowledge probes from expert sources such as knowledge graphs and lexical taxonomies, enabling a more comprehensive evaluation of what QA models know.
Findings
QA models recognize some lexical knowledge but struggle with hierarchical reasoning.
Performance drops as the number of hops in the taxonomic hierarchy grows and as more challenging distractor answers are introduced.
Even when models answer individual questions correctly, they leave room for improvement when evaluated over clusters of semantically related probes.
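To make the contrast between instance-level and cluster-based evaluation concrete, here is a minimal sketch. It assumes a strict definition in which a cluster counts as correct only if every probe in it is answered correctly; the paper's exact metric may differ, and the `results` records and field names are hypothetical.

```python
from collections import defaultdict


def instance_accuracy(results):
    """Fraction of individual probes answered correctly."""
    return sum(r["correct"] for r in results) / len(results)


def cluster_accuracy(results):
    """Fraction of clusters in which every probe is answered correctly.

    Assumption: a strict all-or-nothing criterion per cluster.
    """
    clusters = defaultdict(list)
    for r in results:
        clusters[r["cluster"]].append(r["correct"])
    return sum(all(flags) for flags in clusters.values()) / len(clusters)


# Hypothetical model outputs, grouped by the concept each probe asks about.
results = [
    {"cluster": "dog", "correct": True},
    {"cluster": "dog", "correct": True},
    {"cluster": "dog", "correct": False},
    {"cluster": "oak", "correct": True},
    {"cluster": "oak", "correct": True},
]

print(instance_accuracy(results))  # 0.8
print(cluster_accuracy(results))   # 0.5
```

In this toy case a single wrong answer about "dog" costs one probe at the instance level but invalidates the whole cluster, which is why cluster-level scores can sit well below instance-level ones.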
Abstract
Open-domain question answering (QA) is known to involve several underlying knowledge and reasoning challenges, but are models actually learning such knowledge when trained on benchmark tasks? To investigate this, we introduce several new challenge tasks that probe whether state-of-the-art QA models have general knowledge about word definitions and general taxonomic reasoning, both of which are fundamental to more complex forms of reasoning and are widespread in benchmark datasets. As an alternative to expensive crowd-sourcing, we introduce a methodology for automatically building datasets from various types of expert knowledge (e.g., knowledge graphs and lexical taxonomies), allowing for systematic control over the resulting probes and for a more comprehensive evaluation. We find automatically constructing probes to be vulnerable to annotation artifacts, which we carefully control for.…
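The idea of building probes automatically from a lexical taxonomy can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' pipeline: the `TAXONOMY` data, the question template, and the `make_probe` helper are all hypothetical, and distractors are sampled uniformly rather than chosen to be challenging.

```python
import random

# Toy hypernym taxonomy (child -> parent). Hypothetical data for illustration.
TAXONOMY = {
    "poodle": "dog",
    "beagle": "dog",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "sparrow": "bird",
    "bird": "animal",
    "oak": "tree",
    "tree": "plant",
}


def ancestors(concept):
    """Return the ancestors of `concept`, nearest first."""
    chain = []
    while concept in TAXONOMY:
        concept = TAXONOMY[concept]
        chain.append(concept)
    return chain


def make_probe(concept, hops, num_distractors=3, rng=None):
    """Build one multiple-choice hypernym probe.

    `hops` controls how far up the hierarchy the gold answer sits.
    Returns None if the concept has no ancestor at that distance.
    """
    rng = rng or random.Random(0)
    chain = ancestors(concept)
    if hops < 1 or hops > len(chain):
        return None
    answer = chain[hops - 1]
    all_concepts = set(TAXONOMY) | set(TAXONOMY.values())
    # Distractors exclude the query concept and all of its ancestors,
    # so no distractor is also a valid hypernym.
    pool = sorted(all_concepts - set(chain) - {concept})
    choices = rng.sample(pool, num_distractors) + [answer]
    rng.shuffle(choices)
    return {
        "question": f"Which of the following is a hypernym of '{concept}'?",
        "choices": choices,
        "answer": answer,
        "hops": hops,
    }


probe = make_probe("poodle", hops=2)
print(probe["question"], probe["choices"], probe["answer"])
```

Because the gold answer and distractors come directly from the taxonomy, the number of hops and the distractor pool can be varied systematically, which is the kind of control the abstract describes.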
