Analysis of Wikipedia-based Corpora for Question Answering
Tomasz Jurczyk, Amit Deshmane, Jinho D. Choi

TL;DR
This paper provides a comprehensive analysis of Wikipedia-based corpora for question answering, examining their intrinsic properties and their effectiveness across different QA tasks, and introduces a new dataset creation method.
Contribution
It offers detailed intrinsic and extrinsic analyses of four Wikipedia-based QA corpora and proposes an indexing method for generating a silver-standard dataset from Wikipedia.
Findings
Distinct characteristics of each corpus identified
Corpora vary in question types and answer categories
Proposed dataset creation improves answer retrieval
Abstract
This paper gives comprehensive analyses of corpora based on Wikipedia for several tasks in question answering. Four recent corpora are collected, WikiQA, SelQA, SQuAD, and InfoQA, and first analyzed intrinsically by contextual similarities, question types, and answer categories. These corpora are then analyzed extrinsically by three question answering tasks: answer retrieval, selection, and triggering. An indexing-based method for the creation of a silver-standard dataset for answer retrieval using the entire Wikipedia is also presented. Our analysis shows the uniqueness of these corpora and suggests a better use of them for statistical question answering learning.
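As a rough illustration of what an indexing-based silver-standard construction for answer retrieval could look like, the sketch below builds a toy inverted index over paragraphs, retrieves the top-k paragraphs for a question, and assigns a silver label to the retrieved paragraph that best covers a known answer sentence. This is not the authors' pipeline: the IDF-sum scoring, the token-coverage criterion, the 0.5 threshold, and all function names (`build_index`, `retrieve`, `silver_label`) are assumptions made for the example, and the three paragraphs stand in for the entire Wikipedia.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(paragraphs):
    """Map each token to the set of paragraph ids that contain it."""
    index = defaultdict(set)
    for pid, text in enumerate(paragraphs):
        for tok in set(tokenize(text)):
            index[tok].add(pid)
    return index


def retrieve(query, paragraphs, index, k=3):
    """Rank paragraphs by the summed IDF of the query tokens they contain."""
    n = len(paragraphs)
    scores = Counter()
    for tok in set(tokenize(query)):
        postings = index.get(tok, ())
        if not postings:
            continue
        idf = math.log(1 + n / len(postings))
        for pid in postings:
            scores[pid] += idf
    return [pid for pid, _ in scores.most_common(k)]


def silver_label(answer_sentence, candidates, paragraphs, threshold=0.5):
    """Return the candidate covering the most answer-sentence tokens,
    or None when the best coverage is below the (assumed) threshold."""
    ans = set(tokenize(answer_sentence))
    best, best_cov = None, 0.0
    for pid in candidates:
        cov = len(ans & set(tokenize(paragraphs[pid]))) / max(len(ans), 1)
        if cov > best_cov:
            best, best_cov = pid, cov
    return best if best_cov >= threshold else None


# Toy stand-in for indexed Wikipedia paragraphs (hypothetical data).
paragraphs = [
    "Paris is the capital and most populous city of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
]
index = build_index(paragraphs)
candidates = retrieve("What is the capital of France?", paragraphs, index, k=2)
label = silver_label("Paris is the capital of France.", candidates, paragraphs)
```

Labels produced this way are "silver" rather than gold because they come from automatic matching against a known answer sentence instead of human annotation; a question whose answer sentence is not sufficiently covered by any retrieved paragraph simply receives no label.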
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
