Analysis of Wikipedia-based Corpora for Question Answering

Tomasz Jurczyk; Amit Deshmane; Jinho D. Choi

arXiv:1801.02073 · cs.CL · February 6, 2018 · 6 cites

Open Access

TL;DR

This paper provides a comprehensive analysis of Wikipedia-based corpora for question answering, examining their intrinsic properties and their effectiveness across different QA tasks, and introduces a new dataset creation method.

Contribution

It offers detailed intrinsic and extrinsic analyses of four Wikipedia-based QA corpora and proposes an indexing method for generating a silver-standard dataset from Wikipedia.

Findings

01

Distinct characteristics of each corpus identified

02

Corpora vary in question types and answer categories

03

Proposed dataset creation improves answer retrieval

Abstract

This paper gives comprehensive analyses of Wikipedia-based corpora for several tasks in question answering. Four recent corpora, WikiQA, SelQA, SQuAD, and InfoQA, are collected and first analyzed intrinsically by contextual similarities, question types, and answer categories. These corpora are then analyzed extrinsically on three question answering tasks: answer retrieval, selection, and triggering. An indexing-based method for creating a silver-standard dataset for answer retrieval using the entire Wikipedia is also presented. Our analysis shows the uniqueness of these corpora and suggests a better use of them for statistical learning in question answering.
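
To make the answer retrieval task concrete, the following is a minimal sketch of index-based paragraph retrieval. It is not the authors' indexing method, which the abstract does not detail; the toy paragraphs, the function names (`tokenize`, `build_index`, `retrieve`), and the IDF-sum scoring are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
import math
import re


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(paragraphs):
    """Inverted index: token -> set of ids of paragraphs containing it."""
    index = defaultdict(set)
    for pid, text in enumerate(paragraphs):
        for token in set(tokenize(text)):
            index[token].add(pid)
    return index


def retrieve(question, paragraphs, index, k=1):
    """Rank paragraphs by the summed IDF of the question tokens they contain."""
    n = len(paragraphs)
    scores = Counter()
    for token in set(tokenize(question)):
        postings = index.get(token, ())
        if not postings:
            continue  # token appears in no paragraph
        idf = math.log(1 + n / len(postings))  # rarer tokens weigh more
        for pid in postings:
            scores[pid] += idf
    return [pid for pid, _ in scores.most_common(k)]


# Hypothetical toy data standing in for Wikipedia paragraphs.
paragraphs = [
    "Warsaw is the capital and largest city of Poland.",
    "Atlanta is the capital of the U.S. state of Georgia.",
    "The Vistula is the longest river in Poland.",
]
index = build_index(paragraphs)
print(retrieve("What is the capital of Poland?", paragraphs, index))  # [0]
```

In this framing, answer retrieval returns the paragraph most likely to contain the answer, while answer selection and triggering would then operate on the sentences of the retrieved paragraph. A production system would typically use a dedicated search engine rather than an in-memory dictionary.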

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems