Quasar: Datasets for Question Answering by Search and Reading

Bhuwan Dhingra; Kathryn Mazaitis; William W. Cohen

arXiv:1707.03904 · cs.CL · August 10, 2017 · 139 cites

TL;DR

Quasar introduces two large-scale datasets for question answering that combine search and reading components, challenging systems to retrieve relevant information and accurately extract answers from extensive text corpora.

Contribution

The paper presents novel large-scale datasets, Quasar-S and Quasar-T, designed to evaluate and advance factoid question answering systems combining search and reading tasks.

Findings

01. Baseline models lag behind human performance by about 16% on Quasar-S and about 32% on Quasar-T.
02. The datasets enable evaluation of both retrieval and reading comprehension.
03. The datasets are openly available for research.

Abstract

We present two new large-scale datasets aimed at evaluating systems designed to comprehend a natural language query and extract its answer from a large corpus of text. The Quasar-S dataset consists of 37000 cloze-style (fill-in-the-gap) queries constructed from definitions of software entity tags on the popular website Stack Overflow. The posts and comments on the website serve as the background corpus for answering the cloze questions. The Quasar-T dataset consists of 43000 open-domain trivia questions and their answers obtained from various internet sources. ClueWeb09 serves as the background corpus for extracting these answers. We pose these datasets as a challenge for two related subtasks of factoid Question Answering: (1) searching for relevant pieces of text that include the correct answer to a query, and (2) reading the retrieved text to answer the query. We also describe a…
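The two subtasks named in the abstract, searching and then reading, can be illustrated with a toy pipeline. This is a minimal sketch under stated assumptions, not the paper's baseline: the token-overlap retriever, the score-weighted candidate vote, the function names (`tokenize`, `search`, `read`), and the three-passage corpus are all hypothetical and chosen only to show how the two stages fit together on a cloze-style query.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, corpus, k=2):
    """Subtask 1 (search): rank passages by distinct-token overlap with the query."""
    query_tokens = set(tokenize(query))
    scored = [(len(query_tokens & set(tokenize(p))), p) for p in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def read(retrieved, candidates):
    """Subtask 2 (read): return the candidate with the highest score-weighted vote."""
    votes = Counter()
    for score, passage in retrieved:
        tokens = set(tokenize(passage))
        for candidate in candidates:
            if candidate in tokens:
                votes[candidate] += score
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy data: a made-up background corpus and cloze query.
corpus = [
    "numpy is a python library for array computing",
    "pandas builds on numpy to offer data frames in python",
    "django is a web framework written in python",
]
query = "@placeholder is a python library for array computing"
candidates = ["numpy", "pandas", "django"]  # assumed closed answer set

retrieved = search(query, corpus, k=2)
answer = read(retrieved, candidates)
print(answer)
```

Keeping the two stages as separate functions mirrors the framing in the abstract: a reader can only succeed if the search step surfaces a passage containing the answer, so each stage can be evaluated on its own as well as end to end.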

Code & Models

Repositories

bdhingra/quasar · Official

Datasets

sagnikrayc/quasar · dataset · 57 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management