English Machine Reading Comprehension Datasets: A Survey

Daria Dzendzik; Carl Vogel; Jennifer Foster

arXiv:2101.10421 · cs.CL · October 11, 2021

TL;DR

This survey reviews 60 English Machine Reading Comprehension (MRC) datasets, categorizing them by question and answer form and comparing their characteristics to help researchers navigate the landscape of MRC datasets.

Contribution

The survey categorizes and compares existing MRC datasets, highlighting their data sources, question types, and gaps in question diversity.

Findings

01. Wikipedia is by far the most common data source.

02. Why, when, and where questions are relatively rare across datasets.

03. Datasets vary widely in size and question form.

Abstract

This paper surveys 60 English Machine Reading Comprehension datasets, with a view to providing a convenient resource for other researchers interested in this problem. We categorize the datasets according to their question and answer form and compare them across various dimensions including size, vocabulary, data source, method of creation, human performance level, and first question word. Our analysis reveals that Wikipedia is by far the most common data source and that there is a relative lack of why, when, and where questions across datasets.
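One of the comparison dimensions named above, the first question word, can be illustrated with a short tally. The following is a minimal sketch, not the authors' analysis code: the function name `first_word_distribution` and the sample questions are hypothetical, and whitespace tokenization with lowercasing is an assumed simplification.

```python
from collections import Counter


def first_word_distribution(questions):
    """Return the share of questions that start with each first word.

    Assumption: the first whitespace-separated token, lowercased,
    stands in for the "first question word".
    """
    first_words = []
    for question in questions:
        tokens = question.strip().split()
        if tokens:  # skip empty or whitespace-only strings
            first_words.append(tokens[0].lower())
    counts = Counter(first_words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# Hypothetical sample questions, for illustration only.
sample_questions = [
    "What is the capital of France?",
    "Who wrote Hamlet?",
    "What year did the war end?",
    "Why is the sky blue?",
]
dist = first_word_distribution(sample_questions)
```

Running the same tally over each dataset's questions would show how often words such as why, when, and where open a question relative to what or who.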

Code & Models

Repositories

dariad/rczoo (official)