TL;DR
This survey reviews 60 English Machine Reading Comprehension (MRC) datasets, categorizing them by question and answer form and comparing their characteristics to help researchers navigate the landscape of MRC datasets.
Contribution
It provides a comprehensive categorization and comparison of existing MRC datasets, highlighting data sources, question types, and gaps in question diversity.
Findings
Wikipedia is by far the most common data source.
Why, when, and where questions are relatively scarce across datasets.
Datasets vary widely in size and question form.
Abstract
This paper surveys 60 English Machine Reading Comprehension datasets, with a view to providing a convenient resource for other researchers interested in this problem. We categorize the datasets according to their question and answer form and compare them across various dimensions including size, vocabulary, data source, method of creation, human performance level, and first question word. Our analysis reveals that Wikipedia is by far the most common data source and that there is a relative lack of why, when, and where questions across datasets.
