Embracing data abundance: BookTest Dataset for Reading Comprehension
Ondrej Bajgar, Rudolf Kadlec, Jan Kleindienst

TL;DR
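This paper introduces BookTest, a reading comprehension dataset similar to the Children's Book Test (CBT) but more than 60 times larger. Training the Attention-Sum Reader on it improves accuracy on the original CBT test data by a larger margin than many recent architecture changes, and on one version of the dataset an ensemble exceeds the human baseline provided by Facebook.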
Contribution
The paper presents the BookTest dataset, which is more than 60 times larger than the Children's Book Test (CBT) it is modeled on, and shows that the additional training data improves the accuracy of the authors' Attention-Sum Reader model.
Findings
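Training on BookTest improves the Attention-Sum Reader's accuracy on the original CBT test data by a much larger margin than many recent architecture improvements.
On one version of the dataset, the ensemble exceeds the human baseline provided by Facebook.
The authors' own human study indicates that there is still room for further improvement.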
Abstract
There is a practically unlimited amount of natural language data available. Still, recent work in text comprehension has focused on datasets which are small relative to current computing possibilities. This article makes a case for the community to move to larger data and, as a step in that direction, proposes the BookTest, a new dataset similar to the popular Children's Book Test (CBT) but more than 60 times larger. We show that training on the new data improves the accuracy of our Attention-Sum Reader model on the original CBT test data by a much larger margin than many recent attempts to improve the model architecture. On one version of the dataset our ensemble even exceeds the human baseline provided by Facebook. We then show in our own human study that there is still room for further improvement.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
