SimpleBooks: Long-term dependency book dataset with simplified English vocabulary for word-level language modeling

Huyen Nguyen

arXiv:1911.12391·cs.CL·December 2, 2019

TL;DR

SimpleBooks is a dataset of 92 million tokens with a simplified English vocabulary. It is designed to support efficient word-level language modeling with long-term dependencies, and architecture research, by mimicking the properties of larger datasets.

Contribution

It introduces a dataset with a small vocabulary and high average word frequency that still matches large datasets in token count, which supports faster language model training and architecture search.

Findings

1. Contains 92M tokens, comparable to WikiText-103.
2. Has a vocabulary of 98K words, much smaller than WikiText-103.
3. Designed to replicate properties of larger datasets for research convenience.

Abstract

With language modeling becoming the popular base task for unsupervised representation learning in Natural Language Processing, it is important to come up with new architectures and techniques for faster and better training of language models. However, due to a peculiarity of languages -- the larger the dataset, the higher the average number of times a word appears in that dataset -- datasets of different sizes have very different properties. Architectures performing well on small datasets might not perform well on larger ones. For example, LSTM models perform well on WikiText-2 but poorly on WikiText-103, while Transformer models perform well on WikiText-103 but not on WikiText-2. For setups like architectural search, this is a challenge since it is prohibitively costly to run a search on the full dataset but it is not indicative to experiment on smaller ones. In this paper, we…
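The frequency property the abstract describes can be made concrete as a ratio of total tokens to distinct word types. The sketch below is illustrative and not taken from the paper: `average_word_frequency` is a hypothetical helper, whitespace splitting stands in for whatever tokenizer the dataset actually uses, and the SimpleBooks figures are the rounded values quoted under Findings (92M tokens, 98K words).

```python
from collections import Counter


def average_word_frequency(tokens):
    """Return (num_tokens, vocab_size, tokens per word type) for a token sequence."""
    counts = Counter(tokens)
    num_tokens = sum(counts.values())
    vocab_size = len(counts)
    if vocab_size == 0:
        raise ValueError("empty token sequence")
    return num_tokens, vocab_size, num_tokens / vocab_size


# Rounded figures quoted on this page for SimpleBooks (assumed, not exact counts).
simplebooks_avg = 92_000_000 / 98_000  # roughly 939 occurrences per word type

# Toy corpus; whitespace tokenization is an assumption made for illustration.
toy_tokens = "the cat sat on the mat and the dog sat too".split()
toy_stats = average_word_frequency(toy_tokens)

print(f"SimpleBooks average frequency: {simplebooks_avg:.0f}")
print(f"Toy corpus (tokens, vocab, average): {toy_stats}")
```

A higher ratio means each word is seen more often during training, which is the property the abstract ties to dataset size.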

Code & Models

Models

🤗
1torriani/xperyv4
model

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Sigmoid Activation · Tanh Activation · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing