An Analysis of Neural Language Modeling at Multiple Scales
Stephen Merity, Nitish Shirish Keskar, Richard Socher

TL;DR
This paper shows that existing LSTM and QRNN language models, when extended to larger vocabularies and character-level granularity and properly tuned, achieve state-of-the-art results on multiple datasets while training in 12 hours to 2 days on a single GPU.
Contribution
It shows that simple extensions of existing models to larger vocabularies and character-level tasks can match or surpass novel, complex, and specialized architectures.
Findings
LSTMs achieve state-of-the-art results on the character-level datasets (Penn Treebank, enwik8), and QRNNs on the word-level dataset (WikiText-103).
Training takes from 12 hours (WikiText-103) to 2 days (enwik8) on a single modern GPU.
Extending models to larger vocabularies and character granularity is effective.
Abstract
Many of the leading approaches in language modeling introduce novel, complex and specialized architectures. We take existing state-of-the-art word level language models based on LSTMs and QRNNs and extend them to both larger vocabularies as well as character-level granularity. When properly tuned, LSTMs and QRNNs achieve state-of-the-art results on character-level (Penn Treebank, enwik8) and word-level (WikiText-103) datasets, respectively. Results are obtained in only 12 hours (WikiText-103) to 2 days (enwik8) using a single modern GPU.
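To make the difference between word-level and character-level granularity concrete, here is a minimal Python sketch. It is not the authors' code: the sample sentence and the helper name `build_vocab` are hypothetical and used only for illustration. It shows that the same text becomes a short sequence of word tokens or a much longer sequence of character tokens.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


# Hypothetical sample text, not drawn from any of the paper's datasets.
text = "the cat sat on the mat"

# Word-level granularity: split on whitespace.
word_tokens = text.split()
word_vocab = build_vocab(word_tokens)
word_ids = [word_vocab[t] for t in word_tokens]

# Character-level granularity: every character, including spaces, is a token.
char_tokens = list(text)
char_vocab = build_vocab(char_tokens)
char_ids = [char_vocab[t] for t in char_tokens]

print("word tokens:", len(word_tokens), "distinct:", len(word_vocab))
print("char tokens:", len(char_tokens), "distinct:", len(char_vocab))
```

On this toy sentence the character sequence is several times longer than the word sequence. The sentence is too short to show the vocabulary effect: on a realistically sized corpus the word vocabulary grows with the corpus, while the character vocabulary stays small.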
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
