Regularizing and Optimizing LSTM Language Models

Stephen Merity; Nitish Shirish Keskar; Richard Socher

arXiv:1708.02182 · cs.CL · August 9, 2017 · 468 cites

TL;DR

This paper introduces regularization and optimization techniques for LSTM language models and reaches state-of-the-art perplexities on Penn Treebank and WikiText-2 using DropConnect on the recurrent weights and NT-ASGD, an averaged SGD variant with an automatically determined averaging trigger.

Contribution

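It proposes two techniques: the weight-dropped LSTM, which applies DropConnect to the hidden-to-hidden weights as recurrent regularization, and NT-ASGD, a variant of averaged stochastic gradient descent whose averaging trigger is set by a non-monotonic condition rather than tuned by the user.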
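A minimal sketch of the DropConnect idea may help here. It is not the authors' implementation: a toy tanh recurrence stands in for the LSTM, and the function names, the rescaling of surviving weights by 1 / (1 - drop_prob), and the choice to sample the mask once per forward pass are assumptions of this example.

```python
import math
import random


def drop_connect(weights, drop_prob, rng):
    """Return a copy of `weights` with each entry zeroed with probability drop_prob.

    Surviving entries are rescaled by 1 / (1 - drop_prob); this
    inverted-dropout convention is an assumption of the sketch.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - drop_prob)
    return [
        [0.0 if rng.random() < drop_prob else w * keep_scale for w in row]
        for row in weights
    ]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def run_recurrence(inputs, w_hh, drop_prob, rng):
    """Toy recurrence h_t = tanh(x_t + W_hh h_{t-1}), standing in for an LSTM."""
    # Sample the mask once, before the time loop, so the same dropped
    # hidden-to-hidden weights are reused at every timestep of this pass.
    w_dropped = drop_connect(w_hh, drop_prob, rng)
    hidden = [0.0] * len(w_hh)
    for x_t in inputs:
        recurrent = matvec(w_dropped, hidden)
        hidden = [math.tanh(x + r) for x, r in zip(x_t, recurrent)]
    return hidden
```

The point of the sketch is that the dropout acts on the weight matrix rather than on the activations, so the recurrence itself is left unchanged.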

Findings

01

Achieved state-of-the-art perplexity of 57.3 on Penn Treebank

02

Achieved state-of-the-art perplexity of 65.8 on WikiText-2

03

Further improved perplexity to 52.8 on Penn Treebank and 52.0 on WikiText-2 with a neural cache
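For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below assumes natural-log probabilities; the function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability assigned to each token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A model that spreads probability uniformly over four candidate words at every position has a perplexity of about 4, so a perplexity of 57.3 roughly corresponds to choosing among 57 equally likely words per position.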

Abstract

Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization. Further, we introduce NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user. Using these and other regularization strategies, we achieve state-of-the-art word level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the…
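The abstract does not spell out the non-monotonic condition, so the following is a hedged sketch of one way such a trigger could work: averaging starts once the validation perplexity is worse than the best value recorded more than `patience` checks earlier. The parameter name `patience`, the helper names, and the exact comparison are assumptions of this example.

```python
def should_start_averaging(val_history, new_val, patience):
    """True when new_val is worse than the best value logged more than `patience` checks ago."""
    if len(val_history) <= patience:
        return False
    older = val_history[:len(val_history) - patience]
    return new_val > min(older)


def find_trigger(val_perplexities, patience):
    """Index of the validation check at which averaging would begin, or None."""
    history = []
    for check, val in enumerate(val_perplexities):
        if should_start_averaging(history, val, patience):
            return check
        history.append(val)
    return None


def average_iterates(iterates):
    """Uniform average of the parameter vectors collected after the trigger."""
    count = len(iterates)
    return [sum(vals) / count for vals in zip(*iterates)]
```

With validation perplexities 100, 90, 85, 86, 84, 83, 87 and `patience=2`, the check fires at index 6, where 87 exceeds 85, the best value among the first four checks. A strictly improving curve never triggers averaging.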

Code & Models

Models

🤗 nosdigitalmedia/dutch-youth-comment-classifier · model · ♡ 3

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms

Methods: Neural Cache · Dropout · Sigmoid Activation · Tanh Activation · Embedding Dropout · Variational Dropout · Weight Tying · Temporal Activation Regularization · Activation Regularization · Non-monotonically Triggered ASGD
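Two of the listed methods, Activation Regularization (AR) and Temporal Activation Regularization (TAR), are penalty terms added to the training loss. The sketch below shows one common reading of them: AR penalizes large hidden activations, and TAR penalizes large changes between consecutive hidden states. The use of a mean of squares, and the names `alpha` and `beta` for the coefficients, are assumptions of this example rather than details given on this page.

```python
def activation_regularization(hidden_states, alpha):
    """alpha times the mean squared activation over all timesteps and units."""
    squares = [h * h for state in hidden_states for h in state]
    return alpha * sum(squares) / len(squares)


def temporal_activation_regularization(hidden_states, beta):
    """beta times the mean squared difference between consecutive hidden states."""
    diffs = [
        (nxt_h - prev_h) ** 2
        for prev, nxt in zip(hidden_states, hidden_states[1:])
        for prev_h, nxt_h in zip(prev, nxt)
    ]
    if not diffs:
        return 0.0  # fewer than two timesteps: nothing to compare
    return beta * sum(diffs) / len(diffs)
```

Both terms would be added to the language-modeling loss during training, with `hidden_states` holding one list of unit activations per timestep.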