N-gram Language Modeling using Recurrent Neural Network Estimation

Ciprian Chelba; Mohammad Norouzi; Samy Bengio

arXiv:1703.10724·cs.CL·June 21, 2017·33 cites

TL;DR

This paper uses LSTM-based recurrent networks to estimate n-gram language models. On UPenn Treebank, the LSTM n-gram matches a full LSTM language model at n = 9 and, unlike Katz or Kneser-Ney back-off estimators, keeps improving as the n-gram order grows.

Contribution

The paper introduces LSTM n-gram smoothing, shows that it is effective for long contexts, and reports practical benefits over traditional smoothing models, especially on large-scale data.

Findings

01. LSTM n-gram models outperform traditional smoothing methods for long contexts

02. Performance improves with increasing n-gram order, up to 13

03. LSTM n-gram smoothing is effective at large scale, e.g., the One Billion Words benchmark

Abstract

We investigate the effective memory depth of RNN models by using them for $n$-gram language model (LM) smoothing. Experiments on a small corpus (UPenn Treebank, one million words of training data and 10k vocabulary) have found the LSTM cell with dropout to be the best model for encoding the $n$-gram state when compared with feed-forward and vanilla RNN models. When preserving the sentence independence assumption the LSTM $n$-gram matches the LSTM LM performance for $n = 9$ and slightly outperforms it for $n = 13$. When allowing dependencies across sentence boundaries, the LSTM 13-gram almost matches the perplexity of the unlimited history LSTM LM. LSTM $n$-gram smoothing also has the desirable property of improving with increasing $n$-gram order, unlike the Katz or Kneser-Ney back-off estimators. Using multinomial distributions as targets in training instead of the usual one-hot…
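To make the n-gram restriction concrete, the following is a minimal sketch of how (context, target) pairs might be formed when each prediction may see only the previous n-1 words and sentences are treated independently. The function name `ngram_examples` and the padding tokens `<S>` and `</S>` are illustrative assumptions, not taken from the paper; in the LSTM n-gram setting, a recurrent network would encode each fixed-length context instead of an unbounded history.

```python
from typing import Iterator, List, Tuple

BOS = "<S>"   # hypothetical sentence-start padding token (assumption)
EOS = "</S>"  # hypothetical sentence-end token (assumption)


def ngram_examples(
    sentence: List[str], n: int
) -> Iterator[Tuple[Tuple[str, ...], str]]:
    """Yield (context, target) pairs whose context holds exactly n-1 tokens.

    The sentence is padded on the left with n-1 start tokens, so no context
    ever reaches across a sentence boundary (sentence independence).
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    padded = [BOS] * (n - 1) + sentence + [EOS]
    for i in range(n - 1, len(padded)):
        # The context is the n-1 tokens immediately preceding position i.
        yield tuple(padded[i - n + 1:i]), padded[i]


examples = list(ngram_examples(["the", "cat", "sat"], n=3))
for context, target in examples:
    print(context, "->", target)
```

Raising `n` widens the window each prediction can use, which is the quantity the abstract varies when comparing n = 9 and n = 13 against the unlimited-history LSTM LM.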

Taxonomy

Topics: Topic Modeling · Stochastic Gradient Optimization Techniques · Speech Recognition and Synthesis

Methods: Sigmoid Activation · Tanh Activation · Dropout · Long Short-Term Memory