Improved Language Modeling by Decoding the Past
Siddhartha Brahma

TL;DR
This paper introduces Past Decode Regularization (PDR), a novel method that improves language modeling by encouraging models to retain more contextual information through decoding the last token, leading to state-of-the-art results.
Contribution
The paper proposes PDR, a simple yet effective regularization technique that enhances language models' ability to utilize context without significant computational overhead.
Findings
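Achieves state-of-the-art word-level perplexity with a single softmax: 55.6 on Penn Treebank and 63.5 on WikiText-2.
Reaches 1.169 bits-per-character on the Penn Treebank Character dataset for character-level language modeling.
Lowers word-level perplexity further, to 53.8 on Penn Treebank and 60.5 on WikiText-2, when combined with a mixture-of-softmaxes.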
Abstract
Highly regularized LSTMs achieve impressive results on several benchmark datasets in language modeling. We propose a new regularization method based on decoding the last token in the context using the predicted distribution of the next token. This biases the model towards retaining more contextual information, in turn improving its ability to predict the next token. With negligible overhead in the number of parameters and training time, our Past Decode Regularization (PDR) method achieves a word level perplexity of 55.6 on the Penn Treebank and 63.5 on the WikiText-2 datasets using a single softmax. We also show gains by using PDR in combination with a mixture-of-softmaxes, achieving a word level perplexity of 53.8 and 60.5 on these datasets. In addition, our method achieves 1.169 bits-per-character on the Penn Treebank Character dataset for character level language modeling. These…
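The abstract describes the idea only at a high level, so the following is a minimal sketch of how a past-decode term could be added to a next-token loss, not the paper's implementation. It assumes the predicted next-token distribution is turned into an expected embedding, passed through a small linear decoder, and scored against the last context token with a softmax over the same embedding matrix. The names `pdr_loss_sketch`, `embedding`, `decoder`, and `pdr_weight` are hypothetical, and the toy values are for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pdr_loss_sketch(next_logits, last_token, next_token,
                    embedding, decoder, pdr_weight):
    """Next-token cross-entropy plus a weighted past-decode term.

    next_logits: model scores over the vocabulary for the next token
    last_token:  index of the last token in the context (decoding target)
    next_token:  index of the true next token
    embedding:   vocab x dim matrix (assumed shared by both softmaxes)
    decoder:     dim x dim matrix (assumed form of the past decoder)
    pdr_weight:  assumed scalar weight on the regularization term
    """
    dim = len(embedding[0])

    # Standard language-modeling loss on the next token.
    p_next = softmax(next_logits)
    next_token_loss = -math.log(p_next[next_token])

    # Expected embedding under the predicted next-token distribution.
    expected = [sum(p * row[d] for p, row in zip(p_next, embedding))
                for d in range(dim)]

    # Linear decoder, then scores against every vocabulary embedding.
    hidden = [sum(expected[i] * decoder[i][j] for i in range(dim))
              for j in range(dim)]
    past_logits = [sum(h * e for h, e in zip(hidden, row))
                   for row in embedding]

    # Cross-entropy for recovering the last context token.
    p_past = softmax(past_logits)
    past_decode_loss = -math.log(p_past[last_token])

    total = next_token_loss + pdr_weight * past_decode_loss
    return total, next_token_loss, past_decode_loss


# Toy usage: vocabulary of 3 tokens, embedding dimension 2.
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder = [[1.0, 0.0], [0.0, 1.0]]
total, nll, past = pdr_loss_sketch(
    [2.0, 0.5, -1.0], last_token=0, next_token=1,
    embedding=embedding, decoder=decoder, pdr_weight=0.1)
```

In this sketch the only extra parameters are those of `decoder`, which fits the abstract's claim of negligible parameter overhead. Setting `pdr_weight` to zero recovers the plain next-token loss.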
