TL;DR
This paper develops a theory linking LSTM memory gating to the power law decay of dependencies in language. It shows that trained models approximate a multi-timescale distribution, and that explicitly imposing this distribution during training improves performance and interpretability.
Contribution
It introduces a theoretical framework connecting LSTM forget gate biases to power law decay, and demonstrates the benefits of explicitly imposing the resulting timescale distribution when training language models.
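A minimal sketch of how a forget gate bias can be read as a timescale. It assumes the input-dependent part of the forget gate is zero, so the gate sits at sigmoid(b) and the cell state decays as exp(-t/T) with T = -1/ln(sigmoid(b)). This formula and the function names are illustrative assumptions; the summary only states that timescales are determined by the forget gate bias.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used by the LSTM forget gate."""
    return 1.0 / (1.0 + math.exp(-x))


def timescale_from_bias(b: float) -> float:
    """Timescale T such that sigmoid(b) ** t == exp(-t / T).

    Assumption: the forget gate equals sigmoid(b), i.e. the
    input-dependent terms of the gate are ignored.
    """
    return -1.0 / math.log(sigmoid(b))


def bias_from_timescale(T: float) -> float:
    """Inverse mapping: the bias whose gate value is exp(-1 / T)."""
    # logit(exp(-1/T)) simplifies to -log(exp(1/T) - 1).
    return -math.log(math.expm1(1.0 / T))


if __name__ == "__main__":
    for b in (0.0, 2.0, 5.0):
        print(f"bias={b:4.1f}  timescale={timescale_from_bias(b):8.2f}")
```

Under this assumption, larger biases give longer timescales, so fixing or initializing the bias is one way to pin a unit to a chosen timescale.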
Findings
In theory, LSTM unit timescales should follow an Inverse Gamma distribution; models trained on natural English text learn to approximate it.
Training with the theoretical distribution improves perplexity, especially for rare words.
Explicit multi-timescale modeling enhances interpretability of language models.
Abstract
Language models must capture statistical dependencies between words at timescales ranging from very short to very long. Earlier work has demonstrated that dependencies in natural language tend to decay with distance between words according to a power law. However, it is unclear how this knowledge can be used for analyzing or designing neural network language models. In this work, we derived a theory for how the memory gating mechanism in long short-term memory (LSTM) language models can capture power law decay. We found that unit timescales within an LSTM, which are determined by the forget gate bias, should follow an Inverse Gamma distribution. Experiments then showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution. Further, we found that explicitly imposing the theoretical distribution upon the model during training yielded…
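The link between Inverse Gamma timescales and power law decay can be checked numerically. The sketch below relies on a standard identity rather than on the paper's own derivation, which this summary does not reproduce: if T follows an Inverse Gamma distribution with shape alpha and scale beta, then 1/T is Gamma distributed, and the average of exp(-t/T) over units equals (1 + t/beta)^(-alpha), a power law in t. The values alpha = 1.5 and beta = 1.0 are illustrative assumptions, not the paper's fitted parameters.

```python
import math
import random


def sample_timescales(n, alpha, beta, seed=0):
    """Draw n timescales T ~ InverseGamma(shape=alpha, scale=beta)."""
    rng = random.Random(seed)
    # 1/T ~ Gamma(shape=alpha, rate=beta); gammavariate expects a
    # scale parameter, so the rate beta is passed as 1/beta.
    return [1.0 / rng.gammavariate(alpha, 1.0 / beta) for _ in range(n)]


def mixture_decay(timescales, t):
    """Average exponential decay exp(-t/T) across all units."""
    return sum(math.exp(-t / T) for T in timescales) / len(timescales)


def power_law(t, alpha, beta):
    """Closed-form expectation of exp(-t/T) under the Inverse Gamma."""
    return (1.0 + t / beta) ** (-alpha)


if __name__ == "__main__":
    alpha, beta = 1.5, 1.0  # illustrative values only
    ts = sample_timescales(100_000, alpha, beta)
    for t in (1, 10, 100):
        print(t, round(mixture_decay(ts, t), 4), round(power_law(t, alpha, beta), 4))
```

Each unit on its own forgets exponentially; it is only the population of units, with timescales spread out in this way, that yields the slower power law decay.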
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
