Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context
Urvashi Khandelwal, He He, Peng Qi, Dan Jurafsky

TL;DR
This paper investigates how LSTM language models use prior context. It finds that they rely most heavily on recent tokens and treat distant context as a rough semantic field, and that a neural cache model helps them copy words from the distant history.
Contribution
Through ablation studies on Penn Treebank and WikiText-2, the study measures how much context an LSTM language model uses and in what way, highlighting the importance of recent tokens and the role of the neural cache model.
Findings
Models use about 200 tokens of context on average.
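Nearby context (the most recent 50 tokens) is sharply distinguished from distant history, and the model is highly sensitive to word order within the most recent sentence.
Distant context (beyond 50 tokens) is modeled only as a rough semantic field or topic, with word order ignored; the neural cache model especially helps the LSTM copy words from this distant context.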
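The cache finding can be made concrete with a small sketch. It assumes the usual continuous-cache formulation: each past hidden state is scored against the current one, the scores are turned into a distribution over the words that followed those states, and the result is interpolated with the LM's own distribution. The function names and the parameters `theta` (score scaling) and `lam` (interpolation weight) are illustrative choices, not taken from the authors' code.

```python
import math


def cache_distribution(hidden_history, next_words, query, theta):
    """Distribution over words seen in the history (hypothetical helper).

    hidden_history[i] is a past hidden state and next_words[i] is the word
    that followed it. Each entry is weighted by exp(theta * <query, h_i>).
    """
    scores = [theta * sum(q * h for q, h in zip(query, vec))
              for vec in hidden_history]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dist = {}
    for word, weight in zip(next_words, weights):
        dist[word] = dist.get(word, 0.0) + weight / total
    return dist


def interpolate(p_vocab, p_cache, lam):
    """Mix the LM distribution with the cache distribution (weight lam)."""
    words = set(p_vocab) | set(p_cache)
    return {w: (1.0 - lam) * p_vocab.get(w, 0.0) + lam * p_cache.get(w, 0.0)
            for w in words}
```

With `theta = 0` every history entry gets equal weight, so the cache reduces to a unigram count over the history; larger values of `theta` concentrate mass on words whose preceding hidden state resembles the current one.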
Abstract
We know very little about how neural language models (LM) use prior linguistic context. In this paper, we investigate the role of context in an LSTM LM, through ablation studies. Specifically, we analyze the increase in perplexity when prior context words are shuffled, replaced, or dropped. On two standard datasets, Penn Treebank and WikiText-2, we find that the model is capable of using about 200 tokens of context on average, but sharply distinguishes nearby context (recent 50 tokens) from the distant history. The model is highly sensitive to the order of words within the most recent sentence, but ignores word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. We further find that the neural caching model (Grave et al., 2017b) especially helps the LSTM to copy words from within this distant context.…
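The ablations described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it perturbs only the tokens farther than `near` positions from the end of the context, and computes perplexity from per-token log-probabilities that a language model would supply. The function names, the `near` parameter, and the choice to replace tokens by sampling uniformly from a supplied `vocab` are assumptions made for this example.

```python
import math
import random


def perturb_context(tokens, near, mode, rng, vocab=None):
    """Shuffle, drop, or replace the tokens more than `near` positions back.

    The most recent `near` tokens are always left untouched.
    """
    if near < 0:
        raise ValueError("near must be non-negative")
    split = max(len(tokens) - near, 0)
    distant, nearby = list(tokens[:split]), list(tokens[split:])
    if mode == "shuffle":
        rng.shuffle(distant)          # destroys word order, keeps the words
    elif mode == "drop":
        distant = []                  # removes the distant context entirely
    elif mode == "replace":
        if not vocab:
            raise ValueError("replace mode needs a non-empty vocab")
        distant = [rng.choice(vocab) for _ in distant]  # assumed sampling scheme
    else:
        raise ValueError(f"unknown mode: {mode}")
    return distant + nearby


def perplexity(log_probs):
    """Perplexity from natural-log probabilities of the target tokens."""
    return math.exp(-sum(log_probs) / len(log_probs))


def perplexity_increase(base_log_probs, perturbed_log_probs):
    """How much perplexity rises when the context is perturbed."""
    return perplexity(perturbed_log_probs) - perplexity(base_log_probs)
```

In use, one would score the same target tokens twice, once with the original context and once with the output of `perturb_context`, and compare the two with `perplexity_increase`. An increase near zero for a given `near` would suggest that the perturbed property of the context beyond that distance is not being used.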
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
