Do Long-Range Language Models Actually Use Long-Range Context?
Simeng Sun, Kalpesh Krishna, Andrew Mattarella-Micke, Mohit Iyyer

TL;DR
This paper investigates whether long-range Transformer language models effectively utilize extended context, revealing limited benefits beyond copying and domain-specific advantages, especially in literary texts.
Contribution
The study provides a detailed analysis of long-range Transformer models, showing their limited use of extended context and highlighting domain differences in long-range context utility.
Findings
Long-range context improves predictions mainly for copying tokens.
Models do not benefit from long context for sentence-level tasks.
Long-range context is most helpful for literary novels, less so for textbooks or magazines.
Abstract
Language models are generally trained on short, truncated input sequences, which limits their ability to use discourse-level information present in long-range context to improve their predictions. Recent efforts to improve the efficiency of self-attention have led to a proliferation of long-range Transformer language models, which can process much longer sequences than models of the past. However, the ways in which such models take advantage of the long-range context remain unclear. In this paper, we perform a fine-grained analysis of two long-range Transformer language models (including the \emph{Routing Transformer}, which achieves state-of-the-art perplexity on the PG-19 long-sequence LM benchmark dataset) that accept input sequences of up to 8K tokens. Our results reveal that providing long-range context (i.e., beyond the previous 2K tokens) to these models only improves their…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
MethodsAttention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Dense Connections · Label Smoothing · Multi-Head Attention · Byte Pair Encoding · Softmax
