Do Long-Range Language Models Actually Use Long-Range Context?

Simeng Sun; Kalpesh Krishna; Andrew Mattarella-Micke; Mohit Iyyer

arXiv:2109.09115 · cs.CL · September 21, 2021 · 1 citation

Open Access

TL;DR

This paper investigates whether long-range Transformer language models actually use extended context. It finds that context beyond the previous 2K tokens helps little apart from tokens that can be copied from the distant context, and that the benefit varies by domain, being largest for literary novels.

Contribution

The study provides a fine-grained analysis of two long-range Transformer language models that accept inputs of up to 8K tokens, showing that they make limited use of the extended context and that the usefulness of long-range context differs across domains.

Findings

01. Long-range context improves predictions mainly for tokens that can be copied from the distant context.

02. Models gain nothing from long-range context on sentence-level prediction tasks.

03. Long-range context helps most for literary novels and less for textbooks or magazines.
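To make the first finding concrete, here is a minimal sketch of the kind of measurement involved: hold the target tokens fixed, vary how many preceding tokens the model may see, and compare perplexity. This is not the authors' code. The model (`toy_next_token_probs`), the copy bonus, and the token sequence are all hypothetical stand-ins chosen so the example runs without any trained model.

```python
import math


def toy_next_token_probs(context, vocab, copy_bonus=3.0):
    """Hypothetical stand-in for a language model (an assumption, not a real LM).

    Every vocabulary item gets weight 1; items already present in the
    context get an extra copy_bonus. Weights are normalized to sum to 1.
    """
    seen = set(context)
    weights = {tok: 1.0 + (copy_bonus if tok in seen else 0.0) for tok in vocab}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def perplexity_with_prefix(tokens, target_start, prefix_len, vocab):
    """Perplexity of tokens[target_start:] when each prediction may see
    at most prefix_len preceding tokens."""
    nll = 0.0
    count = 0
    for i in range(target_start, len(tokens)):
        context = tokens[max(0, i - prefix_len):i]
        probs = toy_next_token_probs(context, vocab)
        nll -= math.log(probs[tokens[i]])
        count += 1
    return math.exp(nll / count)


# Illustrative data: a name appears once, far back, and again as the target.
vocab = ["alice", "bob", "the"]
tokens = ["alice"] + ["the"] * 10 + ["alice"]
target_start = len(tokens) - 1

ppl_short = perplexity_with_prefix(tokens, target_start, prefix_len=4, vocab=vocab)
ppl_long = perplexity_with_prefix(tokens, target_start, prefix_len=11, vocab=vocab)
print(f"short context: {ppl_short:.2f}, long context: {ppl_long:.2f}")
```

In this toy setup the longer prefix lowers perplexity only because the target token can be copied from the distant context, which is the pattern the first finding describes. A real study would replace the toy function with a trained long-range model and average over many target positions.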

Abstract

Language models are generally trained on short, truncated input sequences, which limits their ability to use discourse-level information present in long-range context to improve their predictions. Recent efforts to improve the efficiency of self-attention have led to a proliferation of long-range Transformer language models, which can process much longer sequences than models of the past. However, the ways in which such models take advantage of the long-range context remain unclear. In this paper, we perform a fine-grained analysis of two long-range Transformer language models (including the Routing Transformer, which achieves state-of-the-art perplexity on the PG-19 long-sequence LM benchmark dataset) that accept input sequences of up to 8K tokens. Our results reveal that providing long-range context (i.e., beyond the previous 2K tokens) to these models only improves their…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Dense Connections · Label Smoothing · Multi-Head Attention · Byte Pair Encoding · Softmax