Simple Local Attentions Remain Competitive for Long-Context Tasks

Wenhan Xiong; Barlas Oğuz; Anchit Gupta; Xilun Chen; Diana Liskovich; Omer Levy; Wen-tau Yih; Yashar Mehdad

arXiv:2112.07210·cs.CL·May 5, 2022

TL;DR

This study compares efficient long-range attention variants for NLP in large-scale, controlled pretrain-and-finetune experiments. It finds that none of the tested variants beats simple local window attention on practical long-context tasks, which has implications for model efficiency.

Contribution

The paper provides a large-scale, controlled experimental analysis of long-range attention variants, showing that simple local attention can match or outperform more complex methods on real-world tasks.

Findings

01. Simple local window attention matches Longformer performance.

02. Overlap in attention windows is unnecessary for good results.

03. Complex attention variants do not outperform simple local attention.
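
As a rough illustration of the second finding, the sketch below contrasts an overlapping sliding-window pattern with disjoint, non-overlapping blocks by building boolean attention masks in plain Python. The function names, the `window` and `block` parameters, and the dense-mask representation are assumptions made for this example; they are not taken from the paper or from fairseq, and an efficient implementation would avoid materializing a full mask.

```python
from typing import List

Mask = List[List[bool]]


def sliding_window_mask(seq_len: int, window: int) -> Mask:
    """Overlapping pattern: token i may attend to token j when |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def blockwise_mask(seq_len: int, block: int) -> Mask:
    """Non-overlapping pattern: token i may attend to token j only inside the same block."""
    return [[i // block == j // block for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask: Mask) -> int:
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    # Hypothetical toy sizes, chosen only to keep the masks readable.
    sliding = sliding_window_mask(seq_len=6, window=1)
    blocks = blockwise_mask(seq_len=6, block=2)
    print(attended_pairs(sliding), attended_pairs(blocks))
```

With six tokens, a one-token sliding window allows 16 query-key pairs, while blocks of two allow 12, because attention never crosses a block boundary.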

Abstract

Many NLP tasks require processing long contexts beyond the length limit of pretrained models. In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed. Despite the abundance of research along this direction, it is still difficult to gauge the relative effectiveness of these models in practical use cases, e.g., if we apply these models following the pretrain-and-finetune paradigm. In this work, we aim to conduct a thorough analysis of these emerging models with large-scale and controlled experiments. For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks. Our findings reveal pitfalls of an existing widely-used long-range benchmark and show none of the tested efficient attentions can beat a simple local window attention under…

Code & Models

Repositories

pytorch/fairseq (pytorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications