Simple Local Attentions Remain Competitive for Long-Context Tasks
Wenhan Xiong, Barlas Oğuz, Anchit Gupta, Xilun Chen, Diana Liskovich, Omer Levy, Wen-tau Yih, Yashar Mehdad

TL;DR
This study compares long-range attention variants for NLP in large-scale, controlled experiments and finds that simple local window attention matches or outperforms the more complex variants on practical long-context tasks, which has implications for model efficiency.
Contribution
It provides a large-scale, controlled experimental analysis of long-range attention variants, showing local attention can match or outperform complex methods in real-world tasks.
Findings
Simple local window attention matches Longformer performance.
Overlap in attention windows is unnecessary for good results.
Complex attention variants do not outperform simple local attention.
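The two local patterns contrasted in the findings can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the dense boolean masks, and the toy sizes are assumptions chosen for readability. A dense mask also gives none of the memory savings of a real local-attention kernel; it only shows which tokens may attend to which.

```python
import numpy as np


def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Overlapping pattern: token i may attend to token j when |i - j| <= window // 2."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2


def block_local_mask(seq_len: int, block: int) -> np.ndarray:
    """Non-overlapping pattern: tokens attend only within their own fixed block."""
    block_ids = np.arange(seq_len) // block
    return block_ids[:, None] == block_ids[None, :]


def masked_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed positions get -inf so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; every row keeps its diagonal,
    # so the maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, dim = 8, 4  # toy sizes, chosen only for illustration
    q, k, v = (rng.normal(size=(seq_len, dim)) for _ in range(3))
    out_block = masked_attention(q, k, v, block_local_mask(seq_len, block=4))
    out_slide = masked_attention(q, k, v, sliding_window_mask(seq_len, window=4))
    print(out_block.shape, out_slide.shape)
```

With the block mask, tokens in one block never see keys or values from another block, which is the "no overlap" setting the findings refer to; the sliding-window mask lets neighbouring windows share tokens.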
Abstract
Many NLP tasks require processing long contexts beyond the length limit of pretrained models. In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed. Despite the abundance of research along this direction, it is still difficult to gauge the relative effectiveness of these models in practical use cases, e.g., if we apply these models following the pretrain-and-finetune paradigm. In this work, we aim to conduct a thorough analysis of these emerging models with large-scale and controlled experiments. For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks. Our findings reveal pitfalls of an existing widely-used long-range benchmark and show none of the tested efficient attentions can beat a simple local window attention under…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
