Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling

Xiang Hu; Zhihao Teng; Jun Zhao; Wei Wu; Kewei Tu

arXiv:2410.01651·cs.CL·June 13, 2025

TL;DR

This paper introduces Grouped Cross Attention (GCA), a novel attention mechanism that enables Transformers to handle extremely long contexts efficiently by learning to retrieve relevant past chunks, significantly reducing computational costs.

Contribution

The paper presents GCA, an attention method based on dynamic context that generalizes to 1000 times the pre-training context length. Its retriever is learned rather than off-the-shelf, which improves access to long-range information.

Findings

1. GCA achieves near-perfect retrieval accuracy at a context length of 16M.

2. GCA keeps the attention window size constant while still accessing distant information.

3. Models with GCA significantly reduce computational and memory costs.

Abstract

Despite the success of Transformers, handling long contexts remains challenging due to the limited length generalization and quadratic complexity of self-attention. Thus Transformers often require post-training with a larger attention window, significantly increasing computational and memory costs. In this paper, we propose a novel attention mechanism based on dynamic context, Grouped Cross Attention (GCA), which can generalize to 1000 times the pre-training context length while maintaining the ability to access distant information with a constant attention window size. For a given input sequence, we split it into chunks and use each chunk to retrieve top-k relevant past chunks for subsequent text generation. Specifically, unlike most previous works that use an off-the-shelf retriever, our key innovation allows the retriever to learn how to retrieve past chunks that better minimize the…
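The chunk-and-retrieve step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`split_into_chunks`, `causal_top_k`), the fixed chunk vectors, and the plain dot-product score are all assumptions made for illustration, whereas in the paper the retriever is learned jointly with the language model. The sketch only shows the causal constraint, namely that each chunk may retrieve its top-k matches from strictly earlier chunks.

```python
from typing import List, Sequence


def split_into_chunks(tokens: Sequence, chunk_size: int) -> List[list]:
    """Split a token sequence into consecutive chunks (the last may be shorter)."""
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used here as a stand-in relevance score (assumption)."""
    return sum(x * y for x, y in zip(a, b))


def causal_top_k(chunk_vecs: List[List[float]], k: int) -> List[List[int]]:
    """For each chunk i, return indices of the k highest-scoring chunks j < i.

    chunk_vecs holds one hypothetical vector per chunk; how such vectors
    are produced is outside the scope of this sketch.
    """
    retrieved = []
    for i, query in enumerate(chunk_vecs):
        past = range(i)  # causal: only strictly earlier chunks are candidates
        ranked = sorted(past, key=lambda j: dot(query, chunk_vecs[j]), reverse=True)
        retrieved.append(ranked[:k])
    return retrieved


chunks = split_into_chunks(list(range(10)), 4)
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.0, 2.0]]
picks = causal_top_k(vecs, k=2)
```

In this toy setup the first chunk retrieves nothing, since it has no past. The number of retrieved chunks per step is capped at k regardless of sequence length, which is the sense in which the attention window stays constant.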

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need