TL;DR
This paper dissects chunk-based sparse attention models to identify key architectural principles that enable effective length generalization in language models, achieving state-of-the-art results in extreme context extrapolation.
Contribution
It systematically uncovers three core design principles crucial for length generalization in sparse attention models, guiding future development of long-context language models.
Findings
Identified three critical architectural components for length generalization.
Achieved state-of-the-art length extrapolation from a 4K-token context to 32 million tokens.
Provided theoretical insights into intra-chunk information processing.
Abstract
Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce…
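To make the general paradigm concrete, the following is a minimal sketch of chunk-based sparse attention for a single query, not the paper's architecture. The function name `chunk_sparse_attention`, the parameters `chunk_size` and `top_k`, and the top-k selection rule are all hypothetical choices for illustration. In particular, the chunk summary here is a plain mean of key vectors, a stand-in for the learned, non-linear Chunk Encoder with a CLS token that the abstract describes.

```python
import math


def chunk_sparse_attention(query, keys, values, chunk_size=4, top_k=2):
    """Attend from one query vector over only the top_k highest-scoring chunks.

    Illustrative sketch only: names, defaults, and the mean-pooled chunk
    summary are assumptions, not the method from the paper.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    dim = len(query)

    # Split token positions into contiguous fixed-size chunks.
    chunks = [list(range(start, min(start + chunk_size, len(keys))))
              for start in range(0, len(keys), chunk_size)]

    # Stand-in chunk summary: the mean of each chunk's key vectors.
    # (Assumption: a learned chunk encoder would replace this step.)
    summaries = [[sum(keys[i][d] for i in idx) / len(idx) for d in range(dim)]
                 for idx in chunks]

    # Score every chunk against the query and keep the top_k chunks.
    scores = [dot(query, summary) for summary in summaries]
    selected = sorted(range(len(chunks)), key=lambda c: scores[c], reverse=True)[:top_k]
    selected.sort()  # restore original chunk order
    token_ids = [i for c in selected for i in chunks[c]]

    # Scaled dot-product softmax attention over the selected tokens only.
    logits = [dot(query, keys[i]) / math.sqrt(dim) for i in token_ids]
    max_logit = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - max_logit) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]

    out_dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, token_ids))
              for d in range(out_dim)]
    return output, selected
```

In this sketch, the per-query cost depends on the number of chunks plus `top_k * chunk_size` attended tokens rather than on the full sequence length, which is the property that makes the chunk-based approach attractive for very long contexts.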