TL;DR
This paper introduces a learnable, coarse-to-fine sparse attention mechanism built on gist compression tokens. It improves long-context processing in language models and outperforms other compression baselines on LongBench and RAG benchmarks.
Contribution
It proposes a novel end-to-end trainable framework combining gist compression and selective unfolding for efficient long-context attention.
Findings
Outperforms other compression baselines on LongBench and RAG benchmarks.
Achieves logarithmic complexity in multi-resolution context access.
Effective at compression ratios from 8x to 32x.
Abstract
Scaling large language models to long contexts is challenging due to the quadratic computational cost of full attention. Mitigation approaches include KV-cache selection or compression techniques. We instead provide an effective and end-to-end learnable bridge between the two without requiring architecture modification. In particular, our key insight is that interleaved gist compression tokens -- which provide a learnable summary of sets of raw tokens -- can serve as routing signals for sparse attention. Building on this, we introduce selective unfolding via GSA, which first compresses the context into gist tokens, then selects the most relevant gists, and subsequently restores the corresponding raw chunks for detailed attention. This yields a simple coarse-to-fine mechanism that combines compact global representations with targeted access to fine-grained evidence. We further…
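The compress, select, and unfold steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes mean pooling as a stand-in for the learned gist tokens, dot-product scoring as the routing signal, and a single query vector whose keys double as values. The function name `coarse_to_fine_attention` and the parameters `chunk_size` and `top_k` are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def coarse_to_fine_attention(query, tokens, chunk_size, top_k):
    # 1. Compress: split the context into chunks and summarize each one.
    #    ASSUMPTION: mean pooling replaces the learned gist tokens.
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    gists = [mean_vector(chunk) for chunk in chunks]

    # 2. Select: score each gist against the query and keep the top_k chunks.
    #    ASSUMPTION: a plain dot product serves as the routing score.
    scores = [dot(query, g) for g in gists]
    ranked = sorted(range(len(gists)), key=lambda i: scores[i], reverse=True)
    selected = sorted(ranked[:top_k])

    # 3. Unfold: keep every gist as coarse global context and add the raw
    #    tokens of the selected chunks for fine-grained attention.
    keys = list(gists)
    for i in selected:
        keys.extend(chunks[i])

    # 4. Attend over the mixed-resolution context (keys double as values).
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    output = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(len(query))]
    return selected, len(keys), output


# Toy context: 8 two-dimensional tokens, compressed 2x into 4 gists.
toy_tokens = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0],
              [0.0, -1.0], [0.0, -1.0], [-1.0, 0.0], [-1.0, 0.0]]
selected, n_keys, output = coarse_to_fine_attention([1.0, 0.0], toy_tokens, chunk_size=2, top_k=1)
```

In this toy setup the query attends to 4 gists plus the 2 raw tokens of the one selected chunk, rather than to all 8 raw tokens. At the 8x to 32x compression ratios reported above, the saving from attending mostly to gists would be correspondingly larger.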
