Why Attend to Everything? Focus is the Key
Hengshuai Yao, Xing Chen, Ahmed Murtadha, Jin Li, Yasin Abbasi Yadkori, Shuai Shao, Changling Liu, Guan Wang, Mingli Yuan, William Chen, Sen Song

TL;DR
Focus introduces a learnable gating mechanism, built on centroids, that selects which token pairs attend to each other. This makes attention in pretrained models more efficient, with no degradation in quality and significant speedups.
Contribution
The paper presents Focus, a method that adds a small number of learnable parameters to a pretrained model to make attention more efficient while maintaining or improving performance.
Findings
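Composed onto pretrained models, Focus shows zero degradation on downstream benchmarks across model sizes from 124M to 70B parameters and five attention architectures.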
Sparse Focus attention outperforms full attention at 124M scale.
Focus achieves up to 8.6x speedup with FlashAttention decomposition.
Abstract
Standard attention scales quadratically with sequence length. Efficient attention methods reduce this O(n^2) cost, but when retrofitted into pretrained models, they often degrade perplexity, downstream accuracy, or both. We introduce Focus, a method that learns which token pairs matter. Focus adds a small set of learnable centroids--as few as 148K parameters per layer--that act as gates: only token pairs belonging to the same centroid group attend to each other over long ranges. Focus is composable: it can be added to any pretrained model by training only the centroids while keeping all original weights frozen. Experiments show that composing Focus onto pretrained models yields zero degradation on downstream benchmarks across model sizes from 124M to 70B parameters and five attention architectures. Surprisingly, sparse Focus attention outperforms full attention at 124M scale (30.3 vs.…
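The gating idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The assignment rule (hard argmax over dot products with the centroids), the local `window` size, the causal mask, and the names `focus_mask` and `gated_attention` are all assumptions made for illustration. The sketch shows only the inference-time gating pattern: nearby tokens always attend to each other, and long-range pairs attend only when both tokens fall in the same centroid group.

```python
import numpy as np

def focus_mask(x, centroids, window=2):
    """Boolean (n, n) mask: True where query i may attend to key j.

    Hypothetical gating rule, not taken from the paper:
    - each token is hard-assigned to its highest-scoring centroid;
    - pairs closer than `window` positions are always allowed;
    - longer-range pairs are allowed only within the same group.
    """
    n = x.shape[0]
    groups = np.argmax(x @ centroids.T, axis=1)      # (n,) group id per token
    i = np.arange(n)[:, None]                        # query positions
    j = np.arange(n)[None, :]                        # key positions
    causal = j <= i                                  # no attention to the future
    local = (i - j) < window                         # short-range pairs
    same_group = groups[:, None] == groups[None, :]  # long-range gate
    return causal & (local | same_group)

def gated_attention(q, k, v, mask):
    """Softmax attention restricted to the pairs where mask is True."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)         # gated pairs get zero weight
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy example with random data (sizes are arbitrary).
rng = np.random.default_rng(0)
n, d, num_centroids = 8, 4, 3
x = rng.normal(size=(n, d))
centroids = rng.normal(size=(num_centroids, d))

mask = focus_mask(x, centroids)
out = gated_attention(x, x, x, mask)
```

Because the diagonal of the mask is always allowed, every query has at least one key to attend to, so the softmax is well defined. A dense mask like this one does not by itself save compute. It only shows which pairs are kept, and any speedup would depend on a kernel that skips the masked pairs.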
