Learning to Focus: Focal Attention for Selective and Scalable Transformers
Dhananjay Ram, Wei Xia, Stefano Soatto

TL;DR
Focal Attention sharpens the attention distributions of transformer models so that they focus on relevant tokens. This improves scalability, yields significant gains on long-context tasks, and reaches the same accuracy with fewer parameters or less training data.
Contribution
This paper introduces Focal Attention, a method that sharpens attention distributions in transformers by controlling the softmax temperature, improving scalability and performance, especially on long-context tasks.
Findings
Achieves the same accuracy with up to 42% fewer parameters.
Reduces training data requirements by up to 33%.
Improves long-context task performance by 17% to 82%.
Abstract
Attention is a core component of the transformer architecture, whether in encoder-only, decoder-only, or encoder-decoder models. However, standard softmax attention often produces a noisy probability distribution, which can impair effective feature selection at every layer of these models, particularly for long contexts. We propose Focal Attention, a simple yet effective modification that sharpens the attention distribution by controlling the softmax temperature, either as a fixed hyperparameter or as a learnable parameter during training. This sharpening enables the model to concentrate on the most relevant tokens while suppressing irrelevant ones. Empirically, Focal Attention scales more favorably than the standard transformer with respect to model size, training data, and context length. Across diverse benchmarks, it achieves the same accuracy with up to 42% fewer parameters or 33% less…
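The sharpening idea described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the temperature enters as an extra factor `temperature` in the usual scaled dot-product denominator, so the scores become `q @ k.T / (temperature * sqrt(d))`; the exact parameterization used in the paper is not given in this summary, and the function names, the value `0.4`, and the random inputs are all hypothetical choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(q, k, v, temperature=1.0):
    # Assumption: temperature multiplies the standard sqrt(d) scaling.
    # temperature = 1.0 recovers ordinary scaled dot-product attention;
    # temperature < 1.0 sharpens the attention distribution.
    d = q.shape[-1]
    scores = q @ k.T / (temperature * np.sqrt(d))
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def entropy(p):
    # Shannon entropy of each row; lower entropy means a sharper distribution.
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16))   # 4 query tokens, hypothetical head dim 16
k = rng.normal(size=(8, 16))   # 8 key tokens
v = rng.normal(size=(8, 16))

out_standard, w_standard = attention(q, k, v, temperature=1.0)
out_focal, w_focal = attention(q, k, v, temperature=0.4)

print("mean entropy, standard:", entropy(w_standard).mean())
print("mean entropy, sharpened:", entropy(w_focal).mean())
```

Lowering the temperature leaves the ranking of keys for each query unchanged but moves probability mass toward the highest-scoring keys, which is why the per-query entropy drops. In the learnable variant mentioned in the abstract, `temperature` would be a trained parameter rather than a constant; this sketch covers only the fixed case.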
Taxonomy
Topics: Advanced Neural Network Applications · Neural Networks and Reservoir Computing · Parallel Computing and Optimization Techniques
