Transformers Learn Faster with Semantic Focus
Parikshit Ram, Kenneth L. Clarkson, Tim Klinger, Shashanka Ubaru, Alexander G. Gray

TL;DR
This paper studies input-dependent sparse attention in transformers in terms of learnability and generalization rather than efficiency. Empirically, models that use it appear to converge faster and generalize better than standard attention models, and the paper backs this observation with a theoretical analysis of "semantic focus".
Contribution
The paper offers a theoretical account of why input-dependent sparse attention improves learning, contrasts it with input-agnostic sparse attention (which shows no such benefit), and supports the account with empirical results.
Findings
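Input-dependent sparse attention models appear to converge faster than standard attention models.
Concentrating semantic focus this way also appears to improve generalization; input-agnostic sparse attention shows neither benefit.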
Theoretical analysis links stability to convergence.
Abstract
Various forms of sparse attention have been explored to mitigate the quadratic computational and memory cost of the attention mechanism in transformers. We study sparse transformers not through a lens of efficiency but rather in terms of learnability and generalization. Empirically studying a range of attention mechanisms, we find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models, while input-agnostic sparse attention models show no such benefits -- a phenomenon that is robust across architectural and optimization hyperparameter choices. This can be interpreted as demonstrating that concentrating a model's "semantic focus" with respect to the tokens currently being considered (in the form of input-dependent sparse attention) accelerates learning. We develop a theoretical characterization of the conditions that…
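To make the distinction between input-dependent and input-agnostic sparsity concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: top-k masking stands in for input-dependent sparse attention and a fixed banded window stands in for input-agnostic sparse attention. The function names `topk_attention` and `banded_attention` and the parameters `k` and `width` are hypothetical.

```python
import numpy as np


def softmax(scores, axis=-1):
    # Numerically stable softmax; entries set to -inf receive zero weight.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_scores(Q, K):
    # Scaled dot-product scores, shape (num_queries, num_keys).
    return Q @ K.T / np.sqrt(Q.shape[-1])


def topk_attention(Q, K, V, k):
    """Input-dependent sparsity (assumed variant): each query keeps only
    its k highest-scoring keys, so the mask changes with the input."""
    scores = attention_scores(Q, K)
    # k-th largest score in each row; ties could keep more than k keys.
    kth = np.partition(scores, -k, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    weights = softmax(masked)
    return weights @ V, weights


def banded_attention(Q, K, V, width):
    """Input-agnostic sparsity (assumed variant): each query attends to
    keys within a fixed positional window, whatever the content is."""
    n = Q.shape[0]
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= width
    masked = np.where(mask, attention_scores(Q, K), -np.inf)
    weights = softmax(masked)
    return weights @ V, weights
```

Both functions return the attended values together with the attention weights, so the sparsity pattern can be inspected with `weights > 0`. In the banded case that pattern is fixed by position alone, while in the top-k case it is recomputed from the query-key scores for every input.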
Taxonomy
Topics: Neural Networks and Applications · Machine Learning in Materials Science · Topic Modeling
Methods: Softmax · Focus
