Adaptively Sparse Transformers
Gonçalo M. Correia, Vlad Niculae, André F.T. Martins

TL;DR
This paper introduces the adaptively sparse Transformer, which uses a differentiable sparsity-inducing attention mechanism to improve interpretability and diversity of attention heads without sacrificing accuracy.
Contribution
It proposes a novel attention mechanism using $\alpha$-entmax with learnable sparsity, enabling context-dependent sparse attention in Transformers.
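For reference, the mapping named above is usually written as follows. This is a sketch of the standard formulation, not text from this page; the symbols $\mathbf{z}$ (attention scores), $\mathbf{p}$ (attention weights), $d$ (number of context words), and $\tau$ (a normalizing threshold) are notation assumed here.

```latex
\alpha\text{-entmax}(\mathbf{z})
  = \operatorname*{arg\,max}_{\mathbf{p} \in \triangle^{d}}
    \; \mathbf{p}^{\top}\mathbf{z} + H^{T}_{\alpha}(\mathbf{p}),
\qquad
H^{T}_{\alpha}(\mathbf{p}) =
\begin{cases}
  \dfrac{1}{\alpha(\alpha-1)} \sum_{j} \bigl(p_j - p_j^{\alpha}\bigr), & \alpha \neq 1,\\[6pt]
  -\sum_{j} p_j \log p_j, & \alpha = 1.
\end{cases}

% For \alpha > 1 the solution has a thresholded form,
% with \tau chosen so that the entries sum to one:
\alpha\text{-entmax}(\mathbf{z})
  = \bigl[(\alpha-1)\,\mathbf{z} - \tau\mathbf{1}\bigr]_{+}^{1/(\alpha-1)}
```

Setting $\alpha = 1$ recovers softmax and $\alpha = 2$ recovers sparsemax. For any $\alpha > 1$ the clipping $[\cdot]_{+}$ is what lets low-scoring words receive exactly zero weight.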
Findings
Heads learn different sparsity preferences across layers.
Sparsity enhances interpretability and head diversity.
No accuracy loss compared to softmax Transformers.
Abstract
Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word relationships. However, with standard softmax attention, all attention heads are dense, assigning a non-zero weight to all context words. In this work, we introduce the adaptively sparse Transformer, wherein attention heads have flexible, context-dependent sparsity patterns. This sparsity is accomplished by replacing softmax with $\alpha$-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight. Moreover, we derive a method to automatically learn the $\alpha$ parameter -- which controls the shape and sparsity of $\alpha$-entmax -- allowing attention heads to choose between focused or spread-out…
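To make the "precisely zero weight" behaviour concrete, here is a minimal NumPy sketch that computes $\alpha$-entmax for a single score vector by bisection on the normalizing threshold. It is an illustration under stated assumptions, not the authors' implementation: the function name `entmax_bisect` and its parameters are hypothetical, $\alpha$ is fixed rather than learned, and only the case $\alpha > 1$ is handled.

```python
import numpy as np


def entmax_bisect(z, alpha=1.5, n_iter=50):
    """Approximate alpha-entmax of a 1-D score vector (alpha > 1 only).

    Hypothetical helper for illustration. It searches for the threshold tau
    such that p_j = max((alpha - 1) * z_j - tau, 0) ** (1 / (alpha - 1))
    sums to one.
    """
    z = np.asarray(z, dtype=float)
    if alpha <= 1.0:
        raise ValueError("this sketch only covers alpha > 1")

    zs = (alpha - 1.0) * z
    exponent = 1.0 / (alpha - 1.0)

    # At tau = max(zs) - 1 the largest entry alone already gives sum >= 1;
    # at tau = max(zs) - d**(1 - alpha) every entry is <= 1/d, so sum <= 1.
    lo = zs.max() - 1.0
    hi = zs.max() - z.size ** (1.0 - alpha)

    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        total = (np.clip(zs - tau, 0.0, None) ** exponent).sum()
        if total >= 1.0:
            lo = tau  # threshold too low: too much mass
        else:
            hi = tau  # threshold too high: too little mass

    p = np.clip(zs - lo, 0.0, None) ** exponent
    return p / p.sum()  # remove the residual bisection error


scores = np.array([1.0, 0.5, -2.0])
sparse_weights = entmax_bisect(scores, alpha=1.5)
sparsemax_weights = entmax_bisect(scores, alpha=2.0)
softmax_weights = np.exp(scores) / np.exp(scores).sum()
```

In this example the lowest-scoring entry gets a weight of exactly zero under both `alpha=1.5` and `alpha=2.0`, whereas `softmax_weights` is strictly positive everywhere. In the paper, $\alpha$ is a learned parameter rather than a fixed argument as it is here.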
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Interpretability · Adaptively Sparse Transformer · Residual Connection · Attention Dropout · Linear Warmup With Cosine Annealing
