Adaptively Sparse Transformers

Gonçalo M. Correia; Vlad Niculae; André F.T. Martins

arXiv:1909.00015 · cs.CL · September 9, 2019 · 19 cites

TL;DR

This paper introduces the adaptively sparse Transformer, which uses a differentiable sparsity-inducing attention mechanism to improve interpretability and diversity of attention heads without sacrificing accuracy.

Contribution

It proposes a novel attention mechanism using $α$-entmax with learnable sparsity, enabling context-dependent sparse attention in Transformers.

Findings

1. Heads learn different sparsity preferences across layers.

2. Sparsity enhances interpretability and head diversity.

3. No accuracy loss compared to softmax Transformers.

Abstract

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word relationships. However, with standard softmax attention, all attention heads are dense, assigning a non-zero weight to all context words. In this work, we introduce the adaptively sparse Transformer, wherein attention heads have flexible, context-dependent sparsity patterns. This sparsity is accomplished by replacing softmax with $α$-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight. Moreover, we derive a method to automatically learn the $α$ parameter -- which controls the shape and sparsity of $α$-entmax -- allowing attention heads to choose between focused or spread-out…
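To make the sparsity claim concrete, here is a minimal sketch of $α$-entmax for $α > 1$. It assumes the usual closed form, in which each output is $[(α-1)z_i - τ]_+^{1/(α-1)}$ with the threshold $τ$ chosen so the outputs sum to 1, and it finds $τ$ by plain bisection. The function name `entmax_bisect` and its parameters are illustrative choices, not taken from the page above, and this is not presented as the authors' implementation.

```python
def entmax_bisect(scores, alpha=1.5, n_iter=60):
    """Sketch of alpha-entmax over a list of floats (requires alpha > 1).

    Assumed form: p_i = max((alpha - 1) * z_i - tau, 0) ** (1 / (alpha - 1)),
    with tau found by bisection so that the p_i sum to 1.
    """
    if alpha <= 1.0:
        raise ValueError("this sketch only handles alpha > 1")

    scaled = [(alpha - 1.0) * s for s in scores]
    exponent = 1.0 / (alpha - 1.0)

    def probs(tau):
        # Entries at or below the threshold are clipped to exactly zero.
        return [max(x - tau, 0.0) ** exponent for x in scaled]

    # The total mass decreases as tau grows:
    #   tau = max(scaled)     -> every entry is 0, total mass 0
    #   tau = max(scaled) - 1 -> the largest entry alone is 1, total mass >= 1
    hi = max(scaled)
    lo = hi - 1.0
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if sum(probs(mid)) >= 1.0:
            lo = mid
        else:
            hi = mid

    p = probs(lo)
    total = sum(p)
    # Renormalise to absorb the small residual bisection error.
    return [v / total for v in p]
```

Calling `entmax_bisect([3.0, 0.0, 0.0], alpha=1.5)` puts all the weight on the first entry and gives the other two exactly zero, whereas softmax would leave them small but positive. With `alpha=2.0` the same routine behaves like sparsemax, and as `alpha` approaches 1 the output becomes denser and closer to softmax.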

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Interpretability · Adaptively Sparse Transformer · Residual Connection · Attention Dropout · Linear Warmup With Cosine Annealing