Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun

TL;DR
The paper introduces Explicit Sparse Transformer, a Transformer variant that concentrates attention on the most relevant segments of the context through explicit selection, improving model performance on NLP and vision tasks.
Contribution
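It proposes an explicit selection mechanism for sparse attention: only the most relevant segments of the context receive attention weight, which sharpens the model's focus and gives faster inference than earlier sparse attention methods such as sparsemax.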
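The selection step can be illustrated with a minimal sketch, assuming the mechanism is top-k selection over scaled dot-product scores: for each query, keep the k largest scores, mask the rest to negative infinity, and apply softmax. The function names `topk_sparse_softmax` and `sparse_attention`, the single-query layout, and the tie-handling rule are assumptions made for illustration, not the authors' code.

```python
import math


def topk_sparse_softmax(scores, k):
    """Softmax over one row of scores, keeping only the k largest entries.

    Illustrative sketch: entries below the k-th largest score get weight 0.
    Ties at the threshold are all kept, so more than k entries can survive.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    # The k-th largest score acts as the selection threshold.
    threshold = sorted(scores, reverse=True)[k - 1]
    # Masked entries become -inf, so exp() maps them to exactly 0.0.
    masked = [s if s >= threshold else -math.inf for s in scores]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention(query, keys, values, k):
    """Single-query attention using the top-k softmax above (hypothetical helper)."""
    d = len(query)
    # Scaled dot-product score between the query and every key.
    scores = [
        sum(q * x for q, x in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = topk_sparse_softmax(scores, k)
    dim = len(values[0])
    # Weighted sum of value vectors; masked positions contribute nothing.
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights
```

With k equal to the number of keys, this reduces to ordinary softmax attention, so k controls how concentrated the attention is.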
Findings
Improves model performance on NLP and vision tasks, including neural machine translation, image captioning, and language modeling.
Runs inference about twice as fast as sparsemax.
Achieves comparable or better results than existing sparse attention methods.
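For context on the sparsemax baseline named above, here is a minimal sketch of the standard sparsemax projection (Martins and Astudillo, 2016) for a single row of scores. It is a reference implementation written for illustration, not code from the paper, and it makes no claim about why the speed difference arises. Unlike a fixed top-k cut, sparsemax derives its threshold `tau` from the sorted scores and their running sum.

```python
def sparsemax(scores):
    """Project one row of scores onto the probability simplex (sparsemax).

    Illustrative sketch: finds the threshold tau such that
    sum(max(s - tau, 0)) == 1, then clips everything below tau to 0.
    """
    z_sorted = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z in enumerate(z_sorted, start=1):
        cumsum += z
        # Position k is in the support while 1 + k * z_(k) > sum of top-k scores.
        if 1.0 + k * z > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]
```

For example, `sparsemax([2.0, 1.0, 0.5, -1.0])` puts all weight on the first entry, while closer scores such as `[0.5, 0.4]` share the weight between both entries.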
Abstract
Self-attention based Transformer has demonstrated the state-of-the-art performances in a number of natural language processing tasks. Self-attention is able to model long-term dependencies, but it may suffer from the extraction of irrelevant information in the context. To tackle the problem, we propose a novel model called Explicit Sparse Transformer. Explicit Sparse Transformer is able to improve the concentration of attention on the global context through an explicit selection of the most relevant segments. Extensive experimental results on a series of natural language processing and computer vision tasks, including neural machine translation, image captioning, and language modeling, all demonstrate the advantages of Explicit Sparse Transformer in model performance. We also show that our proposed sparse attention method achieves comparable or better results than the previous…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Sparsemax · Residual Connection · Attention Dropout · Linear Warmup With Cosine Annealing
