DA-Transformer: Distance-aware Transformer
Chuhan Wu, Fangzhao Wu, Yongfeng Huang

TL;DR
DA-Transformer extends the standard Transformer by explicitly incorporating real token distances into self-attention, and it outperforms the vanilla Transformer on five NLP benchmarks.
Contribution
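The paper introduces a distance-aware mechanism that re-scales self-attention weights using the real distances between tokens. Each attention head weights these distances with its own learnable parameter, and a sigmoid mapping turns the weighted distances into re-scaling coefficients.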
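The re-scaling idea can be illustrated with a minimal single-head sketch in plain Python. This is not the authors' implementation. The function names, the exact sigmoid form `(1 + e^v) / (1 + e^(v - x))`, the ReLU clipping of raw scores, and the parameter names `w` (per-head distance weight) and `v` (sigmoid shape parameter) are all assumptions chosen for illustration; only the general recipe (weight the real distance, map it through a sigmoid, multiply it into the attention scores) comes from the summary above.

```python
import math

def learnable_sigmoid(x, v):
    # Assumed sigmoid-shaped mapping: equals 1 at x = 0, tends to 0 as
    # x -> -inf and to 1 + e^v as x -> +inf. The exact form is an assumption.
    return (1.0 + math.exp(v)) / (1.0 + math.exp(v - x))

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def distance_aware_attention(Q, K, V, w, v):
    """Single-head attention with scores re-scaled by real token distance.

    Q, K, V are lists of equal-length row vectors; w and v are scalars that
    would be learned in a real model (hypothetical names).
    """
    n, d = len(Q), len(Q[0])
    out, attn = [], []
    for j in range(n):
        scores = []
        for k in range(n):
            dot = sum(q * kk for q, kk in zip(Q[j], K[k]))
            raw = max(dot, 0.0)  # assumption: clip to keep scores non-negative
            coeff = learnable_sigmoid(w * abs(j - k), v)  # real distance |j - k|
            scores.append(raw * coeff / math.sqrt(d))
        weights = softmax(scores)
        attn.append(weights)
        out.append([sum(weights[k] * V[k][c] for k in range(n))
                    for c in range(len(V[0]))])
    return out, attn

# Toy input: four identical tokens, so any difference in the attention
# weights comes from the distance term alone.
X = [[1.0, 0.0]] * 4
_, attn_local = distance_aware_attention(X, X, X, w=-1.0, v=0.0)
_, attn_global = distance_aware_attention(X, X, X, w=1.0, v=0.0)
```

In this sketch, a negative `w` makes the first token attend most strongly to itself and progressively less to more distant tokens, while a positive `w` reverses the ordering. Setting `w = 0` makes every coefficient equal to 1, which recovers attention with no distance information.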
Findings
Outperforms vanilla Transformer on five benchmarks
Effectively captures real token distances
Improves task performance with distance-aware attention
Abstract
Transformer has achieved great success in the NLP field by composing various advanced models like BERT and GPT. However, Transformer and its existing variants may not be optimal in capturing token distances because the position or distance embeddings used by these methods usually cannot keep the precise information of real distances, which may not be beneficial for modeling the orders and relations of contexts. In this paper, we propose DA-Transformer, which is a distance-aware Transformer that can exploit the real distance. We propose to incorporate the real distances between tokens to re-scale the raw self-attention weights, which are computed by the relevance between attention query and key. Concretely, in different self-attention heads the relative distance between each pair of tokens is weighted by different learnable parameters, which control the different preferences on long- or…
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Discriminative Fine-Tuning · Adam · Byte Pair Encoding · Softmax · Layer Normalization · Dense Connections
