DA-Transformer: Distance-aware Transformer

Chuhan Wu; Fangzhao Wu; Yongfeng Huang

arXiv:2010.06925·cs.CL·April 13, 2021

Open Access

TL;DR

DA-Transformer extends the standard Transformer by incorporating the real distances between tokens directly into self-attention, and outperforms the vanilla Transformer on five benchmarks.

Contribution

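The paper introduces a distance-aware mechanism that re-scales the raw self-attention weights using the real distances between tokens. In each attention head, the relative distances are weighted by a learnable parameter and then passed through a sigmoid mapping, so that different heads can favor different context ranges.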
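The re-scaling idea can be illustrated with a minimal single-head sketch. It is not the authors' implementation: the exact sigmoid form `(1 + e^v) / (1 + e^(v - r))`, the use of absolute distances `|i - j|`, the clipping of negative raw scores before re-scaling, and the names `learnable_sigmoid`, `distance_aware_attention`, `w`, and `v_param` are all assumptions made for illustration. In a real model `w` and `v_param` would be trained per head; here they are plain floats.

```python
import math


def learnable_sigmoid(r, v):
    """Map a weighted distance r to a positive re-scaling coefficient.

    Assumed form: (1 + e^v) / (1 + e^(v - r)). It equals 1 at r = 0,
    grows towards 1 + e^v for large positive r, and shrinks towards 0
    for large negative r.
    """
    return (1.0 + math.exp(v)) / (1.0 + math.exp(v - r))


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def distance_aware_attention(q, k, w, v_param):
    """Return attention weights for one head, re-scaled by token distance.

    q, k    : lists of n vectors (each a list of d floats)
    w       : hypothetical per-head distance weight; w < 0 favors nearby
              tokens, w > 0 favors distant tokens, w = 0 ignores distance
    v_param : hypothetical per-head parameter bounding the coefficient
    """
    n, d = len(q), len(q[0])
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            # Raw scaled dot-product relevance between query i and key j.
            raw = sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            # Coefficient derived from the real distance |i - j|.
            coeff = learnable_sigmoid(w * abs(i - j), v_param)
            # Assumption: clip negative scores so that the positive
            # coefficient never flips the ordering of the raw scores.
            row.append(max(raw, 0.0) * coeff)
        weights.append(softmax(row))
    return weights
```

With identical queries and keys, a negative `w` concentrates each row's attention on nearby positions and a positive `w` shifts it toward distant ones, which mirrors the per-head preference the contribution describes.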

Findings

01. Outperforms vanilla Transformer on five benchmarks

02. Effectively captures real token distances

03. Improves task performance with distance-aware attention

Abstract

Transformer has achieved great success in the NLP field by composing various advanced models like BERT and GPT. However, Transformer and its existing variants may not be optimal in capturing token distances because the position or distance embeddings used by these methods usually cannot keep the precise information of real distances, which may not be beneficial for modeling the orders and relations of contexts. In this paper, we propose DA-Transformer, which is a distance-aware Transformer that can exploit the real distance. We propose to incorporate the real distances between tokens to re-scale the raw self-attention weights, which are computed by the relevance between attention query and key. Concretely, in different self-attention heads the relative distance between each pair of tokens is weighted by different learnable parameters, which control the different preferences on long- or…

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Discriminative Fine-Tuning · Adam · Byte Pair Encoding · Softmax · Layer Normalization · Dense Connections