Source Dependency-Aware Transformer with Supervised Self-Attention
Chengyi Wang, Shuangzhi Wu, Shujie Liu

TL;DR
This paper introduces a dependency-aware Transformer model that explicitly incorporates source dependency trees into self-attention, leading to improved translation quality across multiple language pairs.
Contribution
It proposes a novel supervised self-attention mechanism using dependency trees, enhancing the Transformer without needing pre-parsed input during inference.
Findings
Significant translation improvements on Chinese-English, English-Japanese, and English-German tasks.
Supervised attention heads effectively learn source dependency relations.
Model outperforms the baseline Transformer on multiple translation benchmarks.
Abstract
Recently, the Transformer has achieved state-of-the-art performance on many machine translation tasks. However, because syntactic knowledge is not explicitly considered in the encoder, incorrect context information that violates the syntactic structure may be integrated into the source hidden states, leading to erroneous translations. In this paper, we propose a novel method to incorporate source dependencies into the Transformer. Specifically, we adopt the source dependency tree and define two matrices to represent the dependency relations. Based on these matrices, two heads in the multi-head self-attention module are trained in a supervised manner, and two extra cross-entropy losses are introduced into the training objective function. Under this training objective, the model is trained to learn the source dependency relations directly. Without requiring pre-parsed input during inference, our model can…
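The supervision described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say what the two matrices contain, so the sketch assumes one marks each token's parent (head) and the other spreads weight evenly over each token's children, with the root and leaf tokens pointing to themselves. The function names `dependency_targets` and `attention_cross_entropy` are hypothetical.

```python
import numpy as np


def dependency_targets(heads):
    """Build two row-stochastic target matrices from a dependency tree.

    heads[i] is the index of token i's head; the root uses -1.
    Assumption (not from the abstract): the root attends to itself in the
    parent matrix, and tokens without children attend to themselves in the
    child matrix.
    """
    n = len(heads)
    parent = np.zeros((n, n))
    child = np.zeros((n, n))
    for i, h in enumerate(heads):
        if h < 0:
            parent[i, i] = 1.0      # root: self-loop (assumption)
        else:
            parent[i, h] = 1.0      # token i should attend to its head
            child[h, i] = 1.0       # head h should attend to token i
    for i in range(n):
        if child[i].sum() == 0:
            child[i, i] = 1.0       # leaf: self-loop (assumption)
    child /= child.sum(axis=1, keepdims=True)  # uniform over children
    return parent, child


def softmax(scores):
    """Row-wise softmax, as used to turn attention scores into weights."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def attention_cross_entropy(attn, target, eps=1e-9):
    """Mean cross entropy between target rows and attention rows."""
    return float(-(target * np.log(attn + eps)).sum(axis=1).mean())


# Toy sentence "the cat sleeps": the -> cat, cat -> sleeps, sleeps = root.
heads = [1, 2, -1]
parent, child = dependency_targets(heads)

# Stand-ins for the attention weights of the two supervised heads.
rng = np.random.default_rng(0)
attn_parent_head = softmax(rng.normal(size=(3, 3)))
attn_child_head = softmax(rng.normal(size=(3, 3)))

loss_parent = attention_cross_entropy(attn_parent_head, parent)
loss_child = attention_cross_entropy(attn_child_head, child)
extra_loss = loss_parent + loss_child  # added to the translation loss
```

In training, `extra_loss` would be added to the usual translation loss, possibly with a weighting factor (the abstract does not say whether one is used). Because the targets are needed only to compute the loss, nothing in this setup requires a parser at inference time, which matches the claim in the abstract.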
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
