Syntax-guided Localized Self-attention by Constituency Syntactic Distance
Shengyuan Hou, Jushi Kai, Haotian Xue, Bingyu Zhu, Bo Yuan, Longtao Huang, Xinbing Wang, Zhouhan Lin

TL;DR
This paper introduces a syntax-guided localized self-attention mechanism for Transformers that leverages external constituency parsing to improve translation performance across various datasets and languages.
Contribution
It proposes a novel attention mechanism that incorporates external syntactic structures, enhancing Transformer performance without relying solely on data-driven syntactic learning.
Findings
Consistent improvement in translation quality across multiple datasets.
Effective incorporation of external syntactic information.
Enhanced performance with different source languages.
Abstract
Recent works have revealed that Transformers implicitly learn syntactic information in their lower layers from data, although this learning is highly dependent on the quality and scale of the training data. However, learning syntactic information from data is not necessary if we can leverage an external syntactic parser, which provides better parsing quality with well-defined syntactic structures. This could potentially improve the Transformer's performance and sample efficiency. In this work, we propose a syntax-guided localized self-attention for Transformers that directly incorporates grammar structures from an external constituency parser. It prevents the attention mechanism from overweighting grammatically distant tokens over close ones. Experimental results show that our model consistently improves translation performance on a variety of machine translation datasets, ranging from…
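To make the idea of syntax-localized attention concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a parser has already produced one syntactic distance per adjacent token pair, that the distance between any two tokens is the largest adjacent distance lying between them, and that a hard threshold decides which tokens may attend to each other. The function names, the `threshold` parameter, and the hard mask are all hypothetical choices made for illustration.

```python
import numpy as np


def pairwise_syntactic_distance(adjacent):
    """Expand adjacent-pair distances into an n x n matrix.

    adjacent[k] is the (assumed) syntactic distance between tokens k and k+1.
    The distance between tokens i < j is taken as max(adjacent[i:j]),
    which is an illustrative assumption, not a detail from the abstract.
    """
    n = len(adjacent) + 1
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = max(adjacent[i:j])
    return dist


def localized_attention(q, k, v, adjacent, threshold):
    """Scaled dot-product attention restricted to syntactically close tokens."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Hypothetical hard mask: block pairs whose distance exceeds the threshold.
    allowed = pairwise_syntactic_distance(adjacent) <= threshold
    scores = np.where(allowed, scores, -np.inf)
    # Numerically stable softmax; the diagonal is always allowed
    # (distance 0) as long as threshold >= 0, so no row is fully masked.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Four tokens forming two tight constituents: (t0 t1) and (t2 t3).
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
output, weights = localized_attention(q, k, v, adjacent=[1, 3, 1], threshold=1)
```

With these toy distances, tokens 0 and 1 attend only within their own constituent, and likewise tokens 2 and 3. A softer variant could penalize distant tokens instead of masking them outright; which form the paper uses is not stated in the abstract.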
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Residual Connection
