Promoting the Knowledge of Source Syntax in Transformer NMT Is Not Needed
Thuong-Hai Pham, Dominik Macháček, Ondřej Bojar

TL;DR
This paper investigates whether promoting source syntax knowledge in Transformer-based neural machine translation improves performance, finding that simple data manipulations are ineffective and that self-attention can inherently capture syntactic structure without explicit linguistic input.
Contribution
The study demonstrates that explicit source syntax promotion offers limited benefits in large data Transformer NMT models, highlighting the self-attention mechanism's ability to learn syntax implicitly.
Findings
The simple data manipulation techniques recommended in previous work prove ineffective in large-data settings.
Self-attention can naturally grasp syntactic structures without explicit guidance.
Trivial linear trees yield gains similar to those from true dependency trees, suggesting a regularization effect.
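To make the comparison between linear trees and true dependency trees concrete, here is a minimal sketch of what supervising one attention head with head indices could look like. It is not the authors' implementation: the function names, the cross-entropy objective, the self-attached root, and the choice that each token in a linear tree depends on the preceding token are all assumptions made for illustration.

```python
import math


def linear_tree_heads(n):
    """Hypothetical 'trivial linear tree': token i attaches to token i - 1.

    Assumption: token 0 attaches to itself as a stand-in for the root.
    """
    return [max(i - 1, 0) for i in range(n)]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def head_supervision_loss(scores, heads):
    """Mean cross-entropy between each token's attention row and its head index.

    scores[i][j] is the unnormalized attention score from token i to token j
    in the single supervised attention head (an assumed setup).
    """
    loss = 0.0
    for i, h in enumerate(heads):
        probs = softmax(scores[i])
        loss -= math.log(probs[h])
    return loss / len(heads)


# Usage: the same loss accepts heads from a parser or from the linear tree.
heads = linear_tree_heads(3)            # [0, 0, 1]
uniform_scores = [[0.0] * 3 for _ in range(3)]
uniform_loss = head_supervision_loss(uniform_scores, heads)
```

Because the loss only sees a list of head indices, swapping parser output for `linear_tree_heads` changes nothing else in the training setup, which is what makes the two conditions directly comparable.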
Abstract
The utility of linguistic annotation in neural machine translation seemed to have been established in past papers. The experiments were, however, limited to recurrent sequence-to-sequence architectures and relatively small data settings. We focus on the state-of-the-art Transformer model and use comparably larger corpora. Specifically, we try to promote the knowledge of source-side syntax using multi-task learning either through simple data manipulation techniques or through a dedicated model component. In particular, we train one of the Transformer attention heads to produce a source-side dependency tree. Overall, our results cast some doubt on the utility of multi-task setups with linguistic information. The data manipulation techniques, recommended in previous works, prove ineffective in large data settings. The treatment of self-attention as dependencies seems much more promising: it helps…
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
