Hierarchical Transformer for Multilingual Machine Translation
Albina Khusainova, Adil Khan, Adín Ramírez Rivera, Vitaly Romanov

TL;DR
This paper explores a hierarchical Transformer approach for multilingual machine translation that uses language relatedness to decide how parameters are shared, and demonstrates that, with a carefully chosen training strategy, it can outperform bilingual models and fully shared multilingual models.
Contribution
It applies a recently proposed hierarchical parameter sharing strategy, based on language relatedness, to Transformer models for multilingual translation, and identifies problems inherent to training such models.
Findings
Hierarchical models can outperform bilingual and fully shared multilingual models.
Careful training strategies are crucial for the success of hierarchical architectures.
Hierarchical sharing based on language relatedness improves translation quality.
Abstract
The choice of parameter sharing strategy in multilingual machine translation models determines how optimally the parameter space is used and hence directly influences the ultimate translation quality. Inspired by linguistic trees that show the degree of relatedness between different languages, a new general approach to parameter sharing in multilingual machine translation was recently suggested. The main idea is to use these expert language hierarchies as the basis for a multilingual architecture: the closer two languages are, the more parameters they share. In this work, we test this idea using the Transformer architecture and show that, despite the success reported in previous work, there are problems inherent to training such hierarchical models. We demonstrate that, with a carefully chosen training strategy, the hierarchical architecture can outperform bilingual models and multilingual models with…
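The sharing rule described in the abstract ("the closer two languages are, the more parameters they share") can be illustrated with a minimal sketch. This is not the authors' implementation: the three-language tree, the node names, and the helpers `build_modules`, `encoder_stack`, and `shared_modules` are all hypothetical, and each "module" is a plain placeholder object rather than a real Transformer layer.

```python
# Hypothetical language tree, for illustration only: each language maps to
# its path from the root group down to the language itself.
PATHS = {
    "es": ("indo-european", "romance", "es"),
    "it": ("indo-european", "romance", "it"),
    "de": ("indo-european", "germanic", "de"),
}


def build_modules(paths):
    """Create one placeholder parameter block per tree node.

    Nodes that appear in several paths are created once, so languages
    passing through the same node reuse the same object.
    """
    modules = {}
    for path in paths.values():
        for node in path:
            modules.setdefault(node, {"name": node})
    return modules


def encoder_stack(lang, modules):
    """Return the blocks a language uses, ordered from root to leaf."""
    return [modules[node] for node in PATHS[lang]]


def shared_modules(lang_a, lang_b):
    """Return the tree nodes that both languages pass through."""
    shared = []
    for node_a, node_b in zip(PATHS[lang_a], PATHS[lang_b]):
        if node_a != node_b:
            break
        shared.append(node_a)
    return shared


modules = build_modules(PATHS)
es_it = shared_modules("es", "it")  # closely related pair
es_de = shared_modules("es", "de")  # more distant pair
```

In this toy tree, the closely related pair shares two blocks and the more distant pair shares only the root block, which is the property the hierarchical architecture is built around. How blocks are actually arranged and trained in the paper is not specified in this summary.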
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Attention Is All You Need · Dropout · Residual Connection · Adam · Byte Pair Encoding · Label Smoothing
