Hierarchical Transformer for Multilingual Machine Translation

Albina Khusainova; Adil Khan; Adín Ramírez Rivera; Vitaly Romanov

arXiv:2103.03589 · cs.CL · March 8, 2021 · 1 citation

Open Access

TL;DR

This paper tests a hierarchical Transformer for multilingual machine translation that shares parameters according to language relatedness, and shows that, with a carefully chosen training strategy, it can outperform bilingual and fully shared multilingual models.

Contribution

It applies a hierarchical parameter sharing strategy based on language relatedness, suggested in earlier work, to Transformer models for multilingual translation, and identifies training problems inherent to such hierarchical models.

Findings

1. Hierarchical models can outperform bilingual and fully shared multilingual models.

2. Careful training strategies are crucial for the success of hierarchical architectures.

3. Hierarchical sharing based on language relatedness improves translation quality.

Abstract

The choice of parameter sharing strategy in multilingual machine translation models determines how optimally the parameter space is used and hence directly influences the ultimate translation quality. Inspired by linguistic trees that show the degree of relatedness between different languages, a new general approach to parameter sharing in multilingual machine translation was recently suggested. The main idea is to use these expert language hierarchies as a basis for the multilingual architecture: the closer two languages are, the more parameters they share. In this work, we test this idea using the Transformer architecture and show that, despite the success reported in previous work, there are problems inherent to training such hierarchical models. We demonstrate that, with a carefully chosen training strategy, the hierarchical architecture can outperform bilingual models and multilingual models with…
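
The sharing rule stated in the abstract (the closer two languages are, the more parameters they share) can be illustrated with a minimal sketch. It assumes that each node of a language tree owns one group of parameters and that a language uses every group on the path from the root to its leaf. The tree, the language names, and the functions `module_path` and `shared_modules` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Hypothetical language tree (an assumption for illustration only):
# each node maps to its parent, and the root has no parent.
PARENT = {
    "root": None,
    "turkic": "root",
    "slavic": "root",
    "tatar": "turkic",
    "kazakh": "turkic",
    "russian": "slavic",
}


def module_path(lang):
    """Return the chain of tree nodes from the root down to `lang`.

    Assumption: every node on this path owns one parameter group
    that the language uses.
    """
    path = []
    node = lang
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


def shared_modules(lang_a, lang_b):
    """Return the parameter groups that both languages use.

    This is the common prefix of the two root-to-leaf paths, so
    closer languages share a longer prefix.
    """
    shared = []
    for a, b in zip(module_path(lang_a), module_path(lang_b)):
        if a != b:
            break
        shared.append(a)
    return shared


print(shared_modules("tatar", "kazakh"))   # ['root', 'turkic']
print(shared_modules("tatar", "russian"))  # ['root']
```

In this toy tree the two languages under the same branch share two parameter groups, while languages from different branches share only the root group. How such groups map onto actual Transformer layers is a design decision this sketch does not attempt to reproduce.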

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Attention Is All You Need · Dropout · Residual Connection · Adam · Byte Pair Encoding · Label Smoothing