Can the Variation of Model Weights be used as a Criterion for Self-Paced Multilingual NMT?

Àlex R. Atrio; Alexis Allemann; Ljiljana Dolamic; Andrei Popescu-Belis

arXiv:2410.04147·cs.CL·October 8, 2024

Open Access

TL;DR

This paper proposes a novel criterion, based on the variation of model weights, for selecting the language of each minibatch in multilingual NMT. The criterion improves translation quality and convergence speed over alternating monolingual batches, but not over shuffled batches.

Contribution

It introduces a new algorithm that uses weight variation as a criterion for self-paced multilingual NMT training. The algorithm outperforms alternating monolingual batches, but not shuffled batches.

Findings

01

Outperforms alternating monolingual batch strategy.

02

Does not outperform shuffled batch strategy.

03

Gains in translation quality (BLEU and COMET) and convergence speed hold only relative to alternating monolingual batches.

Abstract

Many-to-one neural machine translation systems improve over one-to-one systems when training data is scarce. In this paper, we design and test a novel algorithm for selecting the language of minibatches when training such systems. The algorithm changes the language of the minibatch when the weights of the model do not evolve significantly, as measured by the smoothed KL divergence between all layers of the Transformer network. This algorithm outperforms the use of alternating monolingual batches, but not the use of shuffled batches, in terms of translation quality (measured with BLEU and COMET) and convergence speed.
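
The switching rule described in the abstract can be illustrated with a minimal sketch. The abstract only states that the minibatch language changes when a smoothed KL divergence over the Transformer layers shows that the weights are no longer evolving significantly; everything else here is an assumption made for illustration: turning each layer's weights into a distribution with a softmax, averaging the per-layer divergences, smoothing with an exponential moving average, the threshold value, the round-robin choice of the next language, and the names `weight_variation` and `LanguageSwitcher`. It is not the authors' implementation.

```python
import math


def softmax(values):
    """Normalise a flat list of weights into a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def weight_variation(prev_layers, curr_layers):
    """Mean per-layer KL divergence between two weight snapshots.

    Assumption: each layer is a flat list of floats, normalised with a softmax.
    """
    divergences = [
        kl_divergence(softmax(curr), softmax(prev))
        for prev, curr in zip(prev_layers, curr_layers)
    ]
    return sum(divergences) / len(divergences)


class LanguageSwitcher:
    """Hypothetical helper: moves to the next language when weights stall."""

    def __init__(self, languages, threshold=1e-4, smoothing=0.9):
        self.languages = list(languages)
        self.index = 0
        self.threshold = threshold  # assumed value, not taken from the paper
        self.smoothing = smoothing  # assumed EMA coefficient
        self.smoothed = None

    @property
    def current(self):
        return self.languages[self.index]

    def update(self, prev_layers, curr_layers):
        """Record one training step and return the language for the next batch."""
        variation = weight_variation(prev_layers, curr_layers)
        if self.smoothed is None:
            self.smoothed = variation
        else:
            self.smoothed = (
                self.smoothing * self.smoothed + (1 - self.smoothing) * variation
            )
        if self.smoothed < self.threshold:
            # Weights barely moved: switch language and restart the smoothing.
            self.index = (self.index + 1) % len(self.languages)
            self.smoothed = None
        return self.current


# Toy usage with two tiny "layers" instead of real Transformer parameters.
before = [[0.1, 0.2, 0.3], [1.0, -1.0]]
after = [[0.5, -0.4, 0.9], [1.0, -1.0]]
switcher = LanguageSwitcher(["fr", "de"])
next_language = switcher.update(before, after)
```

In a real training loop the snapshots would come from the model's parameters before and after an optimizer step, and the threshold would need tuning, since the scale of the divergence depends on how the weights are normalised.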

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings