Can the Variation of Model Weights be used as a Criterion for Self-Paced Multilingual NMT?
Alex R. Atrio, Alexis Allemann, Ljiljana Dolamic, Andrei Popescu-Belis

TL;DR
This paper proposes a novel criterion, based on the variation of model weights, for selecting the language of each minibatch in many-to-one multilingual NMT. It improves translation quality and convergence speed over alternating monolingual batches, but not over shuffled batches.
Contribution
It introduces a new algorithm that uses weight variation as a criterion for self-paced multilingual NMT training. The algorithm outperforms alternating monolingual batches but not shuffled batches.
Findings
Outperforms alternating monolingual batch strategy.
Does not outperform shuffled batch strategy.
Improves translation quality (BLEU and COMET) and convergence speed relative to alternating monolingual batches.
Abstract
Many-to-one neural machine translation systems improve over one-to-one systems when training data is scarce. In this paper, we design and test a novel algorithm for selecting the language of minibatches when training such systems. The algorithm changes the language of the minibatch when the weights of the model do not evolve significantly, as measured by the smoothed KL divergence between all layers of the Transformer network. This algorithm outperforms the use of alternating monolingual batches, but not the use of shuffled batches, in terms of translation quality (measured with BLEU and COMET) and convergence speed.
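The switching rule described in the abstract can be illustrated with a minimal sketch. The abstract does not say how weights are turned into distributions, how the divergence is smoothed, or what threshold is used, so everything here is an assumption made for illustration: each layer is a flat list of floats mapped to a distribution with a softmax, per-layer KL divergences are averaged, smoothing is an exponential moving average, and the names `layer_distribution`, `weight_variation`, and `LanguageSwitcher` are hypothetical. It is not the authors' implementation.

```python
import math


def layer_distribution(weights, eps=1e-12):
    """Map one layer's flat weight list to a probability vector.

    Assumption: a softmax over the weights; the paper may do this differently.
    """
    peak = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [max(e / total, eps) for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two probability vectors of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def weight_variation(prev_layers, curr_layers):
    """Mean per-layer KL divergence between two weight snapshots (assumed aggregation)."""
    values = [
        kl_divergence(layer_distribution(curr), layer_distribution(prev))
        for prev, curr in zip(prev_layers, curr_layers)
    ]
    return sum(values) / len(values)


class LanguageSwitcher:
    """Moves to the next language once the smoothed weight variation is small."""

    def __init__(self, languages, threshold=1e-4, smoothing=0.9):
        self.languages = list(languages)
        self.threshold = threshold  # hypothetical value, not from the paper
        self.smoothing = smoothing  # EMA factor, also an assumption
        self.index = 0
        self.smoothed = None

    @property
    def current(self):
        return self.languages[self.index]

    def update(self, prev_layers, curr_layers):
        """Record one training step and return the language for the next minibatch."""
        variation = weight_variation(prev_layers, curr_layers)
        if self.smoothed is None:
            self.smoothed = variation
        else:
            self.smoothed = (
                self.smoothing * self.smoothed + (1 - self.smoothing) * variation
            )
        if self.smoothed < self.threshold:
            # Weights have stopped evolving: change the minibatch language.
            self.index = (self.index + 1) % len(self.languages)
            self.smoothed = None  # restart smoothing for the new language
        return self.current
```

In a training loop, one would snapshot the layer weights before and after each optimizer step, call `update(prev, curr)`, and draw the next minibatch from the returned language. Because of the smoothing, a single step with no change does not trigger a switch if earlier steps still showed large variation.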
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
