Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task
Gaëtan Caillaut, Raheel Qader, Mariam Nakhlé, Jingshu Liu, Jean-Gabriel Barthélemy

TL;DR
This paper investigates the scaling laws of decoder-only models in multilingual machine translation. It finds that their loss can be predicted by a scaling law similar to the one established for large language models, with limitations at very large model sizes or under different data distributions.
Contribution
It provides the first detailed analysis of scaling laws for decoder-only models in multilingual translation, comparing different scaling methods and their effects on performance and efficiency.
Findings
Scaling laws for decoder-only models resemble those of large language models.
Scaling depth and scaling width yield similar improvements in test loss but affect efficiency differently.
The scaling laws have limitations at very large model sizes or under different data distributions.
Abstract
Recent studies have showcased remarkable capabilities of decoder-only models in many NLP tasks, including translation. Yet, the machine translation field has been largely dominated by encoder-decoder models based on the Transformer architecture. As a consequence, scaling laws of encoder-decoder models for neural machine translation have already been well studied, but decoder-only models have received less attention. This work explores the scaling laws of decoder-only models on the multilingual and multidomain translation task. We trained a collection of six decoder-only models, ranging from 70M to 7B parameters, on a sentence-level, multilingual and multidomain dataset. We conducted a series of experiments showing that the loss of decoder-only models can be estimated using a scaling law similar to the one discovered for large language models, but we also show that this scaling law has…
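As a rough illustration of what estimating loss with a scaling law can look like, the sketch below fits a pure power law L(N) = a * N^(-alpha) to (model size, loss) pairs by linear regression in log-log space. This is a minimal sketch and not the paper's method: the functional form is one common choice from the large language model literature, the intermediate model sizes are hypothetical (only the 70M and 7B endpoints come from the abstract), and the losses are synthetic values generated from assumed constants rather than reported results.

```python
import numpy as np


def fit_power_law(n_params, losses):
    """Fit L(N) = a * N**(-alpha) by least squares in log-log space.

    Taking logs gives log L = log a - alpha * log N, a straight line,
    so a degree-1 polynomial fit recovers both constants.
    """
    slope, intercept = np.polyfit(np.log(n_params), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(n_params, a, alpha):
    """Evaluate the fitted power law at one or more model sizes."""
    return a * np.asarray(n_params, dtype=float) ** (-alpha)


# Hypothetical model sizes: only the 70M and 7B endpoints are from the abstract.
sizes = np.array([70e6, 160e6, 410e6, 1e9, 3e9, 7e9])

# Synthetic, noise-free losses from assumed constants (NOT results from the paper).
true_a, true_alpha = 20.0, 0.1
synthetic_losses = true_a * sizes ** (-true_alpha)

a, alpha = fit_power_law(sizes, synthetic_losses)
print(f"fitted a = {a:.4f}, alpha = {alpha:.4f}")

# Extrapolating beyond the fitted range is exactly where such a law may stop holding.
print("extrapolated loss at 70B parameters:", float(predict_loss(70e9, a, alpha)))
```

On real measurements the fit would be noisy, and the findings above suggest that extrapolating it to much larger models or to a different data distribution should be treated with caution.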
Taxonomy
Topics: Computational Physics and Python Applications · Advanced Computational Techniques and Applications
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Dropout · Dense Connections
