Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task

Gaëtan Caillaut; Raheel Qader; Mariam Nakhlé; Jingshu Liu; Jean-Gabriel Barthélemy

arXiv:2409.15051·cs.CL·September 24, 2024

TL;DR

This paper investigates the scaling laws of decoder-only models in multilingual machine translation. It finds that their loss can be predicted by a scaling law similar to the one established for large language models, although that law has limitations at larger scales.

Contribution

The paper provides the first detailed analysis of scaling laws for decoder-only models in multilingual translation, comparing different scaling methods and their effects on performance and efficiency.

Findings

01

Scaling laws for decoder-only models resemble those of large language models.

02

Scaling depth and scaling width yield similar improvements in test loss but affect efficiency differently.

03

The scaling laws have limitations at very large model sizes and when the data distribution differs.

Abstract

Recent studies have showcased remarkable capabilities of decoder-only models in many NLP tasks, including translation. Yet, the machine translation field has been largely dominated by encoder-decoder models based on the Transformer architecture. As a consequence, scaling laws of encoder-decoder models for neural machine translation have already been well studied, but decoder-only models have received less attention. This work explores the scaling laws of decoder-only models on the multilingual and multidomain translation task. We trained a collection of six decoder-only models, ranging from 70M to 7B parameters, on a sentence-level, multilingual and multidomain dataset. We conducted a series of experiments showing that the loss of decoder-only models can be estimated using a scaling law similar to the one discovered for large language models, but we also show that this scaling law has…
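To make the idea of estimating loss from model size concrete, here is a minimal sketch of fitting a power law of the form L(N) = (N_c / N)^alpha, the parameter-count form commonly used for large language models. This is an illustration under stated assumptions, not the paper's fitting procedure: the functional form, the function names `fit_power_law` and `predict_loss`, the intermediate model sizes, and the loss values are all hypothetical. The losses are synthetic, generated from a known law so the fit can be checked; they are not results from the paper.

```python
import numpy as np


def fit_power_law(n_params, losses):
    """Fit L(N) = (N_c / N) ** alpha by least squares in log-log space.

    Taking logs gives log L = -alpha * log N + alpha * log N_c,
    which is a straight line in (log N, log L).
    """
    log_n = np.log(np.asarray(n_params, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    slope, intercept = np.polyfit(log_n, log_l, 1)
    alpha = -slope
    n_c = np.exp(intercept / alpha)
    return alpha, n_c


def predict_loss(n_params, alpha, n_c):
    """Evaluate the fitted law at one or more parameter counts."""
    return (n_c / np.asarray(n_params, dtype=float)) ** alpha


# Hypothetical setup: six sizes spread geometrically between 70M and 7B
# parameters. Only the endpoints and the count come from the abstract.
sizes = np.geomspace(70e6, 7e9, 6)

# Synthetic losses drawn from a known law (assumed values, NOT paper results).
true_alpha, true_n_c = 0.08, 1e13
synthetic_losses = (true_n_c / sizes) ** true_alpha

alpha, n_c = fit_power_law(sizes, synthetic_losses)
print(f"alpha = {alpha:.4f}, N_c = {n_c:.3e}")

# Extrapolating beyond the fitted range is exactly where such a law
# may stop holding, so treat this prediction with caution.
print("predicted loss at 13B parameters:", float(predict_loss(13e9, alpha, n_c)))
```

With real measurements the points would not lie exactly on a line, and comparing predictions against held-out larger models is one way to probe the limits the abstract alludes to.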

Taxonomy

Topics: Computational Physics and Python Applications · Advanced Computational Techniques and Applications

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Dropout · Dense Connections