Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
Paul Jeha, Anastasiia Sedova, Louis Béthune, Skyler Seto, Jes Frellsen, Pierre Ablin, Natalie Schluter

TL;DR
In data-constrained language model pre-training, mixing in high-resource auxiliary language data outperforms hyperparameter tuning, especially as model size increases, by diversifying training signals and improving downstream performance.
Contribution
This study systematically compares data mixing and hyperparameter tuning, demonstrating that data mixing yields larger gains in low-resource settings across multiple model scales.
Findings
Mixing outperforms hyperparameter tuning on validation loss and accuracy.
Mixing improves validation loss by as much as having 2-3x more target-language data would.
The benefit of mixing increases steeply with model size.
Abstract
For most languages of the world, language model pre-training operates in a data-constrained regime where models must repeat their training data many times, degrading generalization. Two remedies exist: aggressive hyperparameter tuning such as high weight decay, and mixing in data from a high-resource auxiliary language to directly aid the low-resource target. While hyperparameter tuning regularizes the model by shrinking weights to restrict network capacity, auxiliary data mixing uses a tunable mixing ratio to expand the training distribution and diversify the training signal with new knowledge. Both offer a principled way to improve training in a data-constrained domain. We compare these levers systematically across four model scales from 150M to 1.43B parameters, using Arabic as the low-resource target and English as the auxiliary, over approximately 1000 pre-training runs. Three…
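To make the mixing-ratio idea concrete, here is a minimal sketch of how a training stream might draw from a small target corpus and a large auxiliary corpus. The summary does not describe the paper's actual sampling procedure, so everything here is an assumption for illustration: the function name `mixed_stream`, the parameter `aux_ratio` (the probability that a given sample comes from the auxiliary language), and the choice to cycle through the target corpus when it runs out, which mimics the data repetition the abstract describes.

```python
import random


def mixed_stream(target_docs, aux_docs, aux_ratio, num_samples, seed=0):
    """Return a list of (source, doc) pairs for one hypothetical training stream.

    Each sample is drawn from the auxiliary corpus with probability
    `aux_ratio`, otherwise from the target corpus. Both corpora are cycled
    when exhausted, so a small target corpus is repeated many times.
    This is an illustrative assumption, not the paper's procedure.
    """
    if not 0.0 <= aux_ratio <= 1.0:
        raise ValueError("aux_ratio must be in [0, 1]")
    if not target_docs or not aux_docs:
        raise ValueError("both corpora must be non-empty")

    rng = random.Random(seed)
    stream = []
    target_idx = 0
    aux_idx = 0
    for _ in range(num_samples):
        if rng.random() < aux_ratio:
            # Draw the next auxiliary (high-resource) document.
            stream.append(("aux", aux_docs[aux_idx % len(aux_docs)]))
            aux_idx += 1
        else:
            # Draw the next target (low-resource) document, repeating as needed.
            stream.append(("target", target_docs[target_idx % len(target_docs)]))
            target_idx += 1
    return stream


if __name__ == "__main__":
    arabic = ["ar_doc_0", "ar_doc_1"]                 # placeholder target corpus
    english = [f"en_doc_{i}" for i in range(100)]     # placeholder auxiliary corpus
    for source, doc in mixed_stream(arabic, english, aux_ratio=0.5, num_samples=8):
        print(source, doc)
```

With `aux_ratio=0.0` the stream reduces to pure repetition of the target corpus, which is the setting where regularization such as weight decay is the only lever left; raising `aux_ratio` trades target repetitions for fresh auxiliary data within the same sample budget.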
