Warmstarting for Scaling Language Models

Neeratyoy Mallik; Maciej Janowski; Johannes Hog; Herilalaina Rakotoarison; Aaron Klein; Josif Grabocka; Frank Hutter

arXiv:2411.07340·cs.LG·November 13, 2024

Open Access

TL;DR

This paper investigates warmstarting large language model training from smaller models to reduce costs, focusing on hyperparameter transfer and stable training dynamics.

Contribution

It introduces methods for effective warmstarting using μTransfer and μP, enabling cost-efficient scaling of language models with preserved training stability.

Findings

01

Warmstarting retains optimal hyperparameters effectively.

02

Shrinkage and zero-padding facilitate transfer.

03

Perturbation with scaled initialization improves convergence.
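
The three findings above name operations (shrinkage, zero-padding, and perturbation with a scaled initialization) without spelling them out on this page. The snippet below is a minimal sketch of how they might compose for a single weight matrix, not the authors' implementation. The function name `warmstart_weight`, the `shrink` factor, the `(fan_in, fan_out)` layout, and the `1/sqrt(fan_in)` noise scale are all illustrative assumptions.

```python
import numpy as np


def warmstart_weight(w_small, new_shape, shrink=0.5, init_std=None, rng=None):
    """Grow a small weight matrix into a larger one (hypothetical helper).

    Assumed recipe, for illustration only:
      1. zero-pad the small weights into the larger shape,
      2. shrink the copied weights by a constant factor,
      3. perturb with a fresh random init whose std scales with width.
    """
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = w_small.shape
    new_rows, new_cols = new_shape
    if new_rows < rows or new_cols < cols:
        raise ValueError("new_shape must be at least as large as w_small.shape")

    # Step 1: zero-padding. The small weights fill the top-left block.
    padded = np.zeros(new_shape, dtype=w_small.dtype)
    padded[:rows, :cols] = w_small

    # Step 3 setup: assume rows are fan_in, so std = 1/sqrt(fan_in)
    # of the larger layer (an assumption, not taken from the paper).
    if init_std is None:
        init_std = 1.0 / np.sqrt(new_rows)
    noise = rng.normal(0.0, init_std, size=new_shape).astype(w_small.dtype)

    # Step 2 and 3: shrink the padded weights, then add the perturbation.
    return shrink * padded + noise


if __name__ == "__main__":
    small = np.ones((2, 3))
    large = warmstart_weight(small, (4, 6), rng=np.random.default_rng(0))
    print(large.shape)
```

Setting `init_std=0.0` disables the perturbation, which makes the shrink-and-pad part easy to inspect on its own.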

Abstract

Scaling model sizes to scale performance has worked remarkably well for the current large language model paradigm. The research and empirical findings of various scaling studies led to novel scaling results and laws that guide subsequent research. High training costs for contemporary scales of data and models result in a lack of thorough understanding of how to tune and arrive at such training setups. One direction to ameliorate the cost of pretraining large models is to warmstart the large-scale training from smaller models that are cheaper to tune. In this work, we attempt to understand if the behavior of optimal hyperparameters can be retained under warmstarting for scaling. We explore simple operations that allow the application of theoretically motivated methods of zero-shot transfer of optimal hyperparameters using μTransfer. We investigate the aspects that contribute to the…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques