Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu

TL;DR
This paper introduces a depthwise stacking operator, $G_{stack}$, that accelerates large language model pre-training by enabling scalable, efficient growth, demonstrated through extensive experiments and practical guidelines.
Contribution
It systematically evaluates growth operators, identifies $G_{stack}$ as highly effective, and provides empirical guidelines for its application in large-scale LLM pre-training.
Findings
$G_{stack}$ accelerates training and improves performance.
$G_{stack}$ scales well up to 7B LLMs and 750B tokens.
$G_{stack}$-grown models converge faster, with a 54.6% speedup.
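As a rough check on the speedup figure, the sketch below computes a relative speedup from the number of training tokens needed to reach the same loss. The token counts (194B for the grown model, 300B for the conventionally trained 7B baseline) are taken from the paper's full abstract rather than from this summary, and the helper `token_speedup` is a hypothetical name used only for illustration. It assumes speedup is measured as the ratio of token budgets minus one.

```python
def token_speedup(baseline_tokens, grown_tokens):
    """Relative speedup from reaching the same loss with fewer training tokens.

    Assumption: speedup = baseline budget / grown-model budget - 1.
    """
    if grown_tokens <= 0:
        raise ValueError("grown_tokens must be positive")
    return baseline_tokens / grown_tokens - 1.0


# Token counts in billions, as reported in the paper's full abstract
# (not stated in the findings list itself).
speedup = token_speedup(baseline_tokens=300, grown_tokens=194)
print(f"{speedup:.1%}")  # 54.6%
```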
Abstract
LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical obstacles: (1) lack of comprehensive evaluation, (2) untested viability for scaling, and (3) lack of empirical guidelines. To tackle (1), we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{stack}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP…
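To make the idea of depthwise stacking concrete, here is a minimal sketch, not the authors' implementation. It assumes the operator grows a model by repeating the whole trained layer stack a fixed number of times. The function name `stack_layers`, the `growth_factor` parameter, and the dictionary-based toy layers are all hypothetical stand-ins for real transformer blocks.

```python
import copy


def stack_layers(base_layers, growth_factor):
    """Return a deeper layer list by repeating the trained stack growth_factor times.

    Assumption: depthwise stacking copies the full small-model stack end to end.
    """
    if growth_factor < 1:
        raise ValueError("growth_factor must be >= 1")
    grown = []
    for _ in range(growth_factor):
        # Deep-copy each layer so the copies own independent parameters
        # and can diverge during continued training.
        grown.extend(copy.deepcopy(layer) for layer in base_layers)
    return grown


# Toy "layers": each dict stands in for one transformer block's parameters.
small_model = [{"name": f"block{i}", "weight": [float(i)] * 2} for i in range(3)]
large_model = stack_layers(small_model, growth_factor=4)

print(len(large_model))                              # 12
print([layer["name"] for layer in large_model[:6]])
# ['block0', 'block1', 'block2', 'block0', 'block1', 'block2']
```

The grown model starts from the small model's weights instead of a random initialization. This is the sense in which a smaller model is used to accelerate the training of a larger one.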
Taxonomy
Topics: Artificial Intelligence in Law
