Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training

Wenyu Du; Tongxu Luo; Zihan Qiu; Zeyu Huang; Yikang Shen; Reynold Cheng; Yike Guo; Jie Fu

arXiv:2405.15319·cs.CL·October 23, 2024·2 cites

TL;DR

This paper identifies a depthwise stacking operator, $G_{stack}$, that accelerates large language model pre-training by growing a larger model from a smaller one, and backs it with extensive experiments and practical guidelines.

Contribution

It systematically evaluates growth operators, identifies $G_{stack}$ as highly effective, and provides empirical guidelines for its application in large-scale LLM pre-training.

Findings

01

$G_{stack}$ accelerates training and improves performance.

02

$G_{stack}$ scales well up to 7B LLMs and 750B tokens.

03

The model converges faster, with a 54.6% speedup.

Abstract

LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical $\underline{O}$bstacles: ($O$1) lack of comprehensive evaluation, ($O$2) untested viability for scaling, and ($O$3) lack of empirical guidelines. To tackle $O$1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{stack}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP…
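
The depthwise stacking idea named in the abstract can be illustrated with a minimal sketch. It assumes that stacking means repeating the small model's full layer sequence a fixed number of times to initialize a deeper model. The function name `stack_depthwise`, the `growth_factor` parameter, and the dictionary-based toy layers are hypothetical illustrations, not the authors' code; a real implementation would operate on Transformer blocks in a deep learning framework.

```python
import copy


def stack_depthwise(layers, growth_factor):
    """Build a deeper layer list by repeating `layers` `growth_factor` times.

    Hypothetical sketch: `layers` stands in for the trained blocks of a
    small model; the result would initialize a larger model.
    """
    if growth_factor < 1:
        raise ValueError("growth_factor must be >= 1")
    stacked = []
    for _ in range(growth_factor):
        # Deep-copy each layer so the stacked copies do not share
        # parameters and can diverge during continued training.
        stacked.extend(copy.deepcopy(layer) for layer in layers)
    return stacked


# Toy "layers": plain dicts standing in for trained Transformer blocks.
small = [
    {"name": "layer0", "w": [0.1, 0.2]},
    {"name": "layer1", "w": [0.3, 0.4]},
]
large = stack_depthwise(small, growth_factor=2)
```

Here `large` holds four layers in the order layer0, layer1, layer0, layer1, each an independent copy of the corresponding small-model layer. Deep copying is a design choice for this sketch, so that later updates to one copy leave the others and the original untouched.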

Videos

Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training· slideslive
