Preparing Lessons for Progressive Training on Language Models
Yu Pan, Ye Yuan, Yichun Yin, Jiaxin Shi, Zenglin Xu, Ming Zhang, Lifeng Shang, Xin Jiang, Qun Liu

TL;DR
The paper introduces Apollo, a novel training method that enables efficient progressive expansion of language models by learning high-layer functions during low-layer training, significantly reducing resource use and environmental impact.
Contribution
Apollo is a new approach that prepares lessons for model expansion using low-value-prioritized sampling, weight sharing, and interpolation, improving training efficiency without pretrained models.
Findings
Achieves state-of-the-art acceleration ratios.
Rivals pretrained model-based methods in efficiency.
Reduces training time and environmental costs.
Abstract
The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prepAres lessons for exPanding Operations by Learning high-Layer functiOnality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model…
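The abstract names two mechanisms without giving their formulas here: sampling a training depth with priority on low values, and expanding depth by interpolation rather than stacking. The sketch below is one possible reading, not the authors' implementation. The 1/d weighting, the nearest-neighbor layer mapping, and all function names (lvps_probabilities, sample_depth, interpolate_layers) are assumptions made for illustration.

```python
import random


def lvps_probabilities(max_depth):
    """Return sampling probabilities for depths 1..max_depth.

    Assumption: weight each depth d by 1/d so that low values are
    sampled more often. The paper's actual distribution may differ.
    """
    weights = [1.0 / d for d in range(1, max_depth + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_depth(rng, max_depth):
    """Draw one training depth, favoring shallow depths."""
    depths = list(range(1, max_depth + 1))
    return rng.choices(depths, weights=lvps_probabilities(max_depth), k=1)[0]


def interpolate_layers(layers, target_depth):
    """Expand a shallow stack to target_depth by repeating neighbors.

    Assumption: new layer i copies old layer floor(i * n / target_depth),
    so each copy sits next to its source. Plain stacking would instead
    append a second copy of the whole stack after the first.
    """
    n = len(layers)
    return [layers[i * n // target_depth] for i in range(target_depth)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(lvps_probabilities(4))
    print([sample_depth(rng, 4) for _ in range(10)])
    print(interpolate_layers(["A", "B", "C"], 6))
```

Under this reading, expanding the three layers A, B, C to depth six yields A, A, B, B, C, C, whereas stacking would yield A, B, C, A, B, C.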
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Natural Language Processing Techniques
Methods: Adaptive Parameter-wise Diagonal Quasi-Newton Method
