LESA: Learnable LLM Layer Scaling-Up
Yifei Yang, Zouying Cao, Xinbei Ma, Yao Yao, Libo Qin, Zhi Chen, and Hai Zhao

TL;DR
LESA is a learnable depth scaling-up method for LLMs: a neural network predicts the parameters of layers inserted between adjacent existing layers, which gives better initialization and faster training than existing methods at reduced computational cost.
Contribution
The paper proposes LESA, a neural network-based approach to depth scaling-up of LLMs in which inter-layer parameters are learned rather than set by heuristic layer duplication, enabling faster and more effective training.
Findings
LESA achieves better performance than baseline methods.
LESA reduces training cost by over 50%.
LESA demonstrates robustness across various model sizes and tasks.
Abstract
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. However, existing depth scaling-up methods rely on empirical heuristic rules for layer duplication, which result in poorer initialization and slower convergence during continual pre-training. We propose LESA, a novel learnable method for depth scaling-up. By concatenating parameters from each layer and applying Singular Value Decomposition, we uncover latent patterns between layers, suggesting that inter-layer parameters can be learned. LESA uses a neural network to predict the parameters inserted between adjacent layers, enabling better initialization and faster training. Experiments show that LESA outperforms existing baselines,…
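The pipeline the abstract describes (concatenate per-layer parameters, apply SVD, predict a layer to insert between each adjacent pair) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the horizontal concatenation axis, the choice to insert a layer between every adjacent pair, and the `midpoint_predictor` stand-in for the trained network are all assumptions made for illustration.

```python
import numpy as np

def layer_coefficients(weights):
    """Express each layer's matrix in one shared SVD basis.

    weights: list of L arrays, each (d_out, d_in), e.g. the same
    projection matrix taken from every layer (assumed layout).
    Returns U (d_out, r) and a list of L coefficient blocks (r, d_in)
    such that U @ coeffs[i] reconstructs weights[i].
    """
    d_in = weights[0].shape[1]
    stacked = np.concatenate(weights, axis=1)            # (d_out, L * d_in)
    U, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    scaled = S[:, None] * Vt                             # (r, L * d_in)
    coeffs = [scaled[:, i * d_in:(i + 1) * d_in] for i in range(len(weights))]
    return U, coeffs

def midpoint_predictor(left, right):
    # Hypothetical stand-in for the learned predictor network:
    # it simply averages the coefficients of the two neighbours.
    return 0.5 * (left + right)

def expand_depth(weights, predictor=midpoint_predictor):
    """Insert one predicted layer between every adjacent pair of layers."""
    U, coeffs = layer_coefficients(weights)
    expanded = [weights[0]]
    for i in range(len(weights) - 1):
        new_coeff = predictor(coeffs[i], coeffs[i + 1])
        expanded.append(U @ new_coeff)   # map prediction back to weight space
        expanded.append(weights[i + 1])
    return expanded

rng = np.random.default_rng(0)
layers = [rng.standard_normal((4, 3)) for _ in range(3)]
deeper = expand_depth(layers)
print(len(layers), "->", len(deeper), "layers")
```

With the averaging stand-in, each inserted matrix reduces to the plain mean of its two neighbours, so the sketch shows only the data flow. In the method described above, a trained network would take the place of `midpoint_predictor`.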
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Speech Recognition and Synthesis · Neural Networks and Applications
