Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models
Ya Wang, Zhijian Zhuo, Yutao Zeng, Xun Zhou, Jian Yang, Xiaoqing Li

TL;DR
This paper introduces Scale-Distribution Decoupling (SDD), a novel method that stabilizes large language model training by decoupling weight scale and distribution, improving gradient stability and training efficiency.
Contribution
The paper proposes SDD, a new approach that explicitly separates scale and distribution in weight matrices to enhance training stability of large language models.
Findings
SDD stabilizes training across various LLM architectures.
SDD outperforms existing normalization techniques.
SDD is lightweight and compatible with current frameworks.
Abstract
Training stability is a persistent challenge in the pre-training of large language models (LLMs), particularly for architectures such as Post-Norm Transformers, which are prone to gradient explosion and dissipation. In this paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that stabilizes training by explicitly decoupling the scale and distribution of the weight matrix in fully-connected layers. SDD applies a normalization mechanism to regulate activations and a learnable scaling vector to maintain well-conditioned gradients, effectively preventing gradient explosion and dissipation. This separation improves optimization efficiency, particularly in deep networks, by ensuring stable gradient propagation. Experimental results demonstrate that our method stabilizes training across various LLM architectures and outperforms existing techniques in different…
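The mechanism described in the abstract (normalize the output of a fully-connected layer, then rescale it with a learnable vector) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sdd_linear`, the choice of root-mean-square normalization, and the `eps` constant are all assumptions made for illustration, since the abstract does not specify the exact normalization used.

```python
import math

def sdd_linear(x, weight, alpha, eps=1e-6):
    """Hypothetical SDD-style layer: y = alpha * normalize(W @ x).

    The RMS normalization here is an assumption; the abstract only says
    that "a normalization mechanism" is applied.
    """
    # Plain matrix-vector product of the fully-connected layer.
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weight]
    # Normalization removes the scale of W @ x and keeps only its distribution.
    rms = math.sqrt(sum(v * v for v in z) / len(z) + eps)
    # The scaling vector alpha (learnable in a real model) restores a scale per output unit.
    return [a * v / rms for a, v in zip(alpha, z)]

x = [1.0, -2.0, 0.5]
weight = [[0.2, -0.1, 0.4], [0.3, 0.5, -0.2]]
alpha = [1.0, 1.0]

y_small = sdd_linear(x, weight, alpha)
# Multiplying every weight by 100 leaves the output essentially unchanged,
# because the output scale is carried by alpha rather than by the weights.
y_large = sdd_linear(x, [[100 * w for w in row] for row in weight], alpha)
```

In this sketch the magnitude of the weight matrix no longer affects the magnitude of the layer output, which is one plausible reading of "decoupling the scale and distribution of the weight matrix".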
