Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models

Ya Wang; Zhijian Zhuo; Yutao Zeng; Xun Zhou; Jian Yang; Xiaoqing Li

arXiv:2502.15499·cs.CL·February 26, 2025

TL;DR

This paper introduces Scale-Distribution Decoupling (SDD), a novel method that stabilizes large language model training by decoupling weight scale and distribution, improving gradient stability and training efficiency.

Contribution

The paper proposes SDD, a new approach that explicitly separates scale and distribution in weight matrices to enhance training stability of large language models.

Findings

1. SDD stabilizes training across various LLM architectures.
2. SDD outperforms existing normalization techniques.
3. SDD is lightweight and compatible with current frameworks.

Abstract

Training stability is a persistent challenge in the pre-training of large language models (LLMs), particularly for architectures such as Post-Norm Transformers, which are prone to gradient explosion and dissipation. In this paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that stabilizes training by explicitly decoupling the scale and distribution of the weight matrix in fully-connected layers. SDD applies a normalization mechanism to regulate activations and a learnable scaling vector to maintain well-conditioned gradients, effectively preventing gradient explosion and dissipation. This separation improves optimization efficiency, particularly in deep networks, by ensuring stable gradient propagation. Experimental results demonstrate that our method stabilizes training across various LLM architectures and outperforms existing techniques in different…
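
The decoupling described in the abstract can be illustrated with a small NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation and not the code in the kaihemo/sdd repository: the abstract says only "a normalization mechanism" and "a learnable scaling vector", so the choice of RMS normalization, its placement after the matrix product, and the names `rms_normalize`, `sdd_linear`, and `alpha` are all hypothetical.

```python
import numpy as np

def rms_normalize(z, eps=1e-6):
    """Rescale each row of z to roughly unit root-mean-square.

    Assumption: RMS normalization stands in for the paper's
    unspecified "normalization mechanism".
    """
    rms = np.sqrt(np.mean(z * z, axis=-1, keepdims=True) + eps)
    return z / rms

def sdd_linear(x, weight, alpha, eps=1e-6):
    """Hypothetical fully-connected layer with scale and distribution decoupled.

    x:      (batch, d_in) input activations
    weight: (d_out, d_in) weight matrix; after normalization it only
            determines the distribution (direction) of the output
    alpha:  (d_out,) scaling vector that alone sets the output scale
            (it would be a trainable parameter in a real model)
    """
    z = x @ weight.T                      # ordinary linear projection
    return alpha * rms_normalize(z, eps)  # normalize, then rescale per feature

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
weight = rng.standard_normal((32, 16))
alpha = np.ones(32)

# Multiplying the weight matrix by 100 leaves the output essentially
# unchanged, because the normalization removes the weight's scale.
y_small = sdd_linear(x, weight, alpha)
y_large = sdd_linear(x, 100.0 * weight, alpha)
scale_gap = float(np.max(np.abs(y_small - y_large)))
```

In this sketch the magnitude of `weight` has almost no effect on the layer output, so the output scale is controlled by `alpha` alone. That is one plausible reading of how separating scale from distribution could keep activations, and therefore gradients, in a well-conditioned range.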

Code & Models

Repositories

kaihemo/sdd
PyTorch · Official

Taxonomy

Topics: Topic Modeling