GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

Tianhao Chen; Xin Xu; Zijing Liu; Pengxiang Li; Xinyuan Song; Ajay Kumar Jaiswal; Fan Zhang; Jishan Hu; Yang Wang; Hao Chen; Shizhe Diao; Shiwei Liu; Yu Li; Lu Yin; Can Yang

arXiv:2506.22049 · cs.LG · July 4, 2025



TL;DR

GPAS scales down intermediate activations in Pre-LN Transformers during pretraining while leaving their gradients unchanged, which improves convergence and performance across model sizes from 71M to 1B parameters.

Contribution

We introduce GPAS, a simple activation scaling method that preserves gradients, enhancing training stability and performance in Pre-LN Transformers and other architectures.

Findings

01. GPAS consistently improves model performance across sizes from 71M to 1B parameters.

02. GPAS enhances training stability in Pre-LN, Sandwich-LN, and DeepNorm architectures.

03. The method is versatile and can be combined with existing training approaches.

Abstract

Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. Although stable during pretraining and scalable to large model sizes, Pre-LN suffers from exponential growth in activation variance across layers, which causes the shortcut to dominate over sub-layer outputs in the residual connection and limits the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves the information in the activations intact and avoids the vanishing-gradient problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that…
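
The contrast the abstract draws between scaling activations and scaling gradients can be illustrated with a toy sketch. It is not the paper's implementation: the function names, the fixed `gate` value, and the stack of bare scaling steps (with no attention or MLP sub-layers) are all assumptions made for illustration. In an autodiff framework, the gradient-preserving variant would typically be obtained by wrapping the subtracted term in a stop-gradient operator; here the forward and backward passes are written out by hand so the example has no dependencies.

```python
def scaled_forward(x, gate):
    """Forward pass: shrink every activation by the factor (1 - gate)."""
    return [(1.0 - gate) * xi for xi in x]


def naive_backward(grad_out, gate):
    """Backward pass of plain scaling: the gradient shrinks by the same factor."""
    return [(1.0 - gate) * g for g in grad_out]


def preserving_backward(grad_out, gate):
    """Backward pass when the subtracted term is treated as a constant
    (stop-gradient): the upstream gradient passes through unchanged."""
    return list(grad_out)


def stack(x, grad_out, gate, depth, backward):
    """Apply the scaling `depth` times, then backpropagate through every step."""
    for _ in range(depth):
        x = scaled_forward(x, gate)
    for _ in range(depth):
        grad_out = backward(grad_out, gate)
    return x, grad_out


# Hypothetical values: two activations, unit upstream gradient, ten scaling steps.
acts, naive_grad = stack([4.0, -2.0], [1.0, 1.0], gate=0.5, depth=10,
                         backward=naive_backward)
_, kept_grad = stack([4.0, -2.0], [1.0, 1.0], gate=0.5, depth=10,
                     backward=preserving_backward)

print("activations:", acts)
print("naive gradient:", naive_grad)
print("preserved gradient:", kept_grad)
```

Both variants produce the same downscaled activations. With plain scaling the gradient decays geometrically with depth, whereas the gradient-preserving variant leaves it at its upstream value, which is the property the abstract attributes to GPAS.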


Code & Models

Models

KitsuVp/NeoLLM (Hugging Face model · 2.9k downloads · 1 like)


Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Advanced Neural Network Applications