SimpleGPT: Improving GPT via A Simple Normalization Strategy
Marco Chen, Xianbiao Qi, Yelin He, Jiaquan Ye, Rong Xiao

TL;DR
This paper introduces SimpleNorm, a normalization strategy that stabilizes activation scales and reduces the Hessian spectral norm, enabling larger learning rates and improved training stability for large GPT models.
Contribution
We propose SimpleNorm, a simple normalization method that improves GPT training by stabilizing activations and allowing larger learning rates, backed by theoretical analysis and extensive experiments.
Findings
SimpleNorm reduces the Hessian spectral norm.
Models trained with SimpleNorm tolerate 3-10x larger learning rates.
SimpleGPT outperforms baselines in training stability and performance.
Abstract
In this work, we revisit Transformer optimization through the lens of second-order geometry and establish a direct connection between architectural design, activation scale, the Hessian matrix, and the maximum tolerable learning rate. We introduce a simple normalization strategy, termed SimpleNorm, which stabilizes intermediate activation scales by construction. Then, by analyzing the Hessian of the loss with respect to network activations, we theoretically show that SimpleNorm significantly reduces the spectral norm of the Hessian, thereby permitting larger stable learning rates. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales of 1B, 1.4B, 7B, and 8B. Empirically, SimpleGPT, our SimpleNorm-based network, tolerates learning rates 3-10x larger than standard convention, consistently demonstrates strong optimization…
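The link the abstract draws between the Hessian spectral norm and the maximum tolerable learning rate can be illustrated on a toy problem. The sketch below is not the paper's analysis and does not implement SimpleNorm, whose definition is not given in this excerpt. It assumes a quadratic loss f(x) = 0.5 xᵀHx, for which plain gradient descent is stable only when the learning rate is below 2 / λ_max(H). The names `spectral_norm`, `gd_final_loss`, `H_sharp`, and `H_flat` are hypothetical and chosen for illustration only.

```python
import numpy as np


def spectral_norm(H, iters=200, seed=0):
    """Estimate the largest |eigenvalue| of a symmetric matrix by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(H.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = H @ v
        v = w / np.linalg.norm(w)
    # Rayleigh quotient of the converged unit vector
    return float(abs(v @ H @ v))


def gd_final_loss(H, lr, steps=200):
    """Run gradient descent on f(x) = 0.5 * x^T H x and return the final loss."""
    x = np.ones(H.shape[0])
    for _ in range(steps):
        x = x - lr * (H @ x)  # gradient of the quadratic is H x
    return float(0.5 * x @ H @ x)


# Two toy Hessians (assumed values): one sharp, one with a 10x smaller top eigenvalue.
H_sharp = np.diag([100.0, 1.0])
H_flat = np.diag([10.0, 1.0])

for name, H in [("sharp", H_sharp), ("flat", H_flat)]:
    lam = spectral_norm(H)
    max_lr = 2.0 / lam  # stability threshold for gradient descent on a quadratic
    below = gd_final_loss(H, 0.9 * max_lr)
    above = gd_final_loss(H, 1.1 * max_lr)
    print(f"{name}: lambda_max={lam:.1f}, max_lr={max_lr:.3f}, "
          f"loss below threshold={below:.3e}, loss above threshold={above:.3e}")
```

In this toy setting, shrinking the top eigenvalue by a factor of 10 raises the stable learning-rate threshold by the same factor. Real Transformer losses are not quadratic, so this only conveys the intuition behind the claim, not its magnitude.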
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and Data Classification
