SimpleGPT: Improving GPT via A Simple Normalization Strategy

Marco Chen; Xianbiao Qi; Yelin He; Jiaquan Ye; Rong Xiao

arXiv:2602.01212·cs.LG·February 3, 2026

TL;DR

This paper introduces SimpleNorm, a normalization strategy that stabilizes activation scales and reduces the Hessian spectral norm, enabling larger learning rates and improved training stability for large GPT models.

Contribution

We propose SimpleNorm, a simple normalization method that improves GPT training by stabilizing activations and allowing larger learning rates, backed by theoretical analysis and extensive experiments.
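As a rough illustration of what "stabilizing activations" can mean, the sketch below uses a generic RMS normalization applied to the output of a linear map. This is an assumption made for illustration only: the summary here does not give SimpleNorm's actual formula or placement, and the names `rms_normalize`, `linear`, and `rms` are hypothetical helpers, not the authors' code. It shows that the normalized output keeps an RMS near 1 even when the weights are scaled by 10x or 100x.

```python
import math
import random


def rms(v):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(x * x for x in v) / len(v))


def rms_normalize(v, eps=1e-6):
    """Generic RMS normalization (a stand-in, NOT the paper's SimpleNorm definition)."""
    denom = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / denom for x in v]


def linear(weights, x):
    """Plain matrix-vector product, with weights given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


random.seed(0)
d = 16
x = [random.gauss(0, 1) for _ in range(d)]
W = [[random.gauss(0, 1) / math.sqrt(d) for _ in range(d)] for _ in range(d)]

results = []
for scale in (1.0, 10.0, 100.0):
    scaled_W = [[scale * w for w in row] for row in W]
    raw = linear(scaled_W, x)        # activation scale grows with the weights
    normed = rms_normalize(raw)      # activation scale is pinned near 1
    results.append((scale, rms(raw), rms(normed)))
    print(f"scale={scale:6.1f}  raw RMS={rms(raw):10.4f}  normalized RMS={rms(normed):.4f}")
```

The raw RMS grows in proportion to the weight scale, while the normalized RMS stays fixed, which is the sense in which a normalization can bound activation scale "by construction".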

Findings

01  SimpleNorm reduces the Hessian spectral norm.

02  Models trained with SimpleNorm tolerate 3-10x larger learning rates.

03  SimpleGPT outperforms baselines in training stability and performance.

Abstract

In this work, we revisit Transformer optimization through the lens of second-order geometry and establish a direct connection between architectural design, activation scale, the Hessian matrix, and the maximum tolerable learning rate. We introduce a simple normalization strategy, termed SimpleNorm, which stabilizes intermediate activation scales by construction. Then, by analyzing the Hessian of the loss with respect to network activations, we theoretically show that SimpleNorm significantly reduces the spectral norm of the Hessian, thereby permitting larger stable learning rates. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales 1B, 1.4B, 7B and 8B. Empirically, SimpleGPT, our SimpleNorm-based network, tolerates learning rates $3\times$-$10\times$ larger than standard convention, consistently demonstrates strong optimization…
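The link between Hessian spectral norm and tolerable learning rate can be seen in the simplest possible setting. The sketch below is a toy example, not the paper's analysis: it runs plain gradient descent on a quadratic $f(x) = \tfrac{1}{2}\sum_i \lambda_i x_i^2$, whose Hessian is diagonal with eigenvalues $\lambda_i$. For this objective, gradient descent converges only when the learning rate is below $2/\lambda_{\max}$. The function names and the eigenvalue choices are hypothetical and chosen only for illustration.

```python
def gradient_descent_on_quadratic(eigenvalues, lr, steps=200):
    """Run GD on f(x) = 0.5 * sum(l_i * x_i**2) and return the final norm of x."""
    x = [1.0] * len(eigenvalues)
    for _ in range(steps):
        # The gradient along coordinate i is l_i * x_i.
        x = [xi - lr * l * xi for xi, l in zip(x, eigenvalues)]
    return sum(xi * xi for xi in x) ** 0.5


def max_stable_lr(eigenvalues):
    """Classical stability bound for GD on a quadratic: lr < 2 / lambda_max."""
    return 2.0 / max(eigenvalues)


sharp = [100.0, 1.0]  # large Hessian spectral norm
flat = [10.0, 1.0]    # spectral norm reduced by 10x

print("sharp bound:", max_stable_lr(sharp))
print("flat bound: ", max_stable_lr(flat))

# Just below the bound the iterates shrink; just above it they blow up.
print("sharp, lr=0.019:", gradient_descent_on_quadratic(sharp, 0.019))
print("sharp, lr=0.021:", gradient_descent_on_quadratic(sharp, 0.021))
print("flat,  lr=0.19: ", gradient_descent_on_quadratic(flat, 0.19))
```

In this toy case, cutting the largest Hessian eigenvalue by a factor of 10 raises the stable learning-rate bound by the same factor. Real Transformer training with adaptive optimizers is more complicated, so this should be read only as intuition for the direction of the effect.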

Code & Models

Models

🤗 KitsuVp/NeoLLM · model · 2.9k downloads · ♡ 1

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and Data Classification