nGPT: Normalized Transformer with Representation Learning on the Hypersphere

Ilya Loshchilov; Cheng-Ping Hsieh; Simeng Sun; Boris Ginsburg

arXiv:2410.01131·cs.LG·April 25, 2025

TL;DR

nGPT is a normalized Transformer architecture in which all vectors have unit norm and reside on a hypersphere. It reaches the same accuracy in 4 to 20 times fewer training steps, depending on the sequence length.

Contribution

The paper presents a novel normalized Transformer architecture with representation learning on the hypersphere, improving training efficiency and convergence speed.

Findings

01 nGPT reaches the same accuracy in 4 to 20 times fewer training steps.

02 All vectors forming the embeddings, MLP, attention matrices, and hidden states are normalized to unit norm, so they reside on a hypersphere.

03 The size of the speedup depends on the sequence length.

Abstract

We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are unit norm normalized. The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.
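The picture in the abstract, hidden states moving along the surface of a hypersphere with each layer contributing a displacement, can be sketched in a few lines of NumPy. The abstract does not state the update rule, so the interpolation form below, the step size `alpha`, and the helper names `unit_norm` and `hypersphere_step` are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np


def unit_norm(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere along `axis`."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)


def hypersphere_step(h, block_out, alpha):
    """Move h toward a block's output by step size alpha, then renormalize.

    Assumed update (hypothetical, for illustration):
        h_next = unit_norm(h + alpha * (unit_norm(block_out) - h))
    """
    h = unit_norm(h)
    target = unit_norm(block_out)
    return unit_norm(h + alpha * (target - h))


rng = np.random.default_rng(0)
h = unit_norm(rng.standard_normal((4, 8)))   # 4 token states of dimension 8
block_out = rng.standard_normal((4, 8))      # stand-in for an attention or MLP output
h_next = hypersphere_step(h, block_out, alpha=0.1)

# Cosine similarity to the block's suggested direction, before and after the step.
target = unit_norm(block_out)
cos_before = np.sum(h * target, axis=-1)
cos_after = np.sum(h_next * target, axis=-1)
```

With `alpha` between 0 and 1, each updated state still has unit norm and sits closer in angle to the direction the block suggested, which is one way to read "a displacement towards the target output predictions".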

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding