Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Hong Liu; Zhiyuan Li; David Hall; Percy Liang; Tengyu Ma

arXiv:2305.14342 · cs.LG · March 6, 2024 · 29 cites

TL;DR

Sophia is a scalable second-order optimizer that uses diagonal Hessian estimates and clipping to accelerate language model pre-training, reducing training time and compute while maintaining performance.

Contribution

The paper introduces Sophia, a lightweight second-order optimizer with diagonal Hessian estimates and clipping, improving training speed and efficiency for large language models.

Findings

01 Achieves 2x speed-up over Adam in language model training

02 Reduces total compute and wall-clock time by 50%

03 Maintains the same perplexity with fewer training steps

Abstract

Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of…
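The update rule described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sophia_like_step`, the hyperparameter names (`lr`, `beta1`, `beta2`, `rho`, `eps`), and their default values are assumptions made for illustration. The abstract as excerpted here does not say how the diagonal Hessian is estimated, so the sketch takes that estimate as an input and simply reuses the previous moving average on steps where no fresh estimate is supplied.

```python
import numpy as np

def sophia_like_step(theta, m, h, grad, hess_diag_est=None,
                     lr=1e-3, beta1=0.9, beta2=0.99, rho=1.0, eps=1e-12):
    """One clipped, diagonally pre-conditioned step (illustrative sketch only).

    theta, m, h, grad: arrays of the same shape.
    hess_diag_est: a fresh diagonal Hessian estimate, or None to reuse h.
    All hyperparameter names and defaults are hypothetical.
    """
    # Moving average of the gradients.
    m = beta1 * m + (1.0 - beta1) * grad

    # Moving average of the estimated diagonal Hessian, refreshed only
    # on steps where a new estimate is available.
    if hess_diag_est is not None:
        h = beta2 * h + (1.0 - beta2) * hess_diag_est

    # Divide by the Hessian average; flooring at eps guards against zero
    # or negative curvature (an assumption of this sketch).
    ratio = m / np.maximum(h, eps)

    # Element-wise clipping bounds every coordinate of the update by rho,
    # so no parameter moves by more than lr * rho in one step.
    update = np.clip(ratio, -rho, rho)

    theta = theta - lr * update
    return theta, m, h


# Usage on a toy 2-parameter problem (values chosen for illustration).
theta0 = np.zeros(2)
m0 = np.zeros(2)
h0 = np.zeros(2)
grad = np.array([1.0, -0.001])
hess = np.array([0.5, 1.0])

theta1, m1, h1 = sophia_like_step(theta0, m0, h0, grad, hess,
                                  lr=0.1, beta1=0.0, beta2=0.0, rho=1.0)
# Second step without a fresh Hessian estimate: h is carried over unchanged.
theta2, m2, h2 = sophia_like_step(theta1, m1, h1, grad, None,
                                  lr=0.1, beta1=0.0, beta2=0.0, rho=1.0)
```

In the first step of the usage example, the first coordinate has ratio 1.0 / 0.5 = 2.0, which is clipped to 1.0, while the second coordinate's ratio of -0.001 passes through unchanged. This is the sense in which clipping controls the worst-case update size.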

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications

Methods: Second-order Clipped Stochastic Optimization · Multi-Head Attention · Attention Is All You Need · GPT · Cosine Annealing · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer