FOCUS: First Order Concentrated Updating Scheme
Yizhou Liu, Ziming Liu, Jeff Gore

TL;DR
FOCUS is a new optimizer for large language model pre-training that handles gradient noise better than existing methods: in GPT-2 training it is more stable than Signum and faster than Adam.
Contribution
The paper introduces FOCUS, a novel optimizer that enhances Signum with attraction to moving averaged parameters, addressing gradient noise issues in LLM pre-training.
Findings
FOCUS outperforms Signum in stability during GPT-2 training.
FOCUS trains faster than Adam while maintaining stability.
Gradient noise significantly impacts LLM training efficiency.
Abstract
Large language models (LLMs) demonstrate remarkable performance, and improving their pre-training process appears to be key to enhancing their capabilities further. Based on the documented success of Adam, learning rate decay, and weight decay, we hypothesize that the pre-training loss landscape features a narrowing valley structure. Through experiments with synthetic loss functions, we discover that when gradient query noise is high relative to the valley's sharpness, Adam's performance falls behind that of Signum because Adam reduces the effective step size too drastically. This observation led us to develop FOCUS, an optimizer that enhances Signum by incorporating attraction toward moving averaged parameters, allowing it to handle noise better while maintaining larger step sizes. In training GPT-2, FOCUS proves to be more stable than Signum and faster than Adam. These results suggest…
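The abstract describes FOCUS as Signum plus an attraction toward moving-averaged parameters, but this page does not state the update rule. The sketch below is one plausible reading of that description, not the authors' implementation: it takes a Signum step on the sign of the gradient momentum and adds a second sign term that pulls each parameter toward an exponential moving average of the parameters. The function name `focus_like_step`, the attraction weight `gamma`, the decay rates `beta1` and `beta2`, and all default values are assumptions made for illustration.

```python
import numpy as np

def focus_like_step(theta, grad, m, theta_bar,
                    lr=0.01, beta1=0.9, beta2=0.99, gamma=0.2):
    """One illustrative step of a Signum-style update with attraction
    toward moving-averaged parameters (hypothetical form, see lead-in)."""
    # Signum part: exponential moving average of the gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    # Exponential moving average of the parameters themselves.
    theta_bar = beta2 * theta_bar + (1.0 - beta2) * theta
    # Sign of the momentum, plus a sign-based pull toward the average.
    direction = np.sign(m) + gamma * np.sign(theta - theta_bar)
    theta = theta - lr * direction
    return theta, m, theta_bar

if __name__ == "__main__":
    # Toy run on a noisy quadratic loss 0.5 * theta**2 (not from the paper).
    rng = np.random.default_rng(0)
    theta = np.array([1.0, -2.0])
    m = np.zeros_like(theta)
    theta_bar = theta.copy()
    for _ in range(500):
        noisy_grad = theta + rng.normal(scale=1.0, size=theta.shape)
        theta, m, theta_bar = focus_like_step(theta, noisy_grad, m, theta_bar)
    print("final parameters:", theta)
```

With `gamma=0` the step reduces to plain Signum, so in this sketch the attraction term is the only difference between the two. Because both terms are signs, the per-coordinate step size stays between `lr * (1 - gamma)` and `lr * (1 + gamma)` whenever the momentum is nonzero, which is one way to read the abstract's remark about maintaining larger step sizes under noise.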
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Cosine Annealing · Softmax · Residual Connection · Dropout · Byte Pair Encoding · Attention Dropout · Linear Layer · Linear Warmup With Cosine Annealing
