Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension
Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds

TL;DR
This paper uses structured approximations of the Fisher information matrix (FIM) to design memory-efficient optimizers for large language models, and proposes new optimizers that converge faster and use less memory than existing baselines.
Contribution
The paper provides a systematic framework linking FIM approximation to optimizer design and introduces two new optimizers, RACS and Alice, which outperform existing baselines on LLaMA pre-training.
Findings
Alice achieves over 2x faster convergence than Adam.
RACS performs well with SGD-like memory usage.
Both optimizers outperform existing baselines in the LLaMA pre-training experiments.
Abstract
Designing efficient optimizers for large language models (LLMs) with low-memory requirements and fast convergence is an important and challenging problem. This paper makes a step towards the systematic design of such optimizers through the lens of structured Fisher information matrix (FIM) approximation. We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. Building on these insights, we propose two design recommendations of practical efficient optimizers for LLMs, involving the careful selection of structural assumptions to balance generality and efficiency, and enhancing memory efficiency of optimizers with general structures through a novel low-rank extension framework. We demonstrate how to use each design approach by deriving new memory-efficient optimizers: Row and…
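The idea of treating an optimizer as a structured FIM approximation under the Frobenius norm can be illustrated with a minimal sketch. The example below assumes the simplest possible structure, a diagonal matrix, and a toy set of random per-sample gradients; the function names and the data are hypothetical and are not taken from the paper. Among all diagonal matrices, the one closest to the empirical FIM in Frobenius norm simply keeps the FIM's diagonal, which equals the mean of the squared gradients, the same quantity that Adam-style methods track as a second moment.

```python
import numpy as np


def empirical_fisher(grads: np.ndarray) -> np.ndarray:
    """Empirical FIM: average outer product of per-sample gradients, (n, d) -> (d, d)."""
    return grads.T @ grads / grads.shape[0]


def best_diagonal_approx(fisher: np.ndarray) -> np.ndarray:
    """Frobenius-optimal diagonal approximation: keep the diagonal, zero the rest."""
    return np.diag(np.diag(fisher))


def frobenius_error(approx: np.ndarray, fisher: np.ndarray) -> float:
    """Frobenius-norm distance between an approximation and the full matrix."""
    return float(np.linalg.norm(approx - fisher, ord="fro"))


# Toy data (an assumption for illustration): 256 correlated gradients in 5 dimensions.
rng = np.random.default_rng(0)
grads = rng.normal(size=(256, 5)) @ rng.normal(size=(5, 5))

fisher = empirical_fisher(grads)
diag_approx = best_diagonal_approx(fisher)

# The optimal diagonal equals the elementwise mean of squared gradients.
second_moment = (grads ** 2).mean(axis=0)

# Any other diagonal matrix is strictly further from the FIM in Frobenius norm.
perturbed = diag_approx + np.diag(rng.normal(scale=0.1, size=5))
err_optimal = frobenius_error(diag_approx, fisher)
err_perturbed = frobenius_error(perturbed, fisher)

# Scaling the mean gradient by the inverse square root of the diagonal gives
# an elementwise-normalized step (no momentum, no bias correction).
eps = 1e-8
update = grads.mean(axis=0) / (np.sqrt(second_moment) + eps)
```

A diagonal structure needs only one extra vector of optimizer state, which is the kind of trade-off between generality and memory that the abstract describes. Richer structures capture more of the FIM at a higher memory cost.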
Taxonomy
Topics: Model Reduction and Neural Networks · Advanced Adaptive Filtering Techniques · Control Systems and Identification
Methods: Adam · Stochastic Gradient Descent · LLaMA
