Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension

Wenbo Gong; Meyer Scetbon; Chao Ma; Edward Meeds

arXiv:2502.07752·cs.LG·February 21, 2025

Open Access

TL;DR

This paper uses structured approximations of the Fisher information matrix (FIM) to design memory-efficient optimizers for large language models, and proposes two new optimizers that converge faster and use memory more efficiently than existing baselines.

Contribution

It provides a systematic framework linking FIM approximation to optimizer design and introduces two novel optimizers, RACS and Alice, with demonstrated superior performance on LLaMA pre-training.

Findings

01 Alice achieves over 2x faster convergence than Adam.

02 RACS performs well while using SGD-like memory.

03 Both optimizers outperform existing baselines in the LLaMA pre-training experiments.

Abstract

Designing efficient optimizers for large language models (LLMs) with low memory requirements and fast convergence is an important and challenging problem. This paper takes a step towards the systematic design of such optimizers through the lens of structured Fisher information matrix (FIM) approximation. We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. Building on these insights, we propose two design recommendations for practical, efficient optimizers for LLMs: carefully selecting structural assumptions to balance generality and efficiency, and enhancing the memory efficiency of optimizers with general structures through a novel low-rank extension framework. We demonstrate how to use each design approach by deriving new memory-efficient optimizers: Row and…
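To make the idea of FIM approximation under the Frobenius norm concrete, here is a minimal sketch for the simplest structural assumption, a diagonal matrix. It is an illustration, not code from the paper: the random gradients, the empirical Fisher built from them, and all function names are assumptions made for this example. It checks numerically that the best diagonal approximation of the empirical Fisher is its own diagonal, that is, the mean of the squared gradients, which is the kind of second-moment statistic that Adam-style optimizers track.

```python
import random


def empirical_fisher(grads):
    """Average of the outer products g g^T over a list of gradient vectors."""
    n, d = len(grads), len(grads[0])
    return [[sum(g[i] * g[j] for g in grads) / n for j in range(d)]
            for i in range(d)]


def diagonal_matrix(values):
    """Square matrix with `values` on the diagonal and zeros elsewhere."""
    d = len(values)
    return [[values[i] if i == j else 0.0 for j in range(d)] for i in range(d)]


def frobenius_sq(a, b):
    """Squared Frobenius norm of the difference between two square matrices."""
    d = len(a)
    return sum((a[i][j] - b[i][j]) ** 2 for i in range(d) for j in range(d))


random.seed(0)
dim, num_samples = 4, 64

# Hypothetical stand-in for per-sample gradients of a model.
grads = [[random.gauss(0.0, 1.0) for _ in range(dim)]
         for _ in range(num_samples)]

fisher = empirical_fisher(grads)

# Candidate solution: the mean of the element-wise squared gradients.
second_moment = [sum(g[i] ** 2 for g in grads) / num_samples
                 for i in range(dim)]
best_err = frobenius_sq(diagonal_matrix(second_moment), fisher)

# Any other diagonal matrix should approximate the Fisher no better.
perturbed_errs = []
for _ in range(100):
    candidate = [v + random.uniform(-0.5, 0.5) for v in second_moment]
    perturbed_errs.append(frobenius_sq(diagonal_matrix(candidate), fisher))

print("error of mean-squared-gradient diagonal:", best_err)
print("smallest error among perturbed diagonals:", min(perturbed_errs))
```

The remaining error comes entirely from the off-diagonal entries, which a diagonal structure cannot represent. Richer structural assumptions trade extra memory and computation for the ability to capture some of that off-diagonal information.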

Taxonomy

Topics: Model Reduction and Neural Networks · Advanced Adaptive Filtering Techniques · Control Systems and Identification

Methods: Adam · Stochastic Gradient Descent · LLaMA