Adam-mini: Use Fewer Learning Rates To Gain More

Yushun Zhang; Congliang Chen; Ziniu Li; Tian Ding; Chenwei Wu; Diederik P. Kingma; Yinyu Ye; Zhi-Quan Luo; Ruoyu Sun

arXiv:2406.16793 · cs.LG · February 25, 2025 · 1 cites

Open Access · 1 Repo · 1 Model · 3 Reviews

TL;DR

Adam-mini is a memory-efficient optimizer that reduces resource usage by simplifying the learning rate structure, achieving comparable or better performance than AdamW across various language models.

Contribution

The paper introduces Adam-mini, a novel optimizer that reduces memory footprint by removing unnecessary learning rates based on Hessian structure analysis, with demonstrated empirical improvements.

Findings

01

Adam-mini matches or surpasses AdamW performance.

02

It achieves 49.6% higher throughput than AdamW when pre-training Llama 2-7B.

03

It significantly reduces memory and communication overhead.

Abstract

We propose Adam-mini, an optimizer that achieves on par or better performance than AdamW with 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). By investigating the Hessian structure of neural nets, we find Adam's $v$ might not function at its full potential as effectively as we expected. We find that $\geq$ 99.9% of these learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our new principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We then provide one simple way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 39M to 13B for pre-training, supervised fine-tuning, and RLHF. The reduced…
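The two steps in the abstract (partition parameters into blocks, then give each block a single learning rate) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the per-block statistic is the mean of the squared gradients in that block, which the abstract does not spell out, and it takes the partition as given rather than deriving it from Hessian structure. The names `adam_mini_step` and `optimizer_state_size` are hypothetical, and weight decay is omitted.

```python
import numpy as np

def adam_mini_step(params, grads, m, v, step,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One block-wise update (hypothetical helper, not the official API).

    params, grads, m: lists of arrays, one entry per parameter block.
    v: list of floats, ONE scalar second-moment estimate per block.
    step: 1-based iteration counter used for bias correction.
    """
    for i, (p, g) in enumerate(zip(params, grads)):
        # First moment is kept per coordinate, as in Adam.
        m[i] = beta1 * m[i] + (1 - beta1) * g
        # Assumption: the block-level second moment is the mean of g*g.
        v[i] = beta2 * v[i] + (1 - beta2) * float(np.mean(g * g))
        m_hat = m[i] / (1 - beta1 ** step)
        v_hat = v[i] / (1 - beta2 ** step)
        # Every coordinate in the block shares the scale 1/sqrt(v_hat).
        p -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

def optimizer_state_size(params):
    """Count stored state entries: (Adam: m and v per coordinate,
    block-wise variant: m per coordinate plus one scalar per block)."""
    n = sum(p.size for p in params)
    return 2 * n, n + len(params)

# Two illustrative blocks; a real partition would follow the paper's principle.
params = [np.ones((4, 3)), np.ones(5)]
grads = [np.full((4, 3), 2.0), np.full(5, 2.0)]
m = [np.zeros_like(p) for p in params]
v = [0.0, 0.0]
params, m, v = adam_mini_step(params, grads, m, v, step=1)
adam_entries, mini_entries = optimizer_state_size(params)
```

In this sketch the state shrinks from 2n entries to n plus the number of blocks. When blocks are large, that approaches half of Adam's optimizer state, which is consistent with the 50% figure quoted in the abstract.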

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Adam-mini cuts memory use by almost half (up to 50%) vs AdamW, by smartly dividing params based on Hessian structures and using block-level learning rates; this gives major memory savings without losing much performance on big language models. 2. By reducing communication overheads between GPUs, Adam-mini boosts throughput by about 49.6% during Llama 2-7B training and lowers training time by 33%, which is really helpful in setups with limited resources. 3. Adam-mini provides a more effic
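The throughput and training-time figures quoted in this review are consistent with each other. As a quick check, assume both optimizers process the same number of tokens and that throughput means tokens per unit time; the symbols $T_{\text{AdamW}}$ and $T_{\text{mini}}$ (wall-clock times) are introduced here for illustration only.

```latex
\frac{T_{\text{mini}}}{T_{\text{AdamW}}}
  = \frac{1}{1 + 0.496}
  \approx 0.668,
\qquad
1 - 0.668 \approx 0.33 .
```

Under those assumptions, a 49.6% throughput gain corresponds to roughly a 33% reduction in wall-clock time.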

Weaknesses

- This study mostly benchmarks transformer models, leaving its effectiveness on other architectures such as CNNs and RNNs underexplored; more benchmarking could either show its limitations or confirm its ability to adapt to other model types. - The influence of the Hessian-based learning rate grouping on different gradient structures wasn’t deeply investigated, and comparing it with fully adaptive methods would make its efficiency clearer. - Optimizer stability over long training dur

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. This paper studies an important problem and provides an interesting and novel solution. 2. The paper is well-written with clear motivation and good presentation flow. 3. Extensive experiments are conducted to justify Adam-mini's effectiveness.

Weaknesses

1. Integrating Adam-mini into existing training frameworks seems non-trivial. For example, according to Algo 2, we need to specify the partitions manually for large models. Is there any solution to automatically get a good partition given general PyTorch models? 2. It would be good to add discussions/insights about how to adapt hyper-parameters when switching from AdamW to Adam-mini. 3. In addition to training curves, it would be better to have comparisons after model convergence. Additionally, i

Reviewer 03 · Rating 8 · Confidence 3

Strengths

The paper is well-written, and the proposed idea is novel. The approach, though simple, achieves a substantial reduction in memory usage, higher throughput, and shorter wall-clock time without compromising performance. Given the increasing adoption of LLMs and their high demands on memory resources, this method has the potential to be impactful in memory-constrained LLM applications.

Weaknesses

The method proposed in the paper does not seem to address large-scale non-Transformer models. Although Section C.2 includes experiments on various non-LLM tasks, it would be helpful to discuss whether Adam-mini offers advantages for large-scale non-LLM applications or if AdamW is preferred in these cases. Moreover, given the success of the method on GPT-2 and Llama series and its potential impact, demonstrating the effectiveness of Adam-mini on more diverse models, such as BERT-like and vision

Code & Models

Repositories

zyushun/adam-mini
PyTorch · Official

Models

🤗
Menlo/llama3-s-2024-07-08
model · 21 dl · ♡ 10

Taxonomy

Topics: Advanced Adaptive Filtering Techniques · Music and Audio Processing

Methods: LLaMA · Adaptive Moment Estimation - Mini · Adam · AdamW