Adam-mini: Use Fewer Learning Rates To Gain More
Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P. Kingma, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun

TL;DR
Adam-mini is a memory-efficient optimizer that reduces resource usage by simplifying the learning rate structure, achieving comparable or better performance than AdamW across various language models.
Contribution
The paper introduces Adam-mini, a novel optimizer that reduces memory footprint by removing unnecessary learning rates based on Hessian structure analysis, with demonstrated empirical improvements.
Findings
Adam-mini matches or surpasses AdamW performance.
It achieves 49.6% higher throughput than AdamW when pre-training Llama 2-7B.
It significantly reduces memory and inter-GPU communication overhead.
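The headline numbers can be sanity-checked with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that are not stated on this page: that training time for a fixed token budget scales inversely with throughput, and that the memory saving refers to optimizer state, where Adam keeps two values (m and v) per parameter and Adam-mini keeps one m per parameter plus one v per block. The function names are hypothetical.

```python
def time_reduction_from_throughput_gain(gain):
    """Fractional wall-clock saving for a fixed workload.

    Assumes time is inversely proportional to throughput,
    so a gain of 0.496 means throughput is 1.496x the baseline.
    """
    return 1.0 - 1.0 / (1.0 + gain)


def optimizer_state_saving(num_params, num_blocks):
    """Fractional optimizer-state saving relative to Adam.

    Assumes Adam stores m and v per parameter (2 * num_params values),
    while the block-wise variant stores m per parameter and a single
    v per block (num_params + num_blocks values).
    """
    adam_state = 2 * num_params
    blockwise_state = num_params + num_blocks
    return 1.0 - blockwise_state / adam_state


# A 49.6% throughput gain corresponds to roughly one third less time.
time_saving = time_reduction_from_throughput_gain(0.496)
print(f"time saving: {time_saving:.3f}")

# Keeping one v per 1000 parameters (i.e. removing 99.9% of v)
# leaves the optimizer state just under 50% smaller.
state_saving = optimizer_state_saving(num_params=1000, num_blocks=1)
print(f"optimizer-state saving: {state_saving:.4f}")
```

Under these assumptions, the 49.6% throughput figure and the 33% training-time figure quoted in the reviews describe the same speedup, and the "almost half" memory claim follows directly from dropping nearly all of v.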
Abstract
We propose Adam-mini, an optimizer that achieves on par or better performance than AdamW with 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). By investigating the Hessian structure of neural nets, we find Adam's $v$ might not function at its full potential as effectively as we expected. We find that 99.9% of these learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our new principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We then provide one simple way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 39M to 13B for pre-training, supervised fine-tuning, and RLHF. The reduced…
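The two steps in the abstract (partition the parameters into blocks, then give each block a single learning rate) can be illustrated with a small pure-Python sketch. This is not the authors' implementation. It assumes the per-block second moment is an exponential moving average of the mean squared gradient within the block, with standard Adam bias correction and decoupled weight decay. The function name `adam_mini_step`, the flat-list parameter layout, and the `blocks` index ranges are hypothetical, and the Hessian-based partitioning rule itself is not modeled here.

```python
import math


def adam_mini_step(params, grads, m, v_block, blocks, step,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                   weight_decay=0.0):
    """Apply one block-wise Adam-style update in place.

    params, grads, m : flat lists with one entry per parameter
    v_block          : one second-moment scalar per block
    blocks           : list of (start, stop) index ranges (assumed partition)
    step             : 1-based step counter used for bias correction
    """
    for b, (start, stop) in enumerate(blocks):
        block_grads = grads[start:stop]
        # One second-moment value for the whole block:
        # EMA of the mean squared gradient (assumption of this sketch).
        mean_sq = sum(g * g for g in block_grads) / len(block_grads)
        v_block[b] = beta2 * v_block[b] + (1.0 - beta2) * mean_sq
        v_hat = v_block[b] / (1.0 - beta2 ** step)
        denom = math.sqrt(v_hat) + eps

        for i in range(start, stop):
            # The first moment is still tracked per parameter, as in Adam.
            m[i] = beta1 * m[i] + (1.0 - beta1) * grads[i]
            m_hat = m[i] / (1.0 - beta1 ** step)
            params[i] -= lr * weight_decay * params[i]  # decoupled decay
            params[i] -= lr * m_hat / denom  # shared per-block step size
    return params


# Toy usage: four parameters split into two blocks of two.
params = [1.0, -2.0, 0.5, 3.0]
grads = [0.1, -0.2, 0.3, 0.4]
blocks = [(0, 2), (2, 4)]
m = [0.0] * len(params)
v_block = [0.0] * len(blocks)

adam_mini_step(params, grads, m, v_block, blocks, step=1)
print(params)
print(v_block)
```

Note that `v_block` holds two numbers for four parameters, whereas Adam would hold four. Scaling the same idea so that each block spans many parameters is what shrinks the second-moment state.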
Peer Reviews
Decision: ICLR 2025 Poster
1. Adam-mini cuts memory use by almost half (up to 50%) vs AdamW, by smartly dividing parameters based on Hessian structure and using block-level learning rates; this gives major memory savings without losing much performance on big language models. 2. By reducing communication overhead between GPUs, Adam-mini boosts throughput by about 49.6% during Llama 2-7B training and lowers training time by 33%, which is really helpful in setups with limited resources. 3. Adam-mini provides a more effic
- This study mostly benchmarks on transformer models, leaving its effectiveness on other architectures like CNNs and RNNs underexplored; more benchmarking could either show its limitations or confirm its ability to adapt to other model types. - The influence of the Hessian-based learning rate grouping on different gradient structures wasn’t deeply investigated, and comparing it with fully adaptive methods would make its efficiency clearer. - Optimizer stability over long training dur
1. This paper studies an important problem and provides an interesting and novel solution. 2. The paper is well-written with clear motivation and good presentation flow. 3. Extensive experiments are conducted to justify Adam-mini's effectiveness.
1. Integrating Adam-mini into existing training frameworks seems non-trivial. For example, according to Algo 2, we need to specify the partitions manually for large models. Is there any solution to automatically get a good partition given general PyTorch models? 2. It would be good to add discussion/insights about how to adapt hyper-parameters when switching from AdamW to Adam-mini. 3. In addition to training curves, it would be better to have comparisons after model convergence. Additionally, i
The paper is well-written, and the proposed idea is novel. The approach, though simple, achieves a substantial reduction in memory usage, higher throughput, and shorter wall-clock time without compromising performance. Given the increasing adoption of LLMs and their high demands on memory resources, this method has the potential to be impactful in memory-constrained LLM applications.
The method proposed in the paper does not seem to address large-scale non-Transformer models. Although Section C.2 includes experiments on various non-LLM tasks, it would be helpful to discuss whether Adam-mini offers advantages for large-scale non-LLM applications or if AdamW is preferred in these cases. Moreover, given the success of the method on GPT-2 and Llama series and its potential impact, demonstrating the effectiveness of Adam-mini on more diverse models, such as BERT-like and vision
Taxonomy
Topics: Advanced Adaptive Filtering Techniques · Music and Audio Processing
Methods: LLaMA · Adaptive Moment Estimation - Mini · Adam · AdamW
