CAME: Confidence-guided Adaptive Memory Efficient Optimization
Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, Yang You

TL;DR
CAME is a new optimizer that combines fast convergence and low memory usage for training large language models, using a confidence-guided strategy to improve stability and performance.
Contribution
We introduce CAME, a memory-efficient adaptive optimizer with confidence-guided stabilization, achieving superior training speed and accuracy over existing methods.
Findings
CAME outperforms Adam in BERT pre-training with large batch sizes.
CAME demonstrates faster convergence and higher accuracy across NLP tasks.
The method maintains training stability with reduced memory overhead.
Abstract
Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent performance in the training of large language models. Nevertheless, the need for adaptivity requires maintaining second-moment estimates of the per-parameter gradients, which entails a high extra memory overhead. To solve this problem, several memory-efficient optimizers (e.g., Adafactor) have been proposed to obtain a drastic reduction in auxiliary memory usage, but with a performance penalty. In this paper, we first study a confidence-guided strategy to reduce the instability of existing memory-efficient optimizers. Based on this strategy, we propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods. Extensive experiments demonstrate the training stability and superior performance of CAME across various…
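The memory argument in the abstract can be made concrete with a small sketch. It is not CAME itself and not the authors' code. It only contrasts a full per-parameter second-moment buffer (as in Adam) with an Adafactor-style factorization that stores row and column statistics and rebuilds a rank-1 estimate from them. The function names are hypothetical, and the sketch works on a single matrix of squared gradients rather than the exponential moving averages a real optimizer would keep.

```python
def full_second_moment_size(rows, cols):
    """Number of values stored when every parameter keeps its own second moment."""
    return rows * cols


def factored_second_moment_size(rows, cols):
    """Number of values stored when only row and column statistics are kept."""
    return rows + cols


def factored_estimate(sq_grads):
    """Rank-1 reconstruction of a non-negative matrix from its row and column sums.

    Entry (i, j) is approximated as row_sum[i] * col_sum[j] / total,
    which is exact whenever the input matrix is itself rank-1.
    """
    row_sums = [sum(row) for row in sq_grads]
    col_sums = [sum(col) for col in zip(*sq_grads)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


# Illustrative layer shape (an assumption, not taken from the paper).
rows, cols = 1024, 4096
print(full_second_moment_size(rows, cols))      # 4194304
print(factored_second_moment_size(rows, cols))  # 5120

# A rank-1 matrix of squared gradients is recovered exactly.
print(factored_estimate([[1.0, 2.0], [2.0, 4.0]]))  # [[1.0, 2.0], [2.0, 4.0]]
```

For matrices that are not rank-1 the reconstruction is only an approximation, which is one way to picture the performance penalty the abstract attributes to memory-efficient optimizers.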
Code & Models
- 🤗 PJMixers-Images/Florence-2-base-Castollux-v0.5 · model · 505 dl · ♡ 5
- 🤗 PJMixers-Images/Florence-2-base-Castollux-v0.1 · model · 2 dl
- 🤗 PJMixers-Images/Florence-2-base-Castollux-v0.2 · model · 7 dl · ♡ 2
- 🤗 PJMixers-Images/Florence-2-base-Castollux-v0.4 · model · 6 dl · ♡ 1
- 🤗 PJMixers-Dev/Gemma-3-Earthen-Completion-v0.1-4B-QLoRA · model · 1 dl
- 🤗 PJMixers-Dev/Gemma-3-Earthen-Completion-v0.1-4B · model · 4 dl · ♡ 1
- 🤗 PJMixers-Dev/Gemma-3-Earthen-v0.1-4B-QLoRA · model · 1 dl
- 🤗 PJMixers-Dev/Gemma-3-Earthen-v0.1-4B · model · 1 dl
- 🤗 PJMixers-Dev/Gemma-3-Earthen-v0.2-4B-QLoRA · model · 3 dl · ♡ 1
- 🤗 PJMixers-Dev/Gemma-3-Earthen-v0.2-4B · model · 6 dl · ♡ 1
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Linear Layer · WordPiece · Weight Decay · Residual Connection · Softmax · Dense Connections
