CAME: Confidence-guided Adaptive Memory Efficient Optimization

Yang Luo; Xiaozhe Ren; Zangwei Zheng; Zhuo Jiang; Xin Jiang; Yang You

arXiv:2307.02047·cs.CL·August 8, 2023

TL;DR

CAME is a new optimizer that combines fast convergence and low memory usage for training large language models, using a confidence-guided strategy to improve stability and performance.

Contribution

We introduce CAME, a memory-efficient adaptive optimizer with confidence-guided stabilization, achieving superior training speed and accuracy over existing methods.

Findings

01 CAME outperforms Adam in BERT pre-training with large batch sizes.

02 CAME demonstrates faster convergence and higher accuracy across NLP tasks.

03 The method maintains training stability with reduced memory overhead.

Abstract

Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent performance in the training of large language models. Nevertheless, the need for adaptivity requires maintaining second-moment estimates of the per-parameter gradients, which incurs a high memory overhead. To solve this problem, several memory-efficient optimizers (e.g., Adafactor) have been proposed to obtain a drastic reduction in auxiliary memory usage, but with a performance penalty. In this paper, we first study a confidence-guided strategy to reduce the instability of existing memory-efficient optimizers. Based on this strategy, we propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods. Extensive experiments demonstrate the training stability and superior performance of CAME across various…
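
The memory trade-off described in the abstract can be illustrated with a small sketch. This is not CAME itself, and it is not a full Adafactor implementation: it only shows, for a single gradient matrix, how a per-parameter second-moment statistic (as kept by Adam-style methods) compares in size with a factored row/column statistic in the style of Adafactor. The function names (`full_second_moment`, `factored_second_moment`, `reconstruct`, `state_size`) are hypothetical, and the moving averages, epsilon terms, and update clipping used by real optimizers are omitted as a simplifying assumption.

```python
def full_second_moment(grad):
    """Adam-style statistic: one squared-gradient entry per parameter."""
    return [[g * g for g in row] for row in grad]


def factored_second_moment(grad):
    """Adafactor-style statistic (simplified, no moving average):
    keep only the row sums and column sums of the squared gradients."""
    squared = full_second_moment(grad)
    row_sums = [sum(row) for row in squared]
    col_sums = [sum(col) for col in zip(*squared)]
    return row_sums, col_sums


def reconstruct(row_sums, col_sums):
    """Rank-1 estimate of the full statistic: outer(rows, cols) / total."""
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


def state_size(n_rows, n_cols):
    """Number of stored second-moment values for an n_rows x n_cols weight."""
    return {"full": n_rows * n_cols, "factored": n_rows + n_cols}


if __name__ == "__main__":
    # Illustrative gradient only; the values carry no meaning.
    grad = [[1.0, 2.0], [2.0, 4.0]]
    rows, cols = factored_second_moment(grad)
    print("factored stats:", rows, cols)
    print("rank-1 estimate:", reconstruct(rows, cols))
    print("stored values for a 1024 x 4096 layer:", state_size(1024, 4096))
```

The factored form stores n + m values instead of n × m, which is the source of the memory saving. The rank-1 estimate is exact only when the squared-gradient matrix is itself rank 1; in general it is an approximation, which is consistent with the performance penalty the abstract attributes to memory-efficient optimizers.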

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Warmup With Linear Decay · Linear Layer · WordPiece · Weight Decay · Residual Connection · Softmax · Dense Connections