SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training

Tianjin Huang; Ziquan Zhu; Gaojie Jin; Lu Liu; Zhangyang Wang; Shiwei Liu

arXiv:2501.06842 · cs.LG · March 3, 2025

TL;DR

This paper introduces SPAM, a novel optimizer with momentum reset and spike-aware gradient clipping, significantly improving stability and efficiency in large language model training by mitigating gradient spikes.

Contribution

The paper proposes SPAM, a new optimizer that addresses gradient spikes in LLM training through momentum reset and spike-aware clipping, enhancing stability and resource efficiency.

Findings

01. SPAM outperforms Adam and variants in pre-training and fine-tuning tasks.

02. SPAM enables memory-efficient training with sparse momentum updates.

03. SPAM reduces training instability caused by gradient spikes.

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability stems from gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to $1000 \times$ larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes…
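The two mechanisms named in the abstract, momentum reset and spike-aware clipping, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`init_state`, `spam_like_step`), the spike test (squared gradient compared against a multiple of the running second moment), the default values of `spike_threshold` and `reset_interval`, and the choice to restart bias correction after each reset are all assumptions made for illustration.

```python
import math


def init_state(num_params):
    """Create empty optimizer state (hypothetical helper for this sketch)."""
    return {"step": 0, "since_reset": 0,
            "m": [0.0] * num_params, "v": [0.0] * num_params}


def spam_like_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, spike_threshold=5000.0, reset_interval=500):
    """One Adam-style update with spike clipping and periodic momentum reset.

    All hyperparameter names and defaults are illustrative assumptions.
    """
    # Momentum reset: drop both moment estimates every `reset_interval` steps
    # so that a past spike cannot linger in the running averages.
    if state["step"] % reset_interval == 0:
        state["m"] = [0.0] * len(params)
        state["v"] = [0.0] * len(params)
        state["since_reset"] = 0
    state["step"] += 1
    state["since_reset"] += 1
    t = state["since_reset"]  # bias correction restarts after each reset

    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        v_prev = state["v"][i]
        # Spike-aware clipping (assumed rule): an entry counts as a spike when
        # g^2 exceeds spike_threshold * v_prev; it is then scaled back to that
        # bound while keeping its sign.
        if v_prev > 0.0 and g * g > spike_threshold * v_prev:
            g = math.copysign(math.sqrt(spike_threshold * v_prev), g)
        # Standard Adam moment updates on the (possibly clipped) gradient.
        m = beta1 * state["m"][i] + (1.0 - beta1) * g
        v = beta2 * v_prev + (1.0 - beta2) * g * g
        state["m"][i], state["v"][i] = m, v
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params
```

In this sketch clipping is applied per entry rather than to the global gradient norm, so a single outlier coordinate is bounded without shrinking the rest of the update.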

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper’s analysis highlighting the prevalence and impact of gradient spikes in LLM training is insightful and demonstrates an important issue.
- The experiments show consistent improvements across different LLM sizes and benchmarks, suggesting the method’s robustness within these settings.
- The introduction of sparse momentum is a useful addition for reducing the memory overhead of training large models.

Weaknesses

- Although SPAM is compared to Adam and a few memory-efficient optimizers, it lacks comprehensive analysis against more recent memory-efficient methods. Furthermore, additional experiments, as outlined below, are necessary to strengthen the evaluation.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The integration of momentum reset and spike-aware gradient clipping into Adam is novel and addresses the persistent issue of gradient spikes in Large Language Model training.
2. The experiments are thorough and extensive, with evaluations spanning multiple LLM architectures and scales. The results clearly demonstrate SPAM's superior performance over the standard and memory-efficient baselines.
3. The approach is highly relevant, especially for large-scale training where stability and efficiency

Weaknesses

1. The paper mentions the efficient implementation of momentum reset and spike detection, but more detailed practical guidance or pseudocode might improve reproducibility.
2. While SPAM performs excellently across various Large Language Model sizes, additional experiments on tasks beyond LLM training, such as CV models or multi-task learning, would illustrate broader applicability.
3. The choice of the gradient spike threshold might affect performance to a great extent. More discussion on how

Reviewer 03 · Rating 6 · Confidence 4

Strengths

This paper addresses the loss spike problem from the perspective of gradient clipping and demonstrates the algorithm's validity through an ablation study on the hyper-parameters used in the algorithm, along with various performance improvements. Additionally, the paper proposes a memory-efficient algorithm using sparse momentum, aiming to solve both the loss spike issue and the out-of-memory problem simultaneously.

Weaknesses

1. Clipping gradients based on a threshold seems to lack novelty. It might be worthwhile to consider methods that prevent gradient spikes altogether.
2. In sparse momentum, a random mask is applied, setting certain gradients to zero. It would be helpful to explain in detail how this actually reduces memory usage. From an algorithmic perspective, it appears as though the entire matrix, including the zero elements, is still being stored.
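The memory question in the second point can be made concrete with a small sketch. It shows one way a masked momentum could save memory: the moment buffers are allocated only for the selected entries, in compact form, instead of as a full matrix with zeros. Whether the paper's implementation works this way is not stated in this excerpt; the helper names (`make_sparse_state`, `update_sparse_moments`) and the `density` parameter are hypothetical.

```python
import random


def make_sparse_state(num_params, density, seed=0):
    """Pick a random subset of parameter indices and allocate moments only
    for that subset (hypothetical helper; `density` is the kept fraction)."""
    rng = random.Random(seed)
    k = max(1, int(num_params * density))
    idx = sorted(rng.sample(range(num_params), k))
    # Buffers have length k, not num_params: this is where memory is saved.
    return {"idx": idx, "m": [0.0] * k, "v": [0.0] * k}


def update_sparse_moments(state, grads, beta1=0.9, beta2=0.999):
    """Update first and second moments for the selected entries only."""
    for slot, i in enumerate(state["idx"]):
        g = grads[i]
        state["m"][slot] = beta1 * state["m"][slot] + (1.0 - beta1) * g
        state["v"][slot] = beta2 * state["v"][slot] + (1.0 - beta2) * g * g


# Usage: 100 parameters, moments kept for 25 of them.
state = make_sparse_state(num_params=100, density=0.25)
update_sparse_moments(state, grads=[1.0] * 100)
```

Under this assumption the two moment buffers shrink in proportion to `density`, at the cost of storing the index list. If the mask were instead applied to a full-size buffer, the reviewer's concern would hold and no memory would be saved.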

Code & Models

Repositories

tianjinyellow/spam-optimizer · PyTorch · Official

Taxonomy

Methods · Adam