AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training
Zhiyuan Li, Yuan Wu, Yi Chang

TL;DR
AGGC is an adaptive, group-wise gradient clipping method that stabilizes large language model training by accounting for gradient heterogeneity across functional modules, rather than applying a single global norm threshold to all parameters. The authors report gains in both accuracy and training stability over traditional clipping.
Contribution
The paper proposes AGGC, an adaptive group-wise gradient clipping technique that partitions parameters into groups by functional type and regulates each group according to an Exponential Moving Average (EMA) of its own historical behavior, improving training stability.
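The partitioning step can be illustrated with a minimal sketch. This summary does not list the paper's actual functional groups, so the group names and the keyword table below (`GROUP_KEYWORDS`, `group_of`, `partition`) are hypothetical, loosely modeled on parameter names common in LLaMA-style checkpoints.

```python
from collections import defaultdict

# Hypothetical keyword -> group table. The real AGGC grouping rule is not
# given in this summary; these names are assumptions for illustration only.
GROUP_KEYWORDS = {
    "attention": ("q_proj", "k_proj", "v_proj", "o_proj"),
    "mlp": ("gate_proj", "up_proj", "down_proj"),
    "norm": ("norm",),
    "embedding": ("embed", "lm_head"),
}


def group_of(param_name):
    """Return the first functional group whose keyword occurs in the name."""
    for group, keywords in GROUP_KEYWORDS.items():
        if any(keyword in param_name for keyword in keywords):
            return group
    return "other"


def partition(param_names):
    """Map each group name to the list of parameter names assigned to it."""
    groups = defaultdict(list)
    for name in param_names:
        groups[group_of(name)].append(name)
    return dict(groups)


example = partition([
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.norm.weight",
    "model.embed_tokens.weight",
    "adapter.bias",
])
```

Once parameters are grouped this way, each group can carry its own clipping statistics instead of sharing one global norm.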
Findings
AGGC outperforms LoRA and often surpasses full fine-tuning.
On GSM8K, Mistral-7B with AGGC achieves 72.93% accuracy.
AGGC stabilizes Reinforcement Learning with Verifiable Rewards (RLVR) training and improves logical deduction in LLMs.
Abstract
To stabilize the training of Large Language Models (LLMs), gradient clipping is a nearly ubiquitous heuristic used to alleviate exploding gradients. However, traditional global norm clipping erroneously presupposes gradient homogeneity across different functional modules, leading to an adverse "spill-over" effect where volatile parameters force unnecessary scaling on stable ones. To overcome this, we propose Adaptive Group-wise Gradient Clipping (AGGC). AGGC partitions parameters into groups based on functional types and regulates each according to its historical behavior using an Exponential Moving Average (EMA). Specifically, it constructs an adaptive interval to simultaneously mitigate gradient explosion and vanishing, while employing a time-dependent scheduling mechanism to balance exploration and convergence. Experiments on LLaMA 2-7B, Mistral-7B, and Gemma-7B models show that AGGC…
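The mechanism described in the abstract (a per-group EMA, an adaptive interval with both an upper and a lower bound, and a time-dependent schedule) can be sketched roughly as follows. This is not the authors' implementation: the interval form `[low * ema, high * ema]`, the linear decay of the upper factor, the default constants, and the names `GroupClipper` and `clip` are all assumptions chosen to make the idea concrete. Gradients are plain Python lists so the sketch runs without any framework.

```python
import math


class GroupClipper:
    """Per-group gradient norm clipping against an EMA of past norms.

    Illustrative assumptions (not from the paper): the interval is
    [low * ema, high * ema], and `high` decays linearly over training.
    """

    def __init__(self, beta=0.9, low=0.5, high_start=3.0, high_end=1.5,
                 total_steps=1000):
        self.beta = beta
        self.low = low
        self.high_start = high_start
        self.high_end = high_end
        self.total_steps = total_steps
        self.ema = {}          # group name -> EMA of that group's gradient norm
        self.step_count = 0

    def _upper_factor(self):
        # Loose bound early (exploration), tighter bound late (convergence).
        t = min(self.step_count / self.total_steps, 1.0)
        return self.high_start + t * (self.high_end - self.high_start)

    def clip(self, grads):
        """grads: dict of group name -> list of floats. Returns rescaled copies."""
        high = self._upper_factor()
        clipped = {}
        for name, grad in grads.items():
            norm = math.sqrt(sum(x * x for x in grad))
            ema = self.ema.get(name)
            if ema is None or norm == 0.0:
                scale = 1.0  # no history yet, or nothing to rescale
            else:
                # Pull the norm into [low * ema, high * ema]: the upper bound
                # limits explosion, the lower bound counters vanishing.
                target = min(max(norm, self.low * ema), high * ema)
                scale = target / norm
            clipped[name] = [x * scale for x in grad]
            new_norm = norm * scale
            # Update history with the clipped norm so one spike cannot
            # inflate the group's future interval.
            if ema is None:
                self.ema[name] = new_norm
            else:
                self.ema[name] = self.beta * ema + (1 - self.beta) * new_norm
        self.step_count += 1
        return clipped


clipper = GroupClipper(beta=0.9, low=0.5, high_start=2.0, high_end=2.0,
                       total_steps=10)
first = clipper.clip({"attention": [3.0, 4.0], "mlp": [0.6, 0.8]})
# The attention group spikes tenfold while the mlp group stays stable.
second = clipper.clip({"attention": [30.0, 40.0], "mlp": [0.6, 0.8]})
# The attention group then nearly vanishes.
third = clipper.clip({"attention": [0.3, 0.4]})
```

In the second call only the spiking attention group is scaled down, and the stable mlp group passes through unchanged. Under global norm clipping the same spike would shrink both groups, which is the "spill-over" effect the abstract describes.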
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
