Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers
Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna

TL;DR
This paper introduces a progressive gradient flow method for training N:M structured-sparse transformers, improving model accuracy at high sparsity levels and model quality at equal training FLOPs.
Contribution
The authors propose a novel progressive gradient flow technique that mitigates gradient noise in high-sparsity training, outperforming existing methods in model quality and efficiency.
Findings
Up to 2% accuracy improvement in vision models at high sparsity.
Up to 5% accuracy improvement in language models at high sparsity.
Better performance at equal FLOPs compared to conventional sparse training methods.
Abstract
N:M structured sparsity has garnered significant interest as a result of its relatively modest overhead and improved efficiency. Additionally, this form of sparsity holds considerable appeal for reducing the memory footprint owing to its modest representation overhead. There have been efforts to develop training recipes for N:M structured sparsity; they primarily focus on low-sparsity regions (50%). Nonetheless, the performance of models trained using these approaches tends to decline when confronted with high-sparsity regions (80%). In this work, we study the effectiveness of existing sparse training recipes at high-sparsity regions and argue that these methods fail to sustain the model quality on par with low-sparsity regions. We demonstrate that the significant factor contributing to this disparity is the presence of elevated levels of induced noise in the gradient…
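For readers unfamiliar with the N:M pattern the abstract refers to, the following is a minimal sketch of a magnitude-based N:M mask. It assumes the common convention that N:M means keeping N nonzero weights in every group of M consecutive weights, so 2:4 gives 50% sparsity and 1:8 gives 87.5%. The function names `nm_mask` and `sparsity` are hypothetical and chosen for illustration; this is not the paper's training recipe, only the masking step that such recipes typically build on.

```python
import numpy as np

def nm_mask(weights, n, m):
    """Binary mask keeping the n largest-magnitude entries in each
    group of m consecutive weights (illustrative helper, not from the paper)."""
    w = np.asarray(weights, dtype=float)
    if w.size % m != 0:
        raise ValueError("number of weights must be divisible by m")
    groups = np.abs(w).reshape(-1, m)
    # Column indices of the n largest magnitudes within each group.
    keep = np.argsort(-groups, axis=1, kind="stable")[:, :n]
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return mask.reshape(w.shape)

def sparsity(n, m):
    """Fraction of weights zeroed by an N:M pattern (assumes N kept out of M)."""
    return 1.0 - n / m

w = np.array([0.1, -0.9, 0.4, 0.05, 0.7, -0.2, 0.3, -0.8])
mask_2_4 = nm_mask(w, 2, 4)   # low-sparsity regime: 50% of weights removed
mask_1_8 = nm_mask(w, 1, 8)   # high-sparsity regime: 87.5% of weights removed
sparse_w = w * mask_1_8       # only the single largest-magnitude weight survives
```

At 1:8 a single weight per group of eight carries the whole group's signal, which gives some intuition for why recipes tuned for 2:4 may not transfer to the high-sparsity regime the abstract discusses.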
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Adversarial Robustness in Machine Learning
