Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers

Abhimanyu Rajeshkumar Bambhaniya; Amir Yazdanbakhsh; Suvinay Subramanian; Sheng-Chun Kao; Shivani Agrawal; Utku Evci; Tushar Krishna

arXiv:2402.04744·cs.LG·February 8, 2024·2 cites

TL;DR

This paper introduces a progressive gradient flow method to improve the training of high-sparsity N:M structured sparse transformers, significantly enhancing model accuracy and efficiency at high sparsity levels.

Contribution

The authors propose a novel progressive gradient flow technique that mitigates gradient noise in high-sparsity training, outperforming existing methods in model quality and efficiency.

Findings

1. Up to 2% accuracy improvement in vision models at high sparsity.

2. Up to 5% accuracy improvement in language models at high sparsity.

3. Better performance at equal FLOPs compared to conventional sparse training methods.

Abstract

N:M structured sparsity has garnered significant interest because of its relatively modest overhead and improved efficiency. This form of sparsity is also appealing for reducing the memory footprint, owing to its modest representation overhead. Although there have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions (~50%). The performance of models trained using these approaches tends to decline in high-sparsity regions (>80%). In this work, we study the effectiveness of existing sparse training recipes in high-sparsity regions and argue that these methods fail to sustain model quality on par with low-sparsity regions. We demonstrate that the significant factor contributing to this disparity is the presence of elevated levels of induced noise in the gradient…
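
To make the sparsity pattern concrete, here is a minimal sketch of an N:M mask, assuming the common convention that N of every M consecutive weights are kept (the N largest by magnitude) and the rest are zeroed. The function names `nm_mask` and `sparsity` are hypothetical and are not taken from the paper's repository. The sketch illustrates only the sparsity pattern, not the progressive gradient flow training recipe.

```python
def nm_mask(weights, n, m):
    """Return a 0/1 mask keeping the n largest-magnitude entries in each group of m.

    Assumption: "N:M" means n of every m consecutive weights stay non-zero.
    """
    if len(weights) % m != 0:
        raise ValueError("length of weights must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest |w| in this group; sorted() is stable,
        # so ties are broken in favour of the earlier position.
        keep = sorted(range(m), key=lambda i: -abs(group[i]))[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def sparsity(n, m):
    """Fraction of weights that are zeroed under an n:m pattern."""
    return 1.0 - n / m


weights = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.7, 0.01]

mask_2_4 = nm_mask(weights, 2, 4)  # low-sparsity region: 50% zeros
mask_1_8 = nm_mask(weights, 1, 8)  # high-sparsity region: 87.5% zeros

print(mask_2_4, sparsity(2, 4))
print(mask_1_8, sparsity(1, 8))
```

Under this convention, a 2:4 pattern corresponds to the ~50% region mentioned in the abstract, while patterns such as 1:8 fall in the >80% region.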

Code & Models

Repositories

abhibambhaniya/progressive_gradient_flow_nm_sparsity
PyTorch · Official

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Adversarial Robustness in Machine Learning

Methods: Focus