NxMTransformer: Semi-Structured Sparsification for Natural Language Understanding via ADMM
Connor Holmes, Minjia Zhang, Yuxiong He, and Bo Wu

TL;DR
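This paper introduces NxMTransformer, a framework that uses the Alternating Direction Method of Multipliers (ADMM) to induce semi-structured NxM sparsity in pretrained Transformer models, so that models fine-tuned on NLP tasks keep their downstream accuracy while matching the sparsity pattern that dedicated hardware accelerates.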
Contribution
The paper proposes a new ADMM-based method for inducing NxM sparsity in pretrained models, addressing generalization issues and hardware constraints in NLP fine-tuning.
Findings
Achieves a GLUE score 1.7 points higher than current methods.
Effectively incorporates hardware constraints into sparsification.
Enhances fine-tuning accuracy with knowledge distillation.
Abstract
Natural Language Processing (NLP) has recently achieved success by using huge pre-trained Transformer networks. However, these models often contain hundreds of millions or even billions of parameters, bringing challenges to online deployment due to latency constraints. Recently, hardware manufacturers have introduced dedicated hardware for NxM sparsity to provide the flexibility of unstructured pruning with the runtime efficiency of structured approaches. NxM sparsity permits arbitrarily selecting M parameters to retain from a contiguous group of N in the dense representation. However, due to the extremely high complexity of pre-trained models, the standard sparse fine-tuning techniques often fail to generalize well on downstream tasks, which have limited data resources. To address such an issue in a principled manner, we introduce a new learning framework, called NxMTransformer, to…
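To make the NxM constraint described in the abstract concrete, the following is a minimal sketch of a mask that retains M weights from each contiguous group of N. It assumes the retained weights are chosen by largest magnitude, which is a common heuristic and not necessarily how NxMTransformer selects them; the names `nxm_mask`, `apply_mask`, `group_size`, and `keep` are hypothetical and used only for illustration.

```python
from typing import List


def nxm_mask(weights: List[float], group_size: int, keep: int) -> List[int]:
    """Return a 0/1 mask that keeps `keep` entries in every contiguous
    group of `group_size` weights.

    Assumption: entries are ranked by absolute value (magnitude pruning).
    This illustrates the sparsity pattern only, not the paper's ADMM method.
    """
    if not 0 < keep <= group_size:
        raise ValueError("need 0 < keep <= group_size")
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")

    mask = [0] * len(weights)
    for start in range(0, len(weights), group_size):
        group = range(start, start + group_size)
        # Rank positions in this group by magnitude, largest first.
        # sorted() is stable, so ties go to the earlier position.
        ranked = sorted(group, key=lambda i: abs(weights[i]), reverse=True)
        for i in ranked[:keep]:
            mask[i] = 1
    return mask


def apply_mask(weights: List[float], mask: List[int]) -> List[float]:
    """Zero out every weight whose mask entry is 0."""
    return [w if m else 0.0 for w, m in zip(weights, mask)]


# One row of a dense weight matrix: groups of N = 4, keeping M = 2 per group.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
mask = nxm_mask(row, group_size=4, keep=2)
sparse_row = apply_mask(row, mask)
```

Every group ends up with exactly the same number of nonzero entries, which is the regularity that lets hardware skip the pruned weights, while the positions of the survivors inside each group remain free.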
Taxonomy
Topics: Non-Destructive Testing Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Pruning · Linear Layer · Dropout · Label Smoothing · Layer Normalization · Alternating Direction Method of Multipliers · Dense Connections · Residual Connection · Adam
