DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization
Rahul Chand, Yashoteja Prabhu, Pratyush Kumar

TL;DR
DSFormer introduces a dense-sparse weight factorization for transformer models. Using semi-structured sparsity and a joint learning algorithm, it achieves better compression and accuracy retention than low-rank methods.
Contribution
The paper proposes DSFormer, a new weight factorization scheme with a joint learning algorithm, significantly improving transformer compression efficiency and accuracy over existing low-rank approaches.
Findings
Up to 40% better compression than low-rank factorizers.
Achieves up to 50% additional compression when combined with other methods.
Demonstrates strong efficiency-accuracy trade-offs on NLP benchmarks.
Abstract
With the tremendous success of large transformer models in natural language understanding, down-sizing them for cost-effective deployments has become critical. Recent studies have explored low-rank weight factorization techniques, which are efficient to train and apply out of the box to any transformer architecture. Unfortunately, the low-rank assumption tends to be over-restrictive and hinders the expressiveness of the compressed model. This paper proposes DSFormer, a simple alternative factorization scheme that expresses a target weight matrix as the product of a small dense matrix and a semi-structured sparse matrix. The resulting approximation is more faithful to the weight distribution in transformers and therefore achieves a stronger efficiency-accuracy trade-off. Another concern with existing factorizers is their dependence on a task-unaware initialization step which degrades the…
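To make the shape of the factorization concrete, here is a minimal NumPy sketch of a weight approximated as the product of a small dense matrix and a semi-structured sparse matrix. It is an illustration under stated assumptions, not the authors' method: the 2-of-4 sparsity pattern, the matrix sizes, the random values, and the helper name `nm_sparsify` are all hypothetical, and the paper's learning algorithm is not reproduced here.

```python
import numpy as np

def nm_sparsify(mat, n_keep, group):
    """Keep the n_keep largest-magnitude entries in every run of `group`
    consecutive entries along each row; zero the rest.
    (Hypothetical helper; the pattern used in the paper may differ.)"""
    rows, cols = mat.shape
    assert cols % group == 0, "column count must be divisible by group size"
    grouped = mat.reshape(rows, cols // group, group)
    # Ascending sort by magnitude: the first (group - n_keep) indices are dropped.
    order = np.argsort(np.abs(grouped), axis=-1)
    mask = np.ones_like(grouped, dtype=bool)
    np.put_along_axis(mask, order[..., : group - n_keep], False, axis=-1)
    return (grouped * mask).reshape(rows, cols)

rng = np.random.default_rng(0)
d_out, d_in, k = 64, 64, 16                      # assumed sizes, k << d_out

dense = rng.standard_normal((d_out, k))          # small dense factor
sparse = nm_sparsify(rng.standard_normal((k, d_in)), n_keep=2, group=4)

w_approx = dense @ sparse                        # stands in for a d_out x d_in weight

dense_params = dense.size                        # values stored for the dense factor
sparse_nonzeros = int(np.count_nonzero(sparse))  # values stored for the sparse factor
full_params = d_out * d_in                       # values in the uncompressed weight
```

In this toy setting the two factors store `dense_params + sparse_nonzeros` values in place of `full_params`. The count ignores the index metadata a real sparse format would also need, so it should not be read as a compression ratio from the paper.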
Taxonomy
Topics: Topic Modeling · Tensor decomposition and applications · Multimodal Machine Learning Applications
Methods: Knowledge Distillation
