DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization

Rahul Chand; Yashoteja Prabhu; Pratyush Kumar

arXiv:2312.13211 · cs.CL · December 21, 2023 · 1 cites

TL;DR

DSFormer introduces a novel dense-sparse weight factorization for transformer models, achieving superior compression and accuracy retention compared to low-rank methods, through a joint learning algorithm and semi-structured sparsity.

Contribution

The paper proposes DSFormer, a new weight factorization scheme with a joint learning algorithm, significantly improving transformer compression efficiency and accuracy over existing low-rank approaches.

Findings

01. Up to 40% better compression than low-rank factorizers.

02. Achieves up to 50% additional compression when combined with other methods.

03. Demonstrates strong efficiency-accuracy trade-offs on NLP benchmarks.

Abstract

With the tremendous success of large transformer models in natural language understanding, down-sizing them for cost-effective deployment has become critical. Recent studies have explored low-rank weight factorization techniques, which are efficient to train and apply out of the box to any transformer architecture. Unfortunately, the low-rank assumption tends to be over-restrictive and hinders the expressiveness of the compressed model. This paper proposes DSFormer, a simple alternative factorization scheme that expresses a target weight matrix as the product of a small dense matrix and a semi-structured sparse matrix. The resulting approximation is more faithful to the weight distribution in transformers and therefore achieves a stronger efficiency-accuracy trade-off. Another concern with existing factorizers is their dependence on a task-unaware initialization step which degrades the…
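The shape of the factorization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that "semi-structured sparse" means a fixed number of non-zeros in every column of the sparse factor, and the names `make_semi_structured_sparse`, `D`, `S`, `k`, and `nnz_per_col`, as well as all sizes, are hypothetical choices made for illustration. The factors here are random; the paper's learning algorithm is not reproduced.

```python
import numpy as np


def make_semi_structured_sparse(k, n, nnz_per_col, rng):
    """Return a (k, n) matrix with exactly nnz_per_col non-zeros per column.

    Assumption: a fixed per-column non-zero count stands in for the
    "semi-structured" sparsity pattern mentioned in the abstract.
    """
    S = np.zeros((k, n))
    for j in range(n):
        # Pick which rows of column j are allowed to be non-zero.
        rows = rng.choice(k, size=nnz_per_col, replace=False)
        S[rows, j] = rng.standard_normal(nnz_per_col)
    return S


rng = np.random.default_rng(0)
m, n = 64, 256           # shape of the target weight matrix (illustrative)
k, nnz_per_col = 32, 4   # width of the dense factor, non-zeros per column

D = rng.standard_normal((m, k))                        # small dense factor
S = make_semi_structured_sparse(k, n, nnz_per_col, rng)  # sparse factor
W_approx = D @ S                                       # (m, n) approximation

# Stored values only; index storage for the sparse factor is ignored here.
dense_params = m * n
ds_params = m * k + n * nnz_per_col
print(f"dense: {dense_params} values, dense-sparse: {ds_params} values")
```

In this sketch each column of `W_approx` is a combination of only `nnz_per_col` columns of `D`, and different columns may select different subsets, so the product is not limited to a single shared low-dimensional subspace in the way a rank-r product with the same storage budget would be.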

Taxonomy

Topics: Topic Modeling · Tensor decomposition and applications · Multimodal Machine Learning Applications

Methods: Knowledge Distillation