Low-Rank Prune-And-Factorize for Language Model Compression
Siyu Ren, Kenny Q. Zhu

TL;DR
This paper introduces a novel approach combining pruning and matrix factorization to effectively compress large language models by exploiting low-rank sparsity patterns, resulting in better performance at high compression rates.
Contribution
It identifies the full-rankness bottleneck in PLMs and proposes sparsity-aware SVD and mixed-rank fine-tuning to improve model compression.
Findings
Outperforms existing methods in the compression-performance trade-off.
Low-rank sparsity patterns are found only in models produced by first-order pruning.
Proposed techniques enhance initialization and training for better compression.
Abstract
The components underpinning PLMs, large weight matrices, were shown to bear considerable redundancy. Matrix factorization, a well-established technique from matrix theory, has been utilized to reduce the number of parameters in PLMs. However, it fails to retain satisfactory performance under moderate to high compression rates. In this paper, we identify the full-rankness of fine-tuned PLMs as the fundamental bottleneck for the failure of matrix factorization and explore the use of network pruning to extract the low-rank sparsity pattern desirable to matrix factorization. We find that such a low-rank sparsity pattern exists exclusively in models generated by first-order pruning, which motivates us to unite the two approaches and achieve more effective model compression. We further propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning, which improve the initialization and…
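To make the full-rankness bottleneck concrete, here is a minimal sketch of plain truncated-SVD factorization in NumPy. It is not the paper's sparsity-aware SVD or its mixed-rank fine-tuning. The helper names (factorize, param_ratio, rel_error) and the toy matrices are assumptions chosen for illustration only. Replacing an m x n weight matrix with factors of rank r stores r(m + n) parameters instead of mn, and the approximation is accurate only when the matrix is close to low-rank.

```python
import numpy as np


def factorize(weight, rank):
    """Return (a, b) such that a @ b is the best rank-`rank` approximation of `weight`."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (m, rank); singular values folded into the left factor
    b = vt[:rank, :]            # shape (rank, n)
    return a, b


def param_ratio(m, n, rank):
    """Fraction of parameters kept when an m x n matrix is stored as two rank-`rank` factors."""
    return rank * (m + n) / (m * n)


def rel_error(weight, rank):
    """Relative Frobenius-norm error of the rank-`rank` reconstruction."""
    a, b = factorize(weight, rank)
    return np.linalg.norm(weight - a @ b) / np.linalg.norm(weight)


# Toy matrices (hypothetical, not taken from any model):
rng = np.random.default_rng(0)
m, n, true_rank = 64, 48, 4
low_rank = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
full_rank = rng.standard_normal((m, n))

print("low-rank matrix, relative error at rank 4:", rel_error(low_rank, true_rank))
print("full-rank matrix, relative error at rank 4:", rel_error(full_rank, true_rank))
print("parameter ratio for 768 x 768 at rank 64:", param_ratio(768, 768, 64))
```

In this toy setting the exactly low-rank matrix is reconstructed almost perfectly, while the dense Gaussian matrix loses most of its energy at the same rank. This mirrors the abstract's argument that factorization needs a low-rank structure to work well.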
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques
Methods: Pruning
