Rank and run-time aware compression of NLP Applications
Urmish Thakker, Jesse Beu, Dibakar Gope, Ganesh Dasika, Matthew Mattina

TL;DR
This paper introduces Hybrid Matrix Factorization (HMF), a compression method for NLP models that targets small devices. It combines a low-rank factorization with preserved dense matrices to improve accuracy over low-rank factorization alone and to run inference faster than pruning.
Contribution
The paper presents Hybrid Matrix Factorization, a novel compression technique that enhances low-rank matrix factorization with a hybrid structure for better accuracy and faster inference.
Findings
HMF achieves over 2.32x faster inference than pruning.
HMF provides 16.77% better accuracy than low-rank matrix factorization.
HMF maintains accuracy across the 5 NLP benchmarks evaluated, which span multiple tasks.
Abstract
Sequence-model-based NLP applications can be large. Yet many applications that benefit from them run on small devices with very limited compute and storage capabilities, while still having run-time constraints. As a result, there is a need for a compression technique that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper proposes a new compression technique called Hybrid Matrix Factorization (HMF) that achieves this dual objective. HMF improves on low-rank matrix factorization (LMF) techniques by doubling the rank of the matrix using an intelligent hybrid structure, leading to better accuracy than LMF. Further, by preserving dense matrices, it leads to faster inference run-time than pruning or structured-matrix-based compression techniques. We evaluate the impact of this technique on 5 NLP benchmarks across multiple tasks…
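The abstract does not spell out the hybrid structure, so the following is only a minimal sketch under one assumed layout: the first k rows of a weight matrix are kept as an unconstrained dense block, and the remaining rows are replaced by a rank-r product U @ V. The function names, the layout, and the sizes (m, n, k, r) are illustrative assumptions, not the authors' implementation. The sketch shows two properties the abstract alludes to: the stacked matrix can reach rank k + r, and a matrix-vector product needs only dense multiplications.

```python
import numpy as np


def lmf_params(m, n, r):
    """Parameters of a plain rank-r factorization: U is m x r, V is r x n."""
    return r * (m + n)


def hmf_params(m, n, k, r):
    """Parameters of the ASSUMED hybrid layout: k dense rows (k x n)
    plus a rank-r factorization of the remaining (m - k) rows."""
    return k * n + r * ((m - k) + n)


def hmf_matvec(dense_rows, U, V, x):
    """Compute W @ x without materializing W, using only dense products."""
    top = dense_rows @ x       # shape (k,)
    bottom = U @ (V @ x)       # shape (m - k,)
    return np.concatenate([top, bottom])


# Illustrative sizes only (hypothetical, not taken from the paper).
rng = np.random.default_rng(0)
m, n, k, r = 64, 64, 8, 8

dense_rows = rng.standard_normal((k, n))   # unconstrained dense block
U = rng.standard_normal((m - k, r))        # low-rank factors for the rest
V = rng.standard_normal((r, n))

# Materialize the full matrix only to inspect its rank and check the matvec.
W_hybrid = np.vstack([dense_rows, U @ V])
x = rng.standard_normal(n)

hybrid_rank = np.linalg.matrix_rank(W_hybrid)
y_fast = hmf_matvec(dense_rows, U, V, x)
y_ref = W_hybrid @ x

print("rank of hybrid matrix:", hybrid_rank)
print("hybrid parameters:", hmf_params(m, n, k, r))
print("LMF parameters at the same rank:", lmf_params(m, n, k + r))
```

Under this assumed layout, the hybrid matrix reaches rank k + r with fewer parameters than a plain factorization of the same rank, because the dense rows cost k * n parameters rather than k * (m + n). Both pieces are ordinary dense products, so no sparse indexing is involved at inference time.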
Taxonomy
Methods: Pruning
