Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models
Jialin Zhao, Yingtao Zhang, Carlo Vittorio Cannistraci

TL;DR
This paper introduces Pivoting Factorization (PIFA), a lossless meta low-rank representation that compresses any low-rank layer further, reducing memory use and speeding up GPU inference in large language models relative to standard low-rank layers.
Contribution
We propose PIFA, a novel lossless meta low-rank representation that effectively reduces redundancy and improves inference speed in large language models.
Findings
PIFA achieves 24.2% additional memory savings over low-rank layers at rank = 50% of dimension.
PIFA provides 24.6% faster inference than low-rank layers at rank = 50% of dimension.
MPIFA outperforms existing low-rank pruning methods.
Abstract
The rapid growth of Large Language Models has driven demand for effective model compression techniques to reduce memory and computation costs. Low-rank pruning has gained attention for its GPU compatibility across all densities. However, low-rank pruning struggles to match the performance of semi-structured pruning, often doubling perplexity at similar densities. In this paper, we propose Pivoting Factorization (PIFA), a novel lossless meta low-rank representation that learns, without supervision, a compact form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses non-pivot rows as linear combinations of them, achieving 24.2% additional memory savings and 24.6% faster inference compared with low-rank layers at rank = 50% of dimension. To mitigate the performance degradation caused by low-rank pruning, we introduce a…
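The pivot-row idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `pivoting_factorization` and `pifa_forward`, the greedy pivot selection (pick the row with the largest residual norm, then project it out), and the least-squares solve for the coefficients are all assumptions made for illustration. The sketch takes a rank-r matrix W (m x n), keeps r pivot rows W_p, and stores a coefficient matrix C such that the non-pivot rows equal C @ W_p.

```python
import numpy as np

def pivoting_factorization(W, r):
    """Split a rank-r matrix W (m x n) into r pivot rows and coefficients.

    Hypothetical helper: greedy pivot selection is an assumption, not
    necessarily the selection rule used in the paper.
    """
    residual = W.astype(np.float64).copy()
    pivots = []
    for _ in range(r):
        norms = np.linalg.norm(residual, axis=1)
        i = int(np.argmax(norms))          # row least explained so far
        pivots.append(i)
        q = residual[i] / norms[i]
        residual -= np.outer(residual @ q, q)  # project out that direction
    pivots = np.array(pivots)
    mask = np.ones(W.shape[0], dtype=bool)
    mask[pivots] = False
    non_pivots = np.flatnonzero(mask)
    W_p = W[pivots]                        # r x n pivot rows
    # Solve C @ W_p = W[non_pivots] in the least-squares sense.
    C = np.linalg.lstsq(W_p.T, W[non_pivots].T, rcond=None)[0].T
    return pivots, non_pivots, W_p, C

def pifa_forward(x, pivots, non_pivots, W_p, C, m):
    """Compute y = W @ x using only W_p and C (x has shape n x batch)."""
    y = np.empty((m,) + x.shape[1:])
    y_p = W_p @ x                          # outputs of the pivot rows
    y[pivots] = y_p
    y[non_pivots] = C @ y_p                # non-pivot outputs reuse y_p
    return y

# Toy example: square layer with rank = 50% of the dimension.
rng = np.random.default_rng(0)
m = n = 64
r = 32
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
x = rng.standard_normal((n, 4))

pivots, non_pivots, W_p, C = pivoting_factorization(W, r)
y = pifa_forward(x, pivots, non_pivots, W_p, C, m)

low_rank_params = r * (m + n)              # U (m x r) plus V (r x n)
pifa_params = r * n + (m - r) * r          # W_p (r x n) plus C ((m-r) x r)
saving = 1 - pifa_params / low_rank_params
print(low_rank_params, pifa_params, saving)
```

In this toy setting the low-rank pair stores 4096 values and the pivot form stores 3072, a 25% reduction, while `y` matches `W @ x` up to floating-point error. The count ignores storage for the pivot indices and any other overhead, so it is only a rough counterpart to the 24.2% figure reported above.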
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need · Pruning
