Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models

Jialin Zhao; Yingtao Zhang; Carlo Vittorio Cannistraci

arXiv:2501.19090 · cs.LG · August 14, 2025

TL;DR

This paper introduces Pivoting Factorization (PIFA), a lossless meta low-rank representation that stores a low-rank layer more compactly. At rank = 50% of dimension, it yields 24.2% additional memory savings and 24.6% faster inference over low-rank layers.

Contribution

We propose PIFA, a novel lossless meta low-rank representation that effectively reduces redundancy and improves inference speed in large language models.

Findings

01. PIFA achieves 24.2% additional memory savings over low-rank layers at rank = 50% of dimension.

02. PIFA provides 24.6% faster inference over low-rank layers at rank = 50% of dimension.

03. MPIFA outperforms existing low-rank pruning methods.

Abstract

The rapid growth of Large Language Models has driven demand for effective model compression techniques to reduce memory and computation costs. Low-rank pruning has gained attention for its GPU compatibility across all densities. However, low-rank pruning struggles to match the performance of semi-structured pruning, often doubling perplexity at similar densities. In this paper, we propose Pivoting Factorization (PIFA), a novel lossless meta low-rank representation that learns, without supervision, a compact form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses non-pivot rows as linear combinations of them, achieving 24.2% additional memory savings and 24.6% faster inference over low-rank layers at rank = 50% of dimension. To mitigate the performance degradation caused by low-rank pruning, we introduce a…
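The pivot-row idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`select_pivot_rows`, `pifa_factorize`, `pifa_apply`) are hypothetical, and the greedy pivot selection and least-squares solve are assumptions chosen for readability. It only shows that a rank-r matrix can be rebuilt exactly (up to floating-point error) from r of its rows plus a coefficient matrix.

```python
import numpy as np

def select_pivot_rows(W, r):
    """Greedily pick r linearly independent rows (assumed selection rule)."""
    R = np.array(W, dtype=float)  # working copy; rows are deflated in place
    pivots = []
    for _ in range(r):
        norms = np.einsum("ij,ij->i", R, R)   # squared norm of each residual row
        i = int(np.argmax(norms))             # row with the largest residual
        pivots.append(i)
        q = R[i] / np.sqrt(norms[i])          # unit vector along that row
        R -= np.outer(R @ q, q)               # remove this direction from all rows
    return np.array(pivots)

def pifa_factorize(W, r):
    """Return pivot indices, non-pivot indices, pivot rows Wp, coefficients C."""
    piv = select_pivot_rows(W, r)
    mask = np.ones(W.shape[0], dtype=bool)
    mask[piv] = False
    nonpiv = np.flatnonzero(mask)
    Wp = W[piv]                                # shape (r, n)
    # Solve C @ Wp = W[nonpiv] in the least-squares sense; C has shape (m - r, r).
    C = np.linalg.lstsq(Wp.T, W[nonpiv].T, rcond=None)[0].T
    return piv, nonpiv, Wp, C

def pifa_apply(x, piv, nonpiv, Wp, C):
    """Compute W @ x using only the pivot rows and the coefficients."""
    yp = Wp @ x                                # outputs for the pivot rows
    y = np.empty((len(piv) + len(nonpiv),) + x.shape[1:])
    y[piv] = yp
    y[nonpiv] = C @ yp                         # non-pivot outputs reuse yp
    return y

# Small demo on a random rank-4 matrix (illustrative sizes only).
rng = np.random.default_rng(0)
m, n, r = 8, 8, 4
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
piv, nonpiv, Wp, C = pifa_factorize(W, r)
x = rng.standard_normal(n)
y = pifa_apply(x, piv, nonpiv, Wp, C)

lowrank_params = r * (m + n)      # two factors of shapes (m, r) and (r, n): 64
pifa_params = Wp.size + C.size    # r*n + (m - r)*r: 48
```

In this sketch the stored parameter count drops from r(m + n) to r(m + n) - r², which for a square layer at rank = 50% of dimension is 25% fewer values, in the same range as the 24.2% reported above. The count is idealized and ignores the storage needed for the pivot indices.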

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · Pruning