TL;DR
This paper introduces SSLC, a method that combines sparse and low-rank techniques to compress large language models. It reports compressing Qwen2.5 by 50% with no performance drop while achieving at least a 1.63× speedup.
Contribution
The paper formulates sparse optimization and low-rank approximation as a single unified optimization problem and reports better results on large language models than either technique applied on its own.
Findings
Qwen2.5 compressed by 50% with no performance drop
Achieves at least 1.63× speedup
Outperforms existing standalone compression methods
Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency in language comprehension and generation; however, their widespread adoption is constrained by substantial bandwidth and computational demands. While pruning and low-rank approximation have each demonstrated promising performance individually, their synergy for LLMs remains underexplored. We introduce Synergistic Sparse and Low-Rank Compression (SSLC) methods for LLMs, which leverage the strengths of both techniques: low-rank approximation compresses the model by retaining its essential structure with minimal information loss, whereas sparse optimization eliminates non-essential weights, preserving those crucial for generalization. Based on theoretical analysis, we first formulate the low-rank approximation and sparse optimization as a unified problem and solve it by…
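The abstract is cut off before it names the solver, so the following is only a generic illustration of the idea of splitting a weight matrix into a low-rank part plus a sparse part, not SSLC's actual algorithm. It assumes a simple alternating scheme (truncated SVD for the low-rank term, magnitude thresholding for the sparse term). The function names `low_rank_part`, `sparse_part`, and `sparse_plus_low_rank`, and the toy matrix, are hypothetical.

```python
import numpy as np


def low_rank_part(M, rank):
    """Best rank-`rank` approximation of M in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


def sparse_part(M, keep):
    """Keep the `keep` largest-magnitude entries of M and zero out the rest."""
    S = np.zeros_like(M)
    if keep > 0:
        idx = np.argpartition(np.abs(M).ravel(), -keep)[-keep:]
        S.flat[idx] = M.flat[idx]
    return S


def sparse_plus_low_rank(W, rank, keep, iters=20):
    """Alternately fit W ~ L + S with rank(L) <= rank and nnz(S) <= keep.

    Assumed generic scheme for illustration only; each half-step is the exact
    minimizer for its own block, so the residual norm never increases.
    """
    S = np.zeros_like(W)
    for _ in range(iters):
        L = low_rank_part(W - S, rank)   # fit low-rank term to what S leaves
        S = sparse_part(W - L, keep)     # fit sparse term to what L leaves
    return L, S


# Toy "weight matrix": rank-2 structure plus 20 large outlier entries.
rng = np.random.default_rng(0)
W = rng.standard_normal((32, 2)) @ rng.standard_normal((2, 32))
spikes = rng.choice(W.size, size=20, replace=False)
W.flat[spikes] += 10.0 * rng.standard_normal(20)

L, S = sparse_plus_low_rank(W, rank=2, keep=20)
err_joint = np.linalg.norm(W - L - S)
err_low_rank = np.linalg.norm(W - low_rank_part(W, 2))

# Storage if L is kept as two factors and S as a list of nonzeros.
m, n = W.shape
params_compressed = 2 * (m + n) + 20
params_dense = m * n
```

On a matrix like this, a purely low-rank fit has to absorb the outliers and a purely sparse fit has to spend its budget on the dense structure, which is the intuition behind combining the two. In this sketch `err_joint` can be compared with `err_low_rank` to see the effect of adding the sparse term at the same rank.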