SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
Ziwei Li, Yuang Ma, Yi Kang

TL;DR
SLaB is a decomposition framework for large language models that splits each linear layer's weights into sparse, low-rank, and binary components, enabling efficient compression without retraining.
Contribution
It introduces a retraining-free decomposition method, guided by activation-aware pruning scores, that maintains performance at high compression ratios.
Findings
Reduces perplexity by up to 36% compared to existing methods at 50% compression.
Improves zero-shot task accuracy by up to 8.98% over the baseline.
Outperforms existing compression methods on Llama models.
Abstract
The rapid growth of large language models (LLMs) presents significant deployment challenges due to their massive computational and memory demands. While model compression techniques such as network pruning offer potential solutions, most existing methods fail to maintain good performance at high compression ratios. To address this, we propose SLaB, a novel framework that decomposes each linear layer weight into three complementary components: a sparse matrix, a low-rank matrix, and a binary matrix. SLaB eliminates the need for retraining and leverages activation-aware pruning scores to guide the decomposition process. Experiments on Llama-family models demonstrate that SLaB achieves state-of-the-art performance, reducing perplexity by up to 36% compared to existing methods at 50% compression and improving accuracy by up to 8.98% over the baseline on zero-shot tasks.
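To make the three-part decomposition concrete, here is a minimal NumPy sketch of splitting a weight matrix W into a sparse matrix S, a low-rank matrix L, and a scaled binary matrix alpha * B. The abstract does not specify the algorithm, so everything below is an assumption for illustration rather than the authors' method: the function name `slab_like_decompose`, the order of the steps (truncated SVD, then sign binarization, then pruning), the per-row scale `alpha`, and the activation-aware score |residual| times the input-feature norm of calibration activations `X` (a common scoring form in pruning work) are all hypothetical choices.

```python
import numpy as np

def slab_like_decompose(W, X, sparsity=0.5, rank=4):
    """Illustrative split W ~= S + L + alpha * B (NOT the paper's algorithm).

    W: (out_features, in_features) weight matrix.
    X: (n_samples, in_features) calibration activations (assumed input).
    """
    # Low-rank part: best rank-`rank` approximation via truncated SVD.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    R = W - L

    # Binary part: sign matrix with a per-row scale (mean absolute residual).
    B = np.where(R >= 0, 1.0, -1.0)
    alpha = np.abs(R).mean(axis=1, keepdims=True)
    R2 = R - alpha * B

    # Sparse part: in each row, keep the residual entries with the highest
    # activation-aware score |residual| * ||X[:, j]||_2 (assumed score form).
    col_norm = np.linalg.norm(X, axis=0)
    score = np.abs(R2) * col_norm
    keep = int(round((1.0 - sparsity) * W.shape[1]))
    idx = np.argsort(-score, axis=1)[:, :keep]
    mask = np.zeros(W.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    S = np.where(mask, R2, 0.0)
    return S, L, alpha, B

# Usage on random data (shapes are arbitrary, chosen only for the demo).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
X = rng.standard_normal((32, 16))
S, L, alpha, B = slab_like_decompose(W, X, sparsity=0.5, rank=2)
W_hat = S + L + alpha * B
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

In this sketch the `sparsity` argument is simply the fraction of zeroed entries in S; it is not meant to correspond to the paper's "50% compression" figure, whose exact accounting of the three components is not given in the abstract.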
