SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models

Ziwei Li; Yuang Ma; Yi Kang

arXiv:2604.04493·cs.LG·April 7, 2026


TL;DR

SLaB is a decomposition framework for large language models that splits each linear layer weight into sparse, low-rank, and binary components, enabling efficient compression without retraining.

Contribution

SLaB introduces a retraining-free decomposition method that maintains performance at high compression ratios and uses activation-aware pruning scores to guide the decomposition.

Findings

01

Reduces perplexity by up to 36% compared to existing methods at 50% compression.

02

Improves zero-shot task accuracy by up to 8.98% over the baseline.

03

Outperforms existing compression methods on Llama models.

Abstract

The rapid growth of large language models (LLMs) presents significant deployment challenges due to their massive computational and memory demands. While model compression techniques such as network pruning offer potential solutions, most existing methods fail to maintain good performance at high compression ratios. To address this, we propose SLaB, a novel framework that decomposes each linear layer weight into three complementary components: a sparse matrix, a low-rank matrix, and a binary matrix. SLaB eliminates the need for retraining and leverages activation-aware pruning scores to guide the decomposition process. Experiments on Llama-family models demonstrate that SLaB achieves state-of-the-art performance, reducing perplexity by up to 36% compared to existing methods at 50% compression and improving accuracy by up to 8.98% over the baseline on zero-shot tasks.
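
To make the three-component idea concrete, here is a minimal NumPy sketch of splitting a weight matrix W into a sparse part S, a low-rank part L, and a binary part B, so that W ≈ S + L + B. It is an illustration only, not the authors' algorithm: the abstract does not specify how the components are fitted, so the greedy order (binary, then low-rank, then sparse), the per-row scale on the sign matrix, the truncated SVD, and the pruning score |residual| times the input-feature activation norm are all assumptions. The function name `slab_like_decompose` and the parameters `sparsity` and `rank` are hypothetical.

```python
import numpy as np

def slab_like_decompose(W, X, sparsity=0.5, rank=4):
    """Illustrative greedy split W ~ S + L + B (assumed procedure, not the paper's).

    W: (out_features, in_features) weight matrix.
    X: (n_samples, in_features) calibration activations.
    """
    # Binary part (assumption): sign matrix scaled by each row's mean |w|.
    alpha = np.abs(W).mean(axis=1, keepdims=True)
    B = alpha * np.sign(W)
    residual = W - B

    # Low-rank part (assumption): truncated SVD of what the binary part missed.
    U, s, Vt = np.linalg.svd(residual, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    residual = residual - L

    # Sparse part (assumption): keep the residual entries with the highest
    # activation-aware score, |residual_ij| * ||X[:, j]||_2.
    score = np.abs(residual) * np.linalg.norm(X, axis=0)
    keep = int(round((1.0 - sparsity) * residual.size))
    mask = np.zeros(residual.size, dtype=bool)
    if keep > 0:
        mask[np.argpartition(score.ravel(), -keep)[-keep:]] = True
    S = residual * mask.reshape(residual.shape)
    return S, L, B

# Usage on random data (shapes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
X = rng.standard_normal((64, 32))
S, L, B = slab_like_decompose(W, X, sparsity=0.5, rank=4)
rel_err = np.linalg.norm(W - (S + L + B)) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.3f}")
```

In this sketch each stage only fits the residual left by the previous one, so adding the low-rank and sparse parts can never increase the Frobenius reconstruction error relative to the binary part alone. How `sparsity` and `rank` would map onto the 50% compression figure reported above is not stated in the abstract and is left open here.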
