LittleBit: Ultra Low-Bit Quantization via Latent Factorization
Banseok Lee, Dongkyu Kim, Youngcheon You, Youngmin Kim

TL;DR
LittleBit introduces an ultra low-bit quantization framework for large language models, achieving significant compression and speedup while maintaining high performance through latent factorization and compensation mechanisms.
Contribution
The paper presents a novel quantization method targeting 0.1 bits per weight, combining latent factorization with multi-scale compensation and new training techniques for extreme model compression.
Findings
Achieves 31x memory reduction, compressing Llama2-13B to under 0.9 GB.
At 0.1 BPW on Llama2-7B, outperforms existing methods operating at 0.7 BPW.
Unlocks 11.6x inference speedup over FP16.
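The memory figures above can be sanity-checked with a little arithmetic. The sketch below assumes a nominal 13 billion parameters for Llama2-13B and 2 bytes per parameter in FP16; both are round-number assumptions, not values taken from the paper, and the function names are illustrative.

```python
def fp16_size_gb(num_params):
    """Size in GB (10^9 bytes) of a model stored at 2 bytes per parameter."""
    return num_params * 2 / 1e9


def compressed_size_gb(num_params, reduction_factor):
    """Size in GB after dividing the FP16 footprint by a reduction factor."""
    return fp16_size_gb(num_params) / reduction_factor


# Assumption: nominal parameter count for Llama2-13B.
NUM_PARAMS = 13e9

baseline = fp16_size_gb(NUM_PARAMS)              # about 26 GB
compressed = compressed_size_gb(NUM_PARAMS, 31)  # about 0.84 GB
print(f"FP16: {baseline:.1f} GB, after 31x reduction: {compressed:.2f} GB")
```

Under these assumptions the 31x reduction gives roughly 0.84 GB, which is consistent with the reported "under 0.9 GB".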
Abstract
The deployment of large language models (LLMs) is frequently hindered by prohibitive memory and computational requirements. While quantization mitigates these bottlenecks, maintaining model fidelity in the sub-1-bit regime remains a persistent challenge. In this paper, we introduce LittleBit, a novel framework for extreme LLM compression. We target quantization rates as low as 0.1 bits per weight (BPW), achieving a memory reduction of approximately 31x, which effectively compresses Llama2-13B to under 0.9 GB. We represent weights via low-rank latent matrix factorization and subsequently binarize the resulting factors. To counteract the information loss inherent to such drastic precision reduction, we integrate a multi-scale compensation mechanism that learns importance parameters across row, column, and latent dimensions. Two primary contributions enable effective training:…
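The factorize-then-binarize idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes the approximation takes the form W ≈ diag(h) · sign(U) · diag(l) · sign(V)ᵀ · diag(g), uses a truncated SVD to obtain the latent factors, and sets the row, column, and latent scales (h, g, l) with a simple mean-magnitude heuristic, whereas the paper learns them during training. All function and variable names are hypothetical.

```python
import numpy as np


def factorize_and_binarize(W, rank):
    """Return sign factors and scales so that
    W ~= diag(h) @ U_sign @ diag(l) @ V_sign.T @ diag(g)."""
    # Truncated SVD gives real-valued latent factors U (d_out x r), V (d_in x r).
    U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:rank])
    U = U_full[:, :rank] * root
    V = Vt[:rank, :].T * root

    # Binarize: keep only the sign of every factor entry (1 bit each).
    U_sign = np.where(U >= 0, 1.0, -1.0)
    V_sign = np.where(V >= 0, 1.0, -1.0)

    # Heuristic scales (assumption, not learned): approximate |U| and |V| by
    # outer products of their row means and normalized column means.
    h = np.abs(U).mean(axis=1)                       # row scale, shape (d_out,)
    g = np.abs(V).mean(axis=1)                       # column scale, shape (d_in,)
    l_u = np.abs(U).mean(axis=0) / np.abs(U).mean()
    l_v = np.abs(V).mean(axis=0) / np.abs(V).mean()
    l = l_u * l_v                                    # latent scale, shape (rank,)
    return U_sign, V_sign, h, g, l


def forward(x, U_sign, V_sign, h, g, l):
    """Compute x @ W_hat.T without materializing W_hat; x is (batch, d_in)."""
    z = (x * g) @ V_sign            # project into the latent space, (batch, rank)
    return ((z * l) @ U_sign.T) * h


def bits_per_weight(d_out, d_in, rank, scale_bits=16):
    """Storage cost per original weight: 1-bit factors plus the scale vectors."""
    binary_bits = rank * (d_out + d_in)
    scale_storage = scale_bits * (d_out + d_in + rank)
    return (binary_bits + scale_storage) / (d_out * d_in)
```

In this accounting the bit rate is controlled by the latent rank: a 4096 x 4096 layer with rank 64 costs roughly 0.04 bits per original weight, including 16-bit scale vectors. The heuristic scales here only stand in for the learned importance parameters, so reconstruction quality from this sketch says nothing about the accuracy reported in the paper.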
Taxonomy
Topics: Advanced Data Compression Techniques · Image Processing Techniques and Applications
