LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
Siqing Song, Chuang Wang, Yong Lang, Yi Yang, Xu-Yao Zhang

TL;DR
LBLLM introduces a three-stage distillation framework for effective W(1+1)A4 quantization of large language models, enabling resource-efficient deployment with minimal performance loss.
Contribution
The paper proposes a novel three-stage quantization strategy that decouples weight and activation quantization, improving stability and accuracy in low-bit LLM deployment.
Findings
LBLLM surpasses state-of-the-art binarization methods on multiple tasks.
Achieves effective W(1+1)A4 quantization using only 0.016B training tokens on a single GPU.
No extra high-precision channels or rotational matrices are needed.
Abstract
Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization through a novel three-stage quantization strategy. The framework proceeds as follows: (1) initialize a high-quality quantized model via PTQ; (2) quantize binarized weights, group-wise bitmaps, and quantization parameters through layer-wise distillation while keeping activations in full precision; and (3) train learnable activation quantization factors to dynamically quantize activations to 4 bits. This decoupled design mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy. LBLLM, trained using only 0.016B tokens on a single GPU, surpasses existing state-of-the-art…
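To make the W(1+1)A4 notation concrete, the following NumPy sketch shows one plausible reading of it: each weight keeps a sign bit plus one bitmap bit that selects between two per-row scales, and activations are dynamically quantized to 4 bits per token. This is an illustration under stated assumptions, not the authors' implementation. The function names, the median-magnitude bitmap, the mean-absolute-value scales, and the fixed `clip_factor` (standing in for a learnable activation factor) are all hypothetical, and the PTQ initialization and distillation stages are omitted entirely.

```python
import numpy as np

def binarize_with_bitmap(w, group_bitmap):
    """Fake-quantize weights to 1 sign bit + 1 group bit per entry.

    Assumption: each row has two scales, one per bitmap group, set to the
    mean absolute value of the weights in that group.
    """
    sign = np.where(w >= 0, 1.0, -1.0)
    w_hat = np.zeros_like(w)
    for g in (0, 1):
        mask = group_bitmap == g
        # Avoid division by zero if a group happens to be empty in a row.
        count = np.maximum(mask.sum(axis=1, keepdims=True), 1)
        scale = (np.abs(w) * mask).sum(axis=1, keepdims=True) / count
        w_hat += sign * scale * mask
    return w_hat

def quantize_activations_4bit(x, clip_factor=1.0):
    """Symmetric per-token dynamic 4-bit fake quantization.

    clip_factor is a hypothetical stand-in for a learnable factor; here it
    is a plain constant.
    """
    qmax = 7  # signed 4-bit integers span [-8, 7]
    max_abs = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.maximum(clip_factor * max_abs, 1e-8) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64))   # toy weight matrix (out_features, in_features)
x = rng.normal(size=(4, 64))   # toy activations (tokens, in_features)

# Hypothetical bitmap: 1 for weights above the row's median magnitude.
bitmap = (np.abs(w) > np.median(np.abs(w), axis=1, keepdims=True)).astype(np.int64)

w_hat = binarize_with_bitmap(w, bitmap)
x_hat = quantize_activations_4bit(x)

y_ref = x @ w.T          # full-precision output
y_q = x_hat @ w_hat.T    # output with quantized weights and activations
rel_err = np.linalg.norm(y_ref - y_q) / np.linalg.norm(y_ref)
print(f"relative output error: {rel_err:.3f}")
```

In this sketch each row of `w_hat` takes at most four distinct values (plus or minus one of two scales), which is what two bits per weight can encode. A layer-wise distillation stage would tune such scales and bitmaps to reduce the gap between `y_ref` and `y_q` rather than fixing them by a heuristic as done here.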
