LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation

Siqing Song; Chuang Wang; Yong Lang; Yi Yang; Xu-Yao Zhang

arXiv:2604.19167 · cs.LG · April 22, 2026

TL;DR

LBLLM introduces a three-stage distillation framework for effective W(1+1)A4 quantization of large language models, enabling resource-efficient deployment with minimal performance loss.

Contribution

The paper proposes a novel three-stage quantization strategy that decouples weight and activation quantization, improving stability and accuracy in low-bit LLM deployment.

Findings

01

LBLLM surpasses state-of-the-art binarization methods on multiple tasks.

02

Achieves effective W(1+1)A4 quantization after training on only 0.016B tokens with a single GPU.

03

Requires no extra high-precision channels or rotation matrices.

Abstract

Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization through a novel three-stage quantization strategy. The framework proceeds as follows: (1) initialize a high-quality quantized model via PTQ; (2) quantize binarized weights, group-wise bitmaps, and quantization parameters through layer-wise distillation while keeping activations in full precision; and (3) train learnable activation quantization factors to dynamically quantize activations to 4 bits. This decoupled design mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy. LBLLM, trained using only 0.016B tokens on a single GPU, surpasses existing state-of-the-art…
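To make the two quantization targets named in the abstract more concrete, here is a minimal Python sketch of group-wise weight binarization and dynamic 4-bit activation quantization. It is a generic illustration, not LBLLM's implementation: the function names, the mean-absolute-value scale, the symmetric [-8, 7] integer range, and the `clip` parameter (a stand-in for a learnable activation factor) are all assumptions. The group-wise bitmap, the PTQ initialization, and the layer-wise distillation described in the abstract are not modeled here.

```python
from typing import List, Tuple


def binarize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Binarize one weight group to signs plus a single scale.

    Assumption: the scale is the mean absolute value of the group,
    a common choice for sign-based binarization.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, scale


def dequantize_group(signs: List[int], scale: float) -> List[float]:
    """Reconstruct approximate weights from signs and the group scale."""
    return [s * scale for s in signs]


def quantize_activations_4bit(
    x: List[float], clip: float = 1.0
) -> Tuple[List[int], float]:
    """Symmetric dynamic 4-bit quantization to integers in [-8, 7].

    `clip` is a hypothetical stand-in for a learnable factor that
    shrinks the dynamic range computed from the current input.
    """
    max_abs = max(abs(v) for v in x) * clip
    step = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / step))) for v in x]
    return q, step


if __name__ == "__main__":
    signs, scale = binarize_group([0.5, -1.5, 1.0, -1.0])
    print(signs, scale, dequantize_group(signs, scale))
    print(quantize_activations_4bit([7.0, -3.0, 0.0, 1.0]))
    print(quantize_activations_4bit([7.0, -3.0, 0.0, 1.0], clip=0.5))
```

With `clip` below 1, large activations saturate at the ends of the integer range while small ones get finer resolution. A learnable factor of this kind would presumably tune that trade-off, though the abstract does not specify how.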
