BWLA: Breaking the Barrier of W1AX Post-Training Quantization for LLMs
Zhixiong Zhao, Zukang Xu, Dawei Yang

TL;DR
BWLA is a post-training quantization method for LLMs that combines 1-bit weights with low-bit activations (e.g., 6 bits), reporting a Wikitext2 perplexity of 11.92 on Qwen3-32B and a 3.26x inference speedup.
Contribution
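BWLA is presented as the first post-training quantization framework to combine 1-bit weights with low-bit activations while preserving accuracy, which makes end-to-end acceleration possible. It relies on two components: the Orthogonal-Kronecker Transformation (OKT) and the Proximal SVD Projection (PSP).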
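To make the combination of 1-bit weights and low-bit activations concrete, here is a minimal NumPy sketch. It is not BWLA's procedure: it uses plain sign-and-scale binarization with a per-row scale and symmetric per-tensor rounding for the activations, and the function names `binarize_weights` and `quantize_activations` are hypothetical.

```python
import numpy as np

def binarize_weights(w):
    """Sign-and-scale binarization: w is approximated by alpha * sign(w)."""
    # One scale per output row; mean(|w|) minimizes the L2 error for sign codes.
    alpha = np.abs(w).mean(axis=1, keepdims=True)
    b = np.where(w >= 0, 1.0, -1.0)  # 1-bit codes in {-1, +1}
    return alpha, b

def quantize_activations(x, bits=6):
    """Symmetric per-tensor fake quantization onto a signed `bits`-bit grid."""
    qmax = 2 ** (bits - 1) - 1  # 31 for 6 bits
    scale = np.abs(x).max() / qmax
    if scale == 0:
        return x.copy(), 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))   # toy weight matrix (assumed shapes)
x = rng.normal(size=(16,))     # toy activation vector

alpha, b = binarize_weights(w)
x_q, x_scale = quantize_activations(x, bits=6)

y_ref = w @ x                  # full-precision output
y_w1a6 = (alpha * b) @ x_q     # 1-bit weights, 6-bit activations
rel_err = np.linalg.norm(y_ref - y_w1a6) / np.linalg.norm(y_ref)
```

On Gaussian toy data the relative error `rel_err` is dominated by the weight binarization, and heavy-tailed activations would inflate `x_scale` and coarsen the 6-bit grid. That is the kind of difficulty the abstract says existing binarization methods leave unaddressed.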
Findings
Achieves Wikitext2 perplexity of 11.92 with 6-bit activations on Qwen3-32B.
Improves accuracy on five zero-shot tasks by over 70%.
Provides 3.26x inference speedup.
Abstract
Large language models (LLMs) have driven major progress in NLP, yet their substantial memory and compute demands still hinder practical deployment. Binarization can compress weights to 1 bit, fundamentally lowering compute and bandwidth cost. However, existing methods cannot address activation heavy tails and thus must keep activations in high precision, preventing true end-to-end acceleration. To overcome this limitation, we propose BWLA (Binarized Weights and Low-bit Activations), the first post-training quantization framework that preserves high accuracy while achieving 1-bit weight quantization together with low-bit activations (e.g., 6 bits). The Orthogonal-Kronecker Transformation (OKT) learns an orthogonal mapping via EM minimization, converting unimodal weights into symmetric bimodal forms while suppressing activation tails and incoherence. The Proximal SVD Projection (PSP) then…
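The abstract describes OKT as an orthogonal mapping with Kronecker structure. The sketch below illustrates only the two generic properties such a mapping depends on, under stated assumptions: the factors `q1` and `q2` are random orthogonal matrices obtained by QR decomposition (not learned via the EM procedure the paper mentions), and all shapes are toy values. It checks that a Kronecker product of orthogonal factors is itself orthogonal, and that rotating weights and activations together leaves the layer output unchanged while spreading an activation outlier across channels.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

rng = np.random.default_rng(0)
q1 = random_orthogonal(4, rng)
q2 = random_orthogonal(4, rng)
q = np.kron(q1, q2)            # 16x16, stored as two 4x4 factors

w = rng.normal(size=(8, 16))   # toy weight matrix
x = rng.normal(size=(16,))     # toy activation vector
x[3] = 100.0                   # inject a single heavy-tail outlier channel

w_rot = w @ q                  # rotate the weights offline
x_rot = q.T @ x                # rotate the activations at run time

is_orthogonal = np.allclose(q.T @ q, np.eye(16))
output_preserved = np.allclose(w_rot @ x_rot, w @ x)
peak_before = np.abs(x).max()
peak_after = np.abs(x_rot).max()
```

Because `q` is orthogonal, `(w @ q) @ (q.T @ x)` equals `w @ x`, so the transformation can be folded in without changing the full-precision output. The Kronecker form needs only the two small factors rather than a dense 16x16 matrix, and the smaller `peak_after` shows how a rotation can flatten activation tails before low-bit quantization.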
