BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design
Yifu Ding, Xianglong Liu, Shenghao Jin, Jinyang Guo, Jiwen Lu

TL;DR
BWTA is an ultra low-bit quantization scheme (binary weights, ternary activations) built through algorithm-hardware co-design. It approaches full-precision accuracy and delivers significant speedups for Transformer models on GPUs.
Contribution
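The paper proposes a Binary Weights & Ternary Activations (BWTA) quantization scheme, a multi-stage training method (Smooth Multi-Stage Quantization), and a dedicated CUDA MatMul kernel, enabling efficient ultra low-bit Transformer inference with only a small accuracy drop relative to full precision.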
Findings
Approaches full-precision performance on BERT with minimal accuracy drop.
Achieves 16-24x speedup over FP16 on NVIDIA GPUs.
Reaches 216-330 tokens/sec end-to-end with a lower memory footprint.
Abstract
Ultra low-bit quantization brings substantial efficiency for Transformer-based models, but the accuracy degradation and limited GPU support hinder its wide usage. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, combining a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision…
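To make the abstract's core idea concrete, the following is a minimal Python sketch of binary weights, ternary activations, and a bit-packed dot product. It is an illustration under stated assumptions, not the paper's implementation: the mean-absolute-value weight scale, the fixed `threshold` used to project small activations to zero, and all function names (`binarize_weights`, `ternarize_activations`, `pack_bits`, `bwta_dot`) are hypothetical choices made for this example. The authors' CUDA kernel, training strategy, and projection factor are not reproduced here.

```python
def binarize_weights(w):
    """Map weights to {-1, +1}; return signs and a scale.

    Assumption: the scale is the mean absolute value, a common
    choice in binarization work, not necessarily the paper's.
    """
    scale = sum(abs(x) for x in w) / len(w)
    signs = [1 if x >= 0 else -1 for x in w]
    return signs, scale


def ternarize_activations(a, threshold):
    """Map activations to {-1, 0, +1}.

    Values with |x| <= threshold are projected to zero.
    Assumption: a fixed, user-supplied threshold.
    """
    return [0 if abs(x) <= threshold else (1 if x > 0 else -1) for x in a]


def pack_bits(flags):
    """Pack an iterable of booleans into one integer, bit i = flags[i]."""
    word = 0
    for i, flag in enumerate(flags):
        if flag:
            word |= 1 << i
    return word


def popcount(x):
    """Count set bits of a non-negative integer."""
    return bin(x).count("1")


def bwta_dot(w_signs, a_tern):
    """Dot product of a {-1,+1} vector and a {-1,0,+1} vector using bit ops.

    Each ternary value is stored as two bits: a non-zero mask and a sign bit.
    Among non-zero activations, matching signs contribute +1 and differing
    signs contribute -1, so the result is 2 * agree - nonzero.
    """
    w_pos = pack_bits(s > 0 for s in w_signs)   # 1 where weight is +1
    a_nz = pack_bits(t != 0 for t in a_tern)    # 1 where activation is non-zero
    a_pos = pack_bits(t > 0 for t in a_tern)    # 1 where activation is +1
    agree = ~(w_pos ^ a_pos) & a_nz             # XNOR, masked to non-zero lanes
    return 2 * popcount(agree) - popcount(a_nz)


# Small usage example with made-up numbers.
weights = [0.5, -0.3, 0.8, -0.1]
activations = [0.9, 0.02, -0.7, -0.4]

signs, scale = binarize_weights(weights)                   # [1, -1, 1, -1]
tern = ternarize_activations(activations, threshold=0.05)  # [1, 0, -1, -1]
print(bwta_dot(signs, tern))  # 1
```

The integer result would be multiplied by the weight scale (and any activation scale) to approximate the full-precision dot product. On a GPU, the XNOR/AND and popcount steps map onto word-wide bitwise instructions, which is the general reason bit-packed kernels can be fast; the specific instruction-level packing used by BWTA is described only in the paper itself.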
