BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

Yifu Ding; Xianglong Liu; Shenghao Jin; Jinyang Guo; Jiwen Lu

arXiv:2604.03957 · cs.LG · April 7, 2026

TL;DR

BWTA is an ultra low-bit quantization scheme (Binary Weights & Ternary Activations) built through algorithm-hardware co-design. It approaches full-precision accuracy on Transformer models and reports 16-24x speedups over FP16 on NVIDIA GPUs.

Contribution

The paper proposes a quantization scheme that binarizes weights and ternarizes activations, together with specialized training and inference methods, enabling efficient ultra low-bit Transformer inference with minimal accuracy loss.
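To make the binary/ternary split concrete, here is a minimal sketch of the two quantizers in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names, the mean-absolute-value scale, and the fixed `threshold` are all hypothetical choices, and the paper's training-time components are not modeled here.

```python
def binarize_weights(w):
    """Map each weight to -1 or +1 and return the codes with a scale.

    Assumption: the scale is the mean absolute weight (a common choice,
    not necessarily the one used in the paper).
    """
    scale = sum(abs(v) for v in w) / len(w)
    codes = [1 if v >= 0 else -1 for v in w]
    return codes, scale


def ternarize_activations(x, threshold):
    """Map each activation to -1, 0, or +1.

    Values with |x| <= threshold are projected to 0, so tiny activations
    are not forced onto -1 or +1. The threshold is a hypothetical parameter.
    """
    codes = [0 if abs(v) <= threshold else (1 if v > 0 else -1) for v in x]
    kept = [abs(v) for v, c in zip(x, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale


weights = [0.4, -0.2, 0.05, -0.9]
activations = [0.03, -0.7, 0.5, -0.01]

w_codes, w_scale = binarize_weights(weights)
x_codes, x_scale = ternarize_activations(activations, threshold=0.1)
print(w_codes, w_scale)
print(x_codes, x_scale)
```

The ternary quantizer is where the zero level matters: the two activations near zero (0.03 and -0.01) map to 0, whereas a purely binary quantizer would push them to +1 and -1.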

Findings

01. Approaches full-precision performance on BERT with minimal accuracy drop.

02. Achieves 16-24x speedup over FP16 on NVIDIA GPUs.

03. Delivers 216-330 tokens/sec end-to-end speedup with lower memory footprint.

Abstract

Ultra low-bit quantization brings substantial efficiency for Transformer-based models, but the accuracy degradation and limited GPU support hinder its wide usage. In this paper, we analyze zero-point distortion in binarization and propose a Binary Weights & Ternary Activations (BWTA) quantization scheme, which projects tiny values to zero and preserves the accuracy of extremely low-bit models. For training, we propose Smooth Multi-Stage Quantization, combining a Levelwise Degradation Strategy and a Magnitude-Alignment Projection Factor to enable stable and fast convergence. For inference, we develop a BWTA MatMul CUDA kernel with instruction-level parallel bit-packing and comprehensive binary/ternary MatMul implementations for both linear and attention operators, allowing seamless integration across Transformer architectures. Experiments show that BWTA approaches full-precision…
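The arithmetic behind a bit-packed binary/ternary product can be sketched on the CPU in plain Python. This is only an illustration of the idea, not the paper's CUDA kernel: the encoding (one bit per weight, two bitmasks per ternary activation vector) and every function name below are assumptions made for the example.

```python
def popcount(value):
    """Count the set bits in a non-negative integer."""
    return bin(value).count("1")


def pack_binary(codes):
    """Pack -1/+1 codes into one integer: bit i is set when codes[i] == +1."""
    word = 0
    for i, c in enumerate(codes):
        if c == 1:
            word |= 1 << i
    return word


def pack_ternary(codes):
    """Pack -1/0/+1 codes into two bitmasks: positions of +1 and of -1."""
    pos = neg = 0
    for i, c in enumerate(codes):
        if c == 1:
            pos |= 1 << i
        elif c == -1:
            neg |= 1 << i
    return pos, neg


def packed_dot(w_word, pos, neg):
    """Dot product of binary weights and ternary activations via popcounts.

    Where x = +1 the term is +1 if the weight bit is set, else -1, giving
    2*popcount(w & pos) - popcount(pos). Where x = -1 the sign flips.
    Zero activations contribute nothing.
    """
    return (2 * (popcount(w_word & pos) - popcount(w_word & neg))
            - popcount(pos) + popcount(neg))


w_codes = [1, -1, 1, -1, -1, 1]
x_codes = [0, -1, 1, 0, -1, -1]

reference = sum(w * x for w, x in zip(w_codes, x_codes))
pos_mask, neg_mask = pack_ternary(x_codes)
packed = packed_dot(pack_binary(w_codes), pos_mask, neg_mask)
print(reference, packed)
```

The packed result matches the elementwise reference while using only bitwise AND and popcount. Those operations map onto single GPU instructions, which is the general reason bit-packed formats can be fast; how the paper's kernel actually arranges the bits is not specified in the text above.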
