Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He

TL;DR
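This paper investigates INT4 weight and activation (W4A4) quantization for transformer language models. It demonstrates significant latency improvements with no to negligible accuracy loss for encoder-only and encoder-decoder models, shows that decoder-only models suffer a significant accuracy drop, and analyzes how INT4 quantization composes with other compression techniques.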
Contribution
The study develops an optimized INT4 inference pipeline and provides insights into its effectiveness, failure cases, and integration with other model compression methods.
Findings
W4A4 quantization causes no to negligible accuracy loss for encoder-only and encoder-decoder models.
The optimized W4A4 encoder inference pipeline achieves up to 8.5x latency speedup over FP16 inference.
Decoder-only models experience significant accuracy degradation under W4A4 quantization.
Abstract
Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost. While INT8 quantization has recently been shown to be effective in reducing both the memory cost and latency while preserving model accuracy, it remains unclear whether we can leverage INT4 (which doubles peak hardware throughput) to achieve further latency improvement. In this study, we explore the feasibility of employing INT4 weight and activation (W4A4) quantization for language models. Our findings indicate that W4A4 quantization introduces no to negligible accuracy degradation for encoder-only and encoder-decoder models, but causes a significant accuracy drop for decoder-only models. To materialize the performance gain using W4A4, we develop a highly optimized end-to-end W4A4 encoder inference pipeline supporting different quantization…
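To make the W4A4 idea in the abstract concrete, the following is a minimal sketch of symmetric per-tensor quantization to signed 4-bit integers, followed by a dot product accumulated in integers and rescaled once at the end. It is an illustration under stated assumptions, not the paper's pipeline: the function names (`quantize_int4`, `dequantize`, `w4a4_dot`), the per-tensor scaling, and the sample values are all hypothetical, and real kernels typically use finer-grained scales and packed INT4 storage.

```python
# Illustrative sketch only: symmetric per-tensor INT4 quantization.
# All names and the choice of per-tensor scaling are assumptions,
# not taken from the paper's implementation.

INT4_MIN, INT4_MAX = -8, 7  # range of a signed 4-bit integer


def quantize_int4(values):
    """Map floats to INT4 codes with one shared scale (symmetric, zero-point 0)."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [max(INT4_MIN, min(INT4_MAX, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from INT4 codes."""
    return [c * scale for c in codes]


def w4a4_dot(weights, activations):
    """Dot product with both operands quantized to INT4 (W4A4)."""
    w_codes, w_scale = quantize_int4(weights)
    a_codes, a_scale = quantize_int4(activations)
    # Accumulate in integers, then apply both scales once.
    acc = sum(w * a for w, a in zip(w_codes, a_codes))
    return acc * w_scale * a_scale


if __name__ == "__main__":
    weights = [0.7, -0.3, 0.1, 0.0]
    activations = [1.4, 0.2, -0.6, 0.8]
    exact = sum(w * a for w, a in zip(weights, activations))
    print("exact:", exact, "W4A4:", w4a4_dot(weights, activations))
```

With only 16 representable levels per tensor, the rounding error depends heavily on how the values are distributed; a single large outlier stretches the scale and pushes the remaining values toward zero. That sensitivity to outliers is one general reason low-bit activation quantization can hurt accuracy.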
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications · Topic Modeling
Methods: Attention Is All You Need · Pruning · Linear Layer · Adam · Layer Normalization · Weight Decay · Multi-Head Attention · Residual Connection · Dense Connections
