Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases

Xiaoxia Wu; Cheng Li; Reza Yazdani Aminabadi; Zhewei Yao; Yuxiong He

arXiv:2301.12017 · cs.CL · June 1, 2023 · 6 cites

TL;DR

This paper investigates INT4 quantization for transformer models, demonstrating significant latency improvements with minimal accuracy loss for certain models, and analyzing its limitations and compatibility with other compression techniques.

Contribution

The study develops an optimized INT4 inference pipeline and provides insights into its effectiveness, failure cases, and integration with other model compression methods.

Findings

01 INT4 quantization causes no to negligible accuracy loss for encoder-only and encoder-decoder models.

02 The optimized INT4 inference pipeline achieves up to 8.5x latency speedup.

03 Decoder-only models experience significant accuracy degradation with INT4.

Abstract

Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost. While INT8 quantization has recently been shown to be effective in reducing both the memory cost and latency while preserving model accuracy, it remains unclear whether we can leverage INT4 (which doubles peak hardware throughput) to achieve further latency improvement. In this study, we explore the feasibility of employing INT4 weight and activation (W4A4) quantization for language models. Our findings indicate that W4A4 quantization introduces no to negligible accuracy degradation for encoder-only and encoder-decoder models, but causes a significant accuracy drop for decoder-only models. To materialize the performance gain using W4A4, we develop a highly optimized end-to-end W4A4 encoder inference pipeline supporting different quantization…
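To make the W4A4 idea in the abstract concrete, here is a minimal NumPy sketch of symmetric 4-bit quantization applied to both the weights and the activations of one linear layer. It is an illustration under stated assumptions, not the authors' pipeline: the function names `quantize_int4` and `w4a4_matmul`, the per-row and per-column max-abs scaling, and the use of `int8` storage for 4-bit codes are all choices made here for clarity. An optimized kernel would pack two 4-bit values per byte and run the integer matmul on hardware INT4 units.

```python
import numpy as np

def quantize_int4(x, axis=-1):
    """Symmetric 4-bit quantization along `axis` (hypothetical helper).

    Returns integer codes in [-8, 7] (stored as int8 for simplicity)
    and the floating-point scales needed to dequantize them.
    """
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    # Avoid division by zero for all-zero rows/columns.
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def w4a4_matmul(activations, weights):
    """Approximate activations @ weights with both operands quantized to 4 bits."""
    qa, sa = quantize_int4(activations, axis=1)  # one scale per row (token)
    qw, sw = quantize_int4(weights, axis=0)      # one scale per output column
    # Integer matmul with 32-bit accumulation, then rescale to floating point.
    acc = qa.astype(np.int32) @ qw.astype(np.int32)
    return acc * sa * sw

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 64)).astype(np.float32)   # toy activations
w = rng.standard_normal((64, 8)).astype(np.float32)   # toy weights

approx = w4a4_matmul(a, w)
exact = a @ w
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative error of W4A4 matmul: {rel_err:.3f}")
```

The printed relative error gives a rough feel for how much precision 16 quantization levels cost on random data. It says nothing about end-task accuracy, which the abstract reports as model-dependent.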

Code & Models

Repositories

microsoft/DeepSpeed (PyTorch, official)

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications · Topic Modeling

Methods: Attention Is All You Need · Pruning · Linear Layer · Adam · Layer Normalization · Weight Decay · Multi-Head Attention · Residual Connection · Dense Connections