Softmax Bias Correction for Quantized Generative Models
Nilesh Prasad Pandey, Marios Fournarakis, Chirag Patel, Markus Nagel

TL;DR
This paper identifies the bias introduced by quantization in the softmax layer of generative models and proposes an offline bias correction method that enhances quantization accuracy without increasing runtime.
Contribution
The authors introduce a novel offline bias correction technique that reduces softmax quantization bias, improving accuracy of 8-bit quantized generative models without additional inference cost.
Findings
Significant accuracy improvements on Stable Diffusion v1.5 with 8-bit quantized softmax
Improved quantization of the 125M-parameter OPT language model
Bias correction absorbed into quantization parameters
Abstract
Post-training quantization (PTQ) is the go-to compression technique for large generative models, such as stable diffusion or large language models. PTQ methods commonly keep the softmax activation in higher precision as it has been shown to be very sensitive to quantization noise. However, this can lead to a significant runtime and power overhead during inference on resource-constrained edge devices. In this work, we investigate the source of the softmax sensitivity to quantization and show that the quantization operation leads to a large bias in the softmax output, causing accuracy degradation. To overcome this issue, we propose an offline bias correction technique that improves the quantizability of softmax without additional compute during deployment, as it can be readily absorbed into the quantization parameters. We demonstrate the effectiveness of our method on stable diffusion v1.5…
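The idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the uniform unsigned 8-bit quantizer, the fixed scale of 1/255, the single per-tensor correction term, the synthetic Gaussian logits, and all function names (`quantize_unsigned`, `estimate_bias`) are assumptions made for illustration only. The sketch measures the mean error that quantization introduces in the softmax output on calibration data, then adds that constant back on held-out data.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def quantize_unsigned(p, num_bits=8):
    # Hypothetical uniform quantizer for values in [0, 1]:
    # round to the nearest of 2**num_bits - 1 steps, then dequantize.
    levels = 2 ** num_bits - 1
    scale = 1.0 / levels
    q = np.clip(np.round(p / scale), 0, levels)
    return q * scale

def estimate_bias(calib_logits, num_bits=8):
    # Offline step: mean signed error of the quantized softmax,
    # estimated once on calibration data (per-tensor, an assumption).
    p = softmax(calib_logits)
    return float(np.mean(p - quantize_unsigned(p, num_bits)))

rng = np.random.default_rng(0)
calib_logits = rng.normal(0.0, 1.0, size=(64, 512))  # synthetic stand-in data
test_logits = rng.normal(0.0, 1.0, size=(64, 512))

bias = estimate_bias(calib_logits)

p = softmax(test_logits)
p_q = quantize_unsigned(p)
p_corrected = p_q + bias  # constant shift, no data-dependent work at inference

err_before = abs(float(np.mean(p - p_q)))
err_after = abs(float(np.mean(p - p_corrected)))
print(f"estimated bias:          {bias:.3e}")
print(f"|mean error| before:     {err_before:.3e}")
print(f"|mean error| after:      {err_after:.3e}")
```

Because the correction in this sketch is a single constant added to the dequantized output, it could in principle be folded into the quantizer's offset rather than applied as a separate operation, which is one plausible reading of "absorbed into the quantization parameters". The paper itself should be consulted for the actual granularity and formulation of the correction.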
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis
Methods: Softmax · OPT · Diffusion
