Softmax Bias Correction for Quantized Generative Models

Nilesh Prasad Pandey; Marios Fournarakis; Chirag Patel; Markus Nagel

arXiv:2309.01729 · cs.LG · September 6, 2023 · 1 citation

Open Access

TL;DR

This paper identifies the bias introduced by quantization in the softmax layer of generative models and proposes an offline bias correction method that enhances quantization accuracy without increasing runtime.

Contribution

The authors introduce a novel offline bias correction technique that reduces softmax quantization bias, improving accuracy of 8-bit quantized generative models without additional inference cost.

Findings

01. Significant accuracy improvements on Stable Diffusion v1.5

02. Enhanced quantization of the 125M OPT language model

03. Bias correction absorbed into the quantization parameters

Abstract

Post-training quantization (PTQ) is the go-to compression technique for large generative models, such as Stable Diffusion or large language models. PTQ methods commonly keep the softmax activation in higher precision, as it has been shown to be very sensitive to quantization noise. However, this can lead to significant runtime and power overhead during inference on resource-constrained edge devices. In this work, we investigate the source of the softmax sensitivity to quantization and show that the quantization operation leads to a large bias in the softmax output, causing accuracy degradation. To overcome this issue, we propose an offline bias correction technique that improves the quantizability of softmax without additional compute during deployment, as it can be readily absorbed into the quantization parameters. We demonstrate the effectiveness of our method on Stable Diffusion v1.5…
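
The idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes an unsigned 8-bit uniform quantizer with scale 1/255, synthetic Gaussian logits standing in for calibration data, and a single per-tensor correction estimated as the mean quantization error; the abstract only states that the correction is computed offline and absorbed into the quantization parameters. The helper names (`quantize_int`, `dequantize`, `estimate_bias`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def quantize_int(p, scale, n_bits=8):
    # Map real values onto the unsigned integer grid [0, 2^n_bits - 1].
    return np.clip(np.round(p / scale), 0, 2 ** n_bits - 1)

def dequantize(q, scale, zero_point=0.0):
    # Standard affine dequantization: scale * (q - zero_point).
    return scale * (q - zero_point)

def estimate_bias(calib_logits, scale):
    # Offline step (assumed form): mean error between the full-precision
    # softmax and its quantized version on calibration data.
    p = softmax(calib_logits)
    p_hat = dequantize(quantize_int(p, scale), scale)
    return float(np.mean(p - p_hat))

rng = np.random.default_rng(0)
# Synthetic stand-ins for attention logits: rows of length 1024.
calib_logits = rng.normal(0.0, 3.0, size=(256, 1024))
test_logits = rng.normal(0.0, 3.0, size=(256, 1024))

scale = 1.0 / 255.0
beta = estimate_bias(calib_logits, scale)

# Absorb the correction into the quantizer: shifting the zero-point by
# -beta / scale adds the constant beta to every dequantized value, so no
# extra operation is needed at inference time.
zero_point_corrected = -beta / scale

p = softmax(test_logits)
q = quantize_int(p, scale)
err_plain = float(np.mean(dequantize(q, scale) - p))
err_corr = float(np.mean(dequantize(q, scale, zero_point_corrected) - p))

print(f"estimated correction beta: {beta:.3e}")
print(f"mean error without correction: {err_plain:.3e}")
print(f"mean error with correction:    {err_corr:.3e}")
```

In this toy setting most softmax outputs fall below half a quantization step and round to zero, so the quantized output is systematically too small; the constant shift removes most of that mean error. Whether a single per-tensor constant is sufficient for a real model is a separate question that this sketch does not address.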

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis

Methods: Softmax · OPT · Diffusion