Quantizing Whisper-small: How design choices affect ASR performance
Arthur Söhler, Julian Irigoyen, Andreas Søeborg Kirkedal

TL;DR
This paper evaluates how different post-training quantization techniques affect the performance and size of Whisper-small speech recognition models, aiming to enable deployment on edge devices.
Contribution
It provides a comprehensive, cross-library analysis of PTQ methods on Whisper-small, identifying optimal configurations for size reduction and accuracy preservation.
Findings
Dynamic int8 quantization with Quanto reduces model size by 57% while improving on the baseline's word error rate (WER).
Static quantization performs worse, likely due to Whisper's Transformer architecture.
Aggressive formats like nf4 and int3 achieve up to 71% compression at the cost of accuracy in noisy conditions.
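The findings above are stated in terms of word error rate. As a reminder of what that metric measures, here is a minimal sketch of WER as word-level edit distance divided by reference length. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; the paper's evaluation pipeline may apply text normalization not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A lower value is better, so "improving on the baseline's WER" means the quantized model produced a lower score than the unquantized one.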
Abstract
Large speech recognition models like Whisper-small achieve high accuracy but are difficult to deploy on edge devices due to their high computational demand. To this end, we present a unified, cross-library evaluation of post-training quantization (PTQ) on Whisper-small that disentangles the impact of quantization scheme, method, granularity, and bit-width. Our study is based on four libraries: PyTorch, Optimum-Quanto, HQQ, and bitsandbytes. Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate. Static quantization performed worse, likely due to Whisper's Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions. Overall, our results demonstrate that…
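To make two of the axes named in the abstract concrete (granularity and bit-width), the following is a minimal pure-Python sketch of symmetric uniform quantization applied per tensor and per row. It is an illustration under simplifying assumptions, not the method used by PyTorch, Optimum-Quanto, HQQ, or bitsandbytes; all function names are hypothetical, and non-uniform formats such as nf4 work differently.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 3 for a 3-bit format
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from integers and their scale."""
    return [q * scale for q in quantized]


def roundtrip_per_tensor(matrix, bits=8):
    """Coarse granularity: a single scale for the whole weight matrix."""
    flat = [v for row in matrix for v in row]
    quantized, scale = quantize_symmetric(flat, bits)
    restored = dequantize(quantized, scale)
    width = len(matrix[0])
    return [restored[i * width:(i + 1) * width] for i in range(len(matrix))]


def roundtrip_per_row(matrix, bits=8):
    """Finer granularity: each row gets its own scale."""
    return [dequantize(*quantize_symmetric(row, bits)) for row in matrix]


def max_abs_error(original_row, restored_row):
    return max(abs(a - b) for a, b in zip(original_row, restored_row))


if __name__ == "__main__":
    # Hypothetical weights: the first row is small, the second is 100x larger.
    weights = [[0.01, -0.02, 0.015], [1.0, -2.0, 1.5]]
    coarse = roundtrip_per_tensor(weights)
    fine = roundtrip_per_row(weights)
    print("row 0 error, per tensor:", max_abs_error(weights[0], coarse[0]))
    print("row 0 error, per row:   ", max_abs_error(weights[0], fine[0]))
```

With a single shared scale, the large second row dictates the step size and the small first row loses most of its resolution; per-row scales avoid this at the cost of storing one scale per row. Lowering `bits` shrinks `qmax` and coarsens the grid, which is the basic reason more aggressive bit-widths trade accuracy for compression.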
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques
