Shrinking Bigfoot: Reducing wav2vec 2.0 footprint
Zilun Peng, Akshay Budhkar, Ilana Tuil, Jason Levy, Parinaz Sobhani, Raphael Cohen, Jumana Nassour

TL;DR
This paper presents methods to significantly reduce the size and inference latency of wav2vec 2.0 speech recognition models through distillation and quantization, making them more practical for production use.
Contribution
It applies knowledge distillation and quantization to wav2vec 2.0, producing smaller and faster models with limited accuracy loss. The authors present this as the first work to compress wav2vec 2.0.
Findings
Student model is 2x faster and 4.8x smaller with 7% WER increase.
Quantized model is 3.6x smaller with 0.1% WER increase.
First to compress wav2vec 2.0 models.
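The teacher-student finding can be illustrated with a generic soft-target distillation loss. This is a minimal sketch of the general technique, not the paper's training objective, which this summary does not spell out. The function names, the temperature value, and the toy logits are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's.

    Hypothetical helper: the temperature of 2.0 is an illustrative default,
    not a value taken from the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy per-frame logits over three output symbols (made-up numbers).
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]
poor_student = [0.5, 1.0, 4.0]

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, poor_student))
```

A student whose outputs track the teacher's gets a loss near zero, while one that disagrees is penalized more heavily. Minimizing this kind of loss is what lets a smaller model inherit behavior from a larger one.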
Abstract
Wav2vec 2.0 is a state-of-the-art speech recognition model which maps speech audio waveforms into latent representations. The largest version of wav2vec 2.0 contains 317 million parameters. Hence, the inference latency of wav2vec 2.0 will be a bottleneck in production, leading to high costs and a significant environmental footprint. To improve wav2vec's applicability to a production setting, we explore multiple model compression methods borrowed from the domain of large language models. Using a teacher-student approach, we distilled the knowledge from the original wav2vec 2.0 model into a student model, which is 2 times faster and 4.8 times smaller than the original model. This increase in performance is accomplished with only a 7% degradation in word error rate (WER). Our quantized model is 3.6 times smaller than the original model, with only a 0.1% degradation in WER. To the best of…
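The quantization result can be made concrete with a small sketch of symmetric 8-bit weight quantization. This is an illustration under stated assumptions, not the authors' procedure: the abstract does not say which bit width or which layers were quantized, and the helper names and sample weights below are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale.

    Assumption: symmetric per-tensor quantization, chosen for simplicity.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in quantized]

# Made-up weights standing in for one tensor of a model.
weights = [-0.82, -0.10, 0.0, 0.33, 0.91]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Worst-case rounding error is about half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)

# Storage per weight: 4 bytes as float32 versus 1 byte as int8.
print(4 * len(weights), "bytes ->", 1 * len(weights), "bytes")
```

Under this assumption the quantized tensors shrink by 4x. The reported 3.6x overall reduction would be consistent with some parameters staying in higher precision, though the text above does not confirm that.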
