Efficient Compression of Multitask Multilingual Speech Models
Thomas Palmeira Ferraz

TL;DR
This paper introduces DistilWhisper, a novel compression method for multilingual speech models that improves recognition accuracy for under-represented languages while maintaining model robustness and efficiency.
Contribution
It proposes a dual strategy of lightweight fine-tuning with language-specific experts and knowledge distillation to enhance multilingual speech model performance.
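To make the knowledge-distillation half of this strategy concrete, here is a minimal sketch of a generic distillation objective for a single output token. It is not taken from the paper: the choice of KL divergence, the temperature scaling, and every name here (`distillation_loss`, `alpha`, `temperature`) are illustrative assumptions, and the actual DistilWhisper loss may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, target_index,
                      alpha=0.5, temperature=2.0):
    """Hypothetical combined loss: supervised cross-entropy plus a
    teacher-matching term. All hyperparameters are assumptions."""
    # Supervised term: cross-entropy of the student against the gold token.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[target_index] + 1e-12)

    # Distillation term: pull the student's softened distribution
    # toward the teacher's softened distribution.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    distill = kl_divergence(teacher_soft, student_soft) * temperature ** 2

    return (1 - alpha) * cross_entropy + alpha * distill


if __name__ == "__main__":
    # Toy vocabulary of three tokens; the gold token is index 2.
    print(distillation_loss([0.5, 1.0, 2.0], [0.2, 0.8, 3.0], target_index=2))
```

In this sketch, `alpha` trades off fitting the labeled data against imitating the larger teacher model. When the student and teacher logits agree, the distillation term vanishes and only the supervised term remains.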
Findings
DistilWhisper outperforms standard fine-tuning and LoRA adapters.
It improves ASR accuracy for low-resource languages.
The approach introduces negligible parameter overhead.
Abstract
Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) biases. Despite that, we show that only model-related biases are amplified by quantization, with a greater impact on low-resource languages and smaller models. Searching for a better compression approach, we propose DistilWhisper, an approach that is able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies:…
Taxonomy
Topics: Advanced Data Compression Techniques · Speech Recognition and Synthesis
Methods: Knowledge Distillation
