uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation in Low-Data Regimes

Abdul Waheed; Karima Kadaoui; Bhiksha Raj; Muhammad Abdul-Mageed

arXiv:2407.01257 · cs.CL · May 16, 2025

TL;DR

This paper introduces uDistil-Whisper, a label-free data filtering framework for knowledge distillation that improves low-resource speech recognition without requiring labeled data. The distilled models match or outperform comparable supervised filtering setups and are more compute- and memory-efficient than the teacher.

Contribution

The paper presents a novel label-free data filtering framework for distillation that eliminates the need for ground truth labels, enabling effective low-resource speech model training.

Findings

01. The best distilled models outperform the teacher model by 5-7 WER points.
02. The distilled models are 25-50% more compute- and memory-efficient than the teacher.
03. The distilled models match or surpass comparable supervised data filtering setups.

Abstract

Recent work on distilling Whisper's knowledge into small models using pseudo-labels shows promising performance while reducing the size by up to 50%. This results in small, efficient, and dedicated models. However, a critical step of distillation using pseudo-labels involves selecting high-quality predictions and using only those during training. This step requires ground-truth labels to compare against in order to filter out low-quality examples, making the process dependent on human labels. Additionally, the distillation process requires a large amount of data, thereby limiting its applicability in low-resource settings. To address this, we propose a distillation framework that does not require any labeled data. Through experimentation, we show that our best-distilled models outperform the teacher model by 5-7 WER points and are on par with or outperform similar supervised data filtering setups. When…
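
The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Example` record, its `avg_logprob` field, both filter functions, and both thresholds are hypothetical and chosen only for illustration. The first filter keeps pseudo-labels whose WER against a human reference is low, which is the label-dependent step the abstract describes. The second swaps in a reference-free confidence score (here, the teacher's mean token log-probability) as one plausible label-free proxy; the measures actually used in the paper may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """Hypothetical record for one utterance (illustrative only)."""
    pseudo_label: str                # transcription produced by the teacher
    avg_logprob: float               # assumed: teacher's mean token log-probability
    reference: Optional[str] = None  # human transcript, if one exists


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def filter_supervised(examples: List[Example], max_wer: float = 0.8) -> List[Example]:
    """Label-dependent filter: needs a human reference for every kept example."""
    return [ex for ex in examples
            if ex.reference is not None
            and word_error_rate(ex.reference, ex.pseudo_label) <= max_wer]


def filter_label_free(examples: List[Example], min_avg_logprob: float = -1.0) -> List[Example]:
    """Reference-free filter: keeps examples the teacher itself is confident about."""
    return [ex for ex in examples if ex.avg_logprob >= min_avg_logprob]
```

The supervised filter silently drops any utterance without a reference, which is the dependence on human labels that the abstract identifies; the label-free variant can score every utterance, at the cost of relying on a proxy whose agreement with WER has to be checked empirically.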

Code & Models

Repositories

ubc-nlp/udistilwhisper (PyTorch, official)

Videos

uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation in Low-Data Regimes · Underline

Taxonomy

Topics: Machine Learning and Data Classification · Rough Sets and Fuzzy Logic