BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation
Raphaël Bagat, Irina Illina, Emmanuel Vincent

TL;DR
This paper introduces BEARD, a self-supervised learning framework that adapts Whisper's encoder for low-resource, noisy, and specialized speech domains using unlabeled data, improving ASR performance.
Contribution
The paper presents the first use of a self-supervised learning objective for domain adaptation of Whisper, combining BEST-RQ with knowledge distillation for improved speech recognition.
Findings
BEARD achieves a 12% relative improvement over baseline models.
The approach effectively adapts Whisper to challenging ATC communication data.
Adapting on about 5,000 hours of untranscribed speech, followed by fine-tuning on 2 hours of transcribed speech, improves ASR accuracy in this low-resource domain.
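To make the "relative improvement" figure concrete, here is a minimal sketch of the arithmetic. It assumes the metric is an error rate where lower is better (the summary does not name the metric), and the two input values are hypothetical numbers chosen only to reproduce a 12% relative gain; they are not results from the paper.

```python
def relative_improvement(baseline_error, new_error):
    """Fraction of the baseline error rate removed by the new system."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates, chosen only to illustrate the arithmetic.
print(f"{relative_improvement(25.0, 22.0):.0%}")  # prints "12%"
```

A relative figure like this depends on the baseline it is measured against, so the same absolute gain reads as a larger percentage when the baseline error is smaller.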
Abstract
Automatic Speech Recognition (ASR) systems, despite large multilingual training, struggle in low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper's encoder with unlabeled data. Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder's complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms previous baseline and…
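The combination described in the abstract can be sketched as a two-term loss. This is a minimal illustration in plain Python, not the authors' implementation: the BEST-RQ part follows the general recipe of a frozen random projection and frozen random codebook that turn input frames into discrete targets for masked frames, while the distillation part pulls student encoder features toward those of a frozen teacher. The function names, the mean-squared-error distance, the use of all frames for distillation, and the `distill_weight` parameter are all assumptions made for this example.

```python
import math
import random


def _unit(v):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def make_quantizer(feat_dim, code_dim, codebook_size, seed=0):
    """Build a random projection and codebook; both stay frozen afterwards."""
    rng = random.Random(seed)
    proj = [[rng.gauss(0, 1) for _ in range(code_dim)] for _ in range(feat_dim)]
    codebook = [_unit([rng.gauss(0, 1) for _ in range(code_dim)])
                for _ in range(codebook_size)]
    return proj, codebook


def quantize(frame, proj, codebook):
    """Project one input frame and return the index of its nearest code."""
    z = _unit([sum(frame[i] * proj[i][j] for i in range(len(frame)))
               for j in range(len(proj[0]))])
    # For unit vectors, the nearest code is the one with the largest dot product.
    scores = [sum(a * b for a, b in zip(z, code)) for code in codebook]
    return max(range(len(codebook)), key=scores.__getitem__)


def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(frames, mask, student_logits, student_feats, teacher_feats,
                  proj, codebook, distill_weight=1.0):
    """Masked-frame BEST-RQ cross-entropy plus feature distillation (assumed MSE)."""
    masked = [t for t, is_masked in enumerate(mask) if is_masked]
    ce = sum(cross_entropy(student_logits[t], quantize(frames[t], proj, codebook))
             for t in masked) / max(len(masked), 1)
    kd = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats)) / len(frames)
    return ce + distill_weight * kd
```

In this sketch the quantizer is never updated, so the prediction targets stay fixed during adaptation, and the distillation term is what keeps the adapted encoder's features close to those the pre-trained decoder expects.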
