Leveraging Self-Supervised Models for Automatic Whispered Speech Recognition
Aref Farhadipour, Homa Asadi, Volker Dellwo

TL;DR
This paper fine-tunes the self-supervised WavLM model to recognize whispered English speech in the Irish dialect, reaching a WER of 9.22% where the OpenAI Whisper baseline reached 18.8%.
Contribution
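It presents a pre-trained WavLM model fine-tuned on a combination of whispered and normal speech from the wTIMIT and CHAINS datasets, adapted for whispered speech recognition in the Irish dialect of English and addressing both the scarcity of training data and the distinct acoustic properties of whispered speech.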
Findings
The WavLM-based system achieved a WER of 9.22%
This improves on the OpenAI Whisper baseline, which reached a WER of 18.8% and a CER of 4.24%
Demonstrates effectiveness for dialect-specific whispered speech
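The error rates quoted for this paper (WER and CER) are conventionally computed as an edit distance between the reference transcript and the system output, divided by the reference length. The sketch below illustrates that standard definition only; it is not the authors' evaluation code, and the function names, whitespace tokenization, and absence of text normalization are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER: {wer(ref, hyp):.3f}")
    print(f"CER: {cer(ref, hyp):.3f}")
```

Because both rates are normalized by the reference length, a system can have a much lower CER than WER: a single wrong character counts as one full word error but only one character error.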
Abstract
In automatic speech recognition, any factor that alters the acoustic properties of speech can pose a challenge to the system's performance. This paper presents a novel approach for automatic whispered speech recognition in the Irish dialect using the self-supervised WavLM model. Conventional automatic speech recognition systems often fail to accurately recognise whispered speech due to its distinct acoustic properties and the scarcity of relevant training data. To address this challenge, we utilized a pre-trained WavLM model, fine-tuned with a combination of whispered and normal speech data from the wTIMIT and CHAINS datasets, which include the English language in Singaporean and Irish dialects, respectively. Our baseline evaluation with the OpenAI Whisper model highlighted its limitations, achieving a Word Error Rate (WER) of 18.8% and a Character Error Rate (CER) of 4.24% on whispered…
Taxonomy
Topics: Speech Recognition and Synthesis
