Advancing STT for Low-Resource Real-World Speech
Flavio D'Intino, Hans-Peter Hutter

TL;DR
This paper introduces the SRB-300 dataset for Swiss German speech, capturing spontaneous dialectal speech in real-world settings, and demonstrates significant improvements in speech-to-text performance using fine-tuned Whisper models.
Contribution
Creation of SRB-300, a 300-hour corpus of real-world spontaneous Swiss German speech, and fine-tuning of Whisper models on it to improve STT accuracy for low-resource dialects.
Findings
WER reduced by up to 33%
BLEU scores increased by up to 40%
Best model achieved 17.1% WER and 74.8 BLEU
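For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate (WER) is commonly computed: the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. It is not the authors' evaluation code; their text normalization and tooling are not described in this summary. The function names and example sentences are hypothetical. The summary also does not say whether the reported reductions are relative or absolute, so `relative_reduction` only illustrates the relative reading as an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline` (assumed reading)."""
    return (baseline - improved) / baseline


# Illustrative inputs only; these are not sentences or scores from SRB-300.
wer = word_error_rate("das ist ein test", "das ist test")
print(f"WER: {wer:.2f}")
print(f"Relative reduction: {relative_reduction(0.30, 0.20):.1%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.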
Abstract
Swiss German is a low-resource language comprising diverse dialects that differ significantly from Standard German and from each other, and it lacks a standardized written form. As a result, transcribing Swiss German involves translating it into Standard German. Existing datasets have been collected in controlled environments, yielding effective speech-to-text (STT) models, but these models struggle with spontaneous conversational speech. This paper therefore introduces the new SRB-300 dataset, a 300-hour annotated speech corpus featuring real-world long-audio recordings from 39 Swiss German radio and TV stations. It captures spontaneous speech across all major Swiss dialects, recorded in various realistic environments, and overcomes the limitations of prior sentence-level corpora. We fine-tuned multiple OpenAI Whisper models on the SRB-300 dataset, achieving notable enhancements over…
