Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model
Vishakha Lall, Yisi Liu

TL;DR
This paper introduces a method to improve domain-specific speech transcription accuracy by using contextual biasing with a neural-symbolic prefix tree, avoiding explicit fine-tuning of the Whisper model.
Contribution
The study presents a novel approach to enhance Whisper's transcription accuracy in specific domains without fine-tuning, using a neural-symbolic prefix tree for contextual biasing.
Findings
Significant reduction in word error rate in maritime domain data.
Enhanced downstream application performance with biased transcription.
Method effective across different Whisper model sizes.
Abstract
OpenAI's Whisper Automated Speech Recognition model excels in generalizing across diverse datasets and domains. However, this broad adaptability can lead to diminished performance in tasks requiring recognition of specific vocabularies. Addressing this challenge typically involves fine-tuning the model, which demands extensive labeled audio data that is often difficult to acquire and unavailable for specific domains. In this study, we propose a method to enhance transcription accuracy without explicit fine-tuning or altering model parameters, using a relatively small training dataset. Our method leverages contextual biasing to direct the Whisper model's output towards a specific vocabulary, integrating a neural-symbolic prefix tree structure to guide the transcription output. To validate our approach, we conducted experiments using a validation dataset comprising maritime data…
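The idea of steering decoding with a prefix tree can be illustrated with a minimal sketch. This is not the authors' neural-symbolic implementation and does not call Whisper; it only shows the symbolic half of the idea under simplifying assumptions: tokens are whole strings, the vocabulary entries are invented for illustration, and the bias is a fixed additive bonus on log-probabilities (a simple shallow-fusion-style stand-in, where the paper's method learns how much to bias). The names `PrefixTree`, `continuations`, and `bias_scores` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    is_end: bool = False


class PrefixTree:
    """Prefix tree over tokenized vocabulary entries (hypothetical helper)."""

    def __init__(self, entries):
        self.root = TrieNode()
        for tokens in entries:
            node = self.root
            for tok in tokens:
                node = node.children.setdefault(tok, TrieNode())
            node.is_end = True

    def continuations(self, history):
        """Return tokens that start a vocabulary entry or extend one that
        is partially matched by some suffix of the decoded history."""
        allowed = set()
        for start in range(len(history) + 1):
            node = self.root
            for tok in history[start:]:
                node = node.children.get(tok)
                if node is None:
                    break  # this suffix is not a prefix of any entry
            else:
                # Whole suffix matched (the empty suffix matches the root).
                allowed.update(node.children)
        return allowed


def bias_scores(log_probs, history, tree, bonus=2.0):
    """Add a fixed bonus to candidates the tree allows next.
    The fixed bonus is an assumption made for this sketch."""
    allowed = tree.continuations(history)
    return {
        tok: lp + bonus if tok in allowed else lp
        for tok, lp in log_probs.items()
    }


# Illustrative domain vocabulary, split into made-up subword tokens.
tree = PrefixTree([["star", "board"], ["port", "side"]])

# Candidate next tokens after the decoder has emitted "star".
candidates = {"board": -3.0, "bored": -1.0}
biased = bias_scores(candidates, ["hard", "to", "star"], tree, bonus=2.5)
best = max(biased, key=biased.get)
```

In this toy case the acoustically preferred but out-of-domain candidate `bored` is overtaken by `board` once the bonus is applied, because `star` followed by `board` is a path in the tree. A real integration would operate on Whisper's tokenizer IDs and logits at each decoding step rather than on strings.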
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing
