Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model

Vishakha Lall; Yisi Liu

arXiv:2410.18363·cs.AI·August 12, 2025

Open Access

TL;DR

This paper introduces a method to improve domain-specific speech transcription accuracy by using contextual biasing with a neural-symbolic prefix tree, avoiding explicit fine-tuning of the Whisper model.

Contribution

The study presents a novel approach to enhance Whisper's transcription accuracy in specific domains without fine-tuning, using a neural-symbolic prefix tree for contextual biasing.

Findings

01 Significant reduction in word error rate in maritime domain data.

02 Enhanced downstream application performance with biased transcription.

03 Method effective across different Whisper model sizes.
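
The first finding is stated in terms of word error rate (WER), which is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition. The function name and the example phrases are assumptions made here and are not taken from the paper or its data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a domain term split into two words counts as
# one substitution plus one insertion against a 4-word reference.
print(word_error_rate("alter course to starboard", "alter course to star board"))
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.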

Abstract

OpenAI's Whisper Automated Speech Recognition model excels in generalizing across diverse datasets and domains. However, this broad adaptability can lead to diminished performance in tasks requiring recognition of specific vocabularies. Addressing this challenge typically involves fine-tuning the model, which demands extensive labeled audio data that is often difficult to acquire and unavailable for specific domains. In this study, we propose a method to enhance transcription accuracy without explicit fine-tuning or altering model parameters, using a relatively small training dataset. Our method leverages contextual biasing to direct the Whisper model's output towards a specific vocabulary by integrating a neural-symbolic prefix tree structure to guide the model's transcription output. To validate our approach, we conducted experiments using a validation dataset comprising maritime data…
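
To make the idea of prefix-tree biasing concrete, the following is a minimal sketch of the symbolic half only: a tree built over tokenized vocabulary entries, used to raise the scores of tokens that continue a known term. It is not the paper's implementation. The fixed additive `bonus`, the helper names `bias_scores` and `advance`, and the integer token ids are all assumptions made for illustration; the abstract describes a neural-symbolic structure, and no learned component or Whisper API is shown here.

```python
from typing import Dict, List, Sequence


class PrefixTree:
    """Trie over token-id sequences of the custom vocabulary."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTree"] = {}

    def insert(self, tokens: Sequence[int]) -> None:
        node = self
        for token in tokens:
            node = node.children.setdefault(token, PrefixTree())


def bias_scores(scores: List[float], node: PrefixTree, bonus: float = 2.0) -> List[float]:
    """Return a copy of `scores` with `bonus` added to every token that
    extends the prefix represented by `node` (assumed additive bias)."""
    biased = list(scores)
    for token in node.children:
        biased[token] += bonus
    return biased


def advance(root: PrefixTree, node: PrefixTree, token: int) -> PrefixTree:
    """Follow `token` from the current node; if it does not continue the
    current prefix, try it as the start of a new entry. Return to the
    root when no entry matches or an entry has been completed."""
    nxt = node.children.get(token)
    if nxt is None:
        nxt = root.children.get(token)
    if nxt is None or not nxt.children:
        return root
    return nxt


# Hypothetical token ids: one two-token term [3, 4] and one single-token term [5].
root = PrefixTree()
root.insert([3, 4])
root.insert([5])

node = root
for step_scores in ([0.1, 0.9, 0.0, 0.5, 0.0, 0.2], [0.3, 0.0, 0.0, 0.0, 0.2, 0.1]):
    biased = bias_scores(step_scores, node)
    token = max(range(len(biased)), key=biased.__getitem__)  # greedy pick
    node = advance(root, node, token)
    print(token)
```

In a real decoder the scores would be the model's per-step token logits and the vocabulary entries would come from the model's own tokenizer; both are replaced by small hand-written lists here so the sketch runs without any model.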

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing