Zero-shot Context Biasing with Trie-based Decoding using Synthetic Multi-Pronunciation
Changsong Liu, Yizhou Peng, and Eng Siong Chng

TL;DR
This paper introduces a zero-shot contextual ASR method that synthesizes multiple pronunciations of rare words using TTS, encodes them in a trie, and improves recognition accuracy for out-of-vocabulary words.
Contribution
It presents a novel synthesis-driven multi-pronunciation biasing approach that enhances zero-shot recognition of rare words in pretrained ASR models.
Findings
Reduces biased-word error rate by over 43% on LibriSpeech.
Maintains unbiased-WER while improving rare word recognition.
Uses TTS and trie-based decoding for effective zero-shot biasing.
Abstract
Contextual automatic speech recognition (ASR) systems allow for recognizing out-of-vocabulary (OOV) words, such as named entities or rare words. However, recognizing these words remains challenging due to limited training data and ambiguous or inconsistent pronunciations. In this paper, we propose a synthesis-driven multi-pronunciation contextual biasing method that performs zero-shot contextual ASR on a pretrained Whisper model. Specifically, we leverage text-to-speech (TTS) systems to synthesize diverse speech samples containing each target rare word, and then use the pretrained Whisper model to extract multiple predicted pronunciation variants. These variant token sequences are compiled into a prefix trie, which assigns rewards to beam hypotheses in a shallow-fusion manner during beam-search decoding. Subsequently, any recognized variant is mapped back to the original rare word in the final…
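The trie mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The class and function names (`BiasingTrie`, `total_bonus`, `map_back`), the use of string tokens in place of Whisper token IDs, the constant per-token reward, and the toy variants for "Koch" are all assumptions made for illustration. The sketch also omits details a real system would likely need, such as withdrawing the reward when a partial match fails.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    # Canonical rare word, set only where a complete variant ends.
    word: Optional[str] = None


class BiasingTrie:
    """Prefix trie over pronunciation-variant token sequences (hypothetical)."""

    def __init__(self, reward: float = 1.0) -> None:
        self.root = TrieNode()
        self.reward = reward  # assumed constant bonus per matched token

    def add_variant(self, tokens: List[str], word: str) -> None:
        """Insert one variant token sequence that maps back to `word`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.word = word

    def step(self, node: TrieNode, token: str) -> Tuple[TrieNode, float]:
        """Advance a hypothesis by one token; return (new state, bonus)."""
        nxt = node.children.get(token)
        if nxt is None:
            # The current match cannot continue; try starting a new one.
            nxt = self.root.children.get(token)
        if nxt is None:
            return self.root, 0.0
        return nxt, self.reward


def total_bonus(trie: BiasingTrie, tokens: List[str]) -> float:
    """Sum of rewards that would be added to a hypothesis' log-probability."""
    node, total = trie.root, 0.0
    for tok in tokens:
        node, bonus = trie.step(node, tok)
        total += bonus
    return total


def map_back(trie: BiasingTrie, tokens: List[str]) -> List[str]:
    """Replace the longest matching variant at each position with its word."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        node, j, best = trie.root, i, None
        while j < len(tokens) and tokens[j] in node.children:
            node = node.children[tokens[j]]
            j += 1
            if node.word is not None:
                best = (j, node.word)
        if best is not None:
            out.append(best[1])
            i = best[0]
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    trie = BiasingTrie(reward=2.0)
    trie.add_variant(["k", "ow", "sh"], "Koch")  # toy variant, assumed
    trie.add_variant(["k", "o", "ch"], "Koch")   # toy variant, assumed
    hyp = ["the", "k", "ow", "sh", "said"]
    print(total_bonus(trie, hyp), map_back(trie, hyp))
```

In a beam search, the bonus returned by `step` would be added to each candidate's score alongside the model's log-probability, which is the shallow-fusion pattern the abstract refers to. The `map_back` pass corresponds to the final step in which a recognized variant is replaced by the original rare word.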
