Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation

Zhennan Lin; Kaixun Huang; Wei Ren; Linju Yang; Lei Xie

arXiv:2505.23077·cs.SD·May 30, 2025

TL;DR

This paper presents an encoder-based, phrase-level method for contextualized ASR that uses dynamic vocabulary prediction and activation to preserve the integrity of contextual phrases, reporting relative WER reductions of 28.31% on Librispeech and 23.49% on Wenetspeech.

Contribution

The paper introduces a phrase-level biasing approach that combines architectural optimizations, a bias loss, and confidence-activated decoding, so that contextual phrases are output in full while incorrect bias is suppressed.

Findings

01 Achieves relative WER reductions of 28.31% on Librispeech and 23.49% on Wenetspeech compared to the baseline.

02 Reduces WER on contextual phrases by over 70%.

03 Demonstrates the effectiveness of dynamic vocabulary prediction and confidence activation.
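
For reference, a relative WER reduction is conventionally the drop in WER divided by the baseline WER. Below is a minimal Python sketch, assuming that standard definition. The function name and the sample numbers are illustrative and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, of new_wer over baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    # Relative reduction = (baseline - new) / baseline, expressed in percent.
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Illustrative values only (hypothetical, not results from the paper):
# a baseline WER of 4.0% improved to 3.0% is a 25% relative reduction.
print(relative_wer_reduction(4.0, 3.0))  # 25.0
```

Note that a relative reduction of about 28% does not mean the WER dropped by 28 absolute points; it means the new WER is roughly 72% of the baseline WER.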

Abstract

Deep biasing improves automatic speech recognition (ASR) performance by incorporating contextual phrases. However, most existing methods enhance subwords in a contextual phrase as independent units, potentially compromising contextual phrase integrity, leading to accuracy reduction. In this paper, we propose an encoder-based phrase-level contextualized ASR method that leverages dynamic vocabulary prediction and activation. We introduce architectural optimizations and integrate a bias loss to extend phrase-level predictions based on frame-level outputs. We also introduce a confidence-activated decoding method that ensures the complete output of contextual phrases while suppressing incorrect bias. Experiments on Librispeech and Wenetspeech datasets demonstrate that our approach achieves relative WER reductions of 28.31% and 23.49% compared to baseline, with the WER on contextual phrases…

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing