Contextualized Automatic Speech Recognition with Dynamic Vocabulary
Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, Shinji Watanabe

TL;DR
This paper introduces a dynamic vocabulary approach for end-to-end speech recognition that adds bias tokens during inference, improving recognition accuracy for rare and contextual phrases without complex additional modules.
Contribution
It proposes a simple, architecture-agnostic method that treats bias phrases as single tokens, enhancing biasing performance in E2E-ASR models.
Findings
Improves bias phrase WER by 3.1 to 4.9 points on English and Japanese datasets.
Eliminates the need for external language model fusion or rescoring modules.
Easily integrates into existing E2E-ASR architectures.
Abstract
Deep biasing (DB) enhances the performance of end-to-end automatic speech recognition (E2E-ASR) models for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined static vocabulary. This naive sequence decomposition produces unnatural token patterns, significantly lowering their occurrence probability. More advanced techniques address this problem by expanding the vocabulary with additional modules, such as external language model shallow fusion or rescoring. However, these additional modules increase the workload. This paper proposes a dynamic vocabulary where bias tokens can be added during inference. Each entry in a bias list is represented as a single token rather than as a sequence of existing subword tokens. This approach eliminates the need to learn subword dependencies…
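To make the single-token idea concrete, the following is a minimal text-level sketch, not the paper's implementation. It only illustrates how a bias list could extend a static vocabulary at inference time so that each bias phrase maps to one token. The function names `build_dynamic_vocab` and `tokenize`, the `<phrase>` token format, and the use of characters as a stand-in for subword units are all assumptions made for illustration; the acoustic model and the scoring of bias tokens are not modeled here.

```python
from typing import Dict, List


def build_dynamic_vocab(static_vocab: List[str], bias_list: List[str]) -> Dict[str, int]:
    """Append one new token per bias phrase after the static vocabulary.

    Hypothetical helper: the "<phrase>" token format is an assumption.
    """
    vocab = {tok: i for i, tok in enumerate(static_vocab)}
    for phrase in bias_list:
        token = f"<{phrase}>"
        if phrase.split() and token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def tokenize(text: str, bias_list: List[str]) -> List[str]:
    """Emit a single token for each bias phrase found in the text.

    Words outside the bias list fall back to characters, which stand in
    for static subword units (word boundaries are dropped for brevity).
    """
    words = text.split()
    # Try longer phrases first so that overlapping entries resolve to the longest match.
    phrases = sorted(
        (p.split() for p in bias_list if p.split()), key=len, reverse=True
    )
    tokens: List[str] = []
    i = 0
    while i < len(words):
        for phrase in phrases:
            if words[i:i + len(phrase)] == phrase:
                tokens.append("<" + " ".join(phrase) + ">")
                i += len(phrase)
                break
        else:
            # No bias phrase starts here: decompose the word with the static units.
            tokens.extend(words[i])
            i += 1
    return tokens


static_vocab = list("abcdefghijklmnopqrstuvwxyz")
bias_list = ["shinji watanabe"]

vocab = build_dynamic_vocab(static_vocab, bias_list)
tokens = tokenize("call shinji watanabe", bias_list)
ids = [vocab[t] for t in tokens]
print(tokens)
print(ids)
```

In this sketch the bias phrase occupies one position in the output sequence, with an index just past the end of the static vocabulary, instead of being spelled out as fifteen separate units. With an empty bias list the same text falls back entirely to the static units.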
Taxonomy
Topics: Speech Recognition and Synthesis
