DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition

Yui Sudo; Yosuke Fukumoto; Muhammad Shakeel; Yifan Peng; Chyi-Jiunn Lin; Shinji Watanabe

arXiv:2506.00422·cs.CL·June 4, 2025

TL;DR

DYNAC is a self-conditioned CTC approach that integrates dynamic vocabulary into non-autoregressive speech recognition, reducing the real-time factor (RTF) by 81% at a cost of 0.1 points in word error rate on LibriSpeech.

Contribution

The paper proposes a self-conditioned CTC method that incorporates dynamic vocabulary into non-autoregressive speech recognition models.
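
As a rough illustration of what self-conditioning on an expanded vocabulary could look like, the sketch below processes a single encoder frame in plain Python. It is not the paper's implementation: the function name self_condition, the dot-product scoring, and the reuse of the same token embeddings to feed the posterior back into the hidden state are simplifying assumptions made here. Practical self-conditioned CTC models typically use learned linear projections for both directions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def self_condition(frame, static_emb, dynamic_emb):
    """One self-conditioning step for one frame (illustrative only).

    frame:       hidden vector of size D from an intermediate encoder layer
    static_emb:  one D-dim embedding per static token (index 0 plays the blank)
    dynamic_emb: one D-dim embedding per context phrase (assumed to come
                 from some phrase encoder, which is not modelled here)
    """
    # Expanded vocabulary: static tokens first, then one token per phrase.
    vocab = static_emb + dynamic_emb
    # Intermediate posterior over static AND dynamic tokens.
    posterior = softmax([dot(frame, emb) for emb in vocab])
    # Feed the posterior back: expected embedding, added residually, so that
    # later layers can see which tokens the earlier layer favoured.
    feedback = [
        sum(p * emb[d] for p, emb in zip(posterior, vocab))
        for d in range(len(frame))
    ]
    conditioned = [h + f for h, f in zip(frame, feedback)]
    return conditioned, posterior


# Toy sizes and random values, chosen only to make the sketch runnable.
random.seed(0)
dim = 4
static_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
phrase_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
hidden = [random.gauss(0, 1) for _ in range(dim)]
conditioned, posterior = self_condition(hidden, static_tokens, phrase_tokens)
```

The point of the sketch is only that the intermediate prediction covers both token types and is visible to later layers, which is how a conditionally independent output layer can still reflect dependencies between static and dynamic tokens.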

Findings

01. Reduces real-time factor (RTF) by 81%
02. Degrades word error rate by only 0.1 points
03. Improves inference speed with near-identical accuracy
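
To make the RTF figure concrete, the short sketch below uses the standard definition of real-time factor (processing time divided by audio duration) and applies an 81% relative reduction. The baseline timing is a made-up placeholder, not a number from the paper, and the helper names are chosen here for illustration. An 81% reduction leaves 19% of the baseline RTF, which corresponds to a speedup of roughly 5.3x relative to whichever baseline the paper compares against.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = decoding time / audio duration; values below 1.0 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def apply_relative_reduction(baseline_rtf, reduction):
    """Return the RTF left after a relative reduction (0.81 means 81% lower)."""
    return baseline_rtf * (1.0 - reduction)


# Hypothetical baseline: 5 s of compute for 10 s of audio (not from the paper).
baseline = real_time_factor(processing_seconds=5.0, audio_seconds=10.0)
reduced = apply_relative_reduction(baseline, 0.81)
speedup = baseline / reduced
print(f"baseline RTF {baseline:.3f}, reduced RTF {reduced:.3f}, speedup {speedup:.2f}x")
```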

Abstract

Contextual biasing (CB) improves automatic speech recognition for rare and unseen phrases. Recent studies have introduced dynamic vocabulary, which represents context phrases as expandable tokens in autoregressive (AR) models. This method improves CB accuracy but at the cost of slow inference. While dynamic vocabulary can be applied to non-autoregressive (NAR) models, such as connectionist temporal classification (CTC), the conditional independence assumption fails to capture dependencies between static and dynamic tokens. This paper proposes DYNAC (Dynamic Vocabulary-based NAR Contextualization), a self-conditioned CTC method that integrates dynamic vocabulary into intermediate layers. By conditioning the encoder on dynamic vocabulary, DYNAC effectively captures dependencies between static and dynamic tokens while reducing the real-time factor (RTF). Experimental results show that DYNAC…
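
The idea of context phrases as expandable tokens in a CTC output can be sketched as follows. The collapse rule (merge consecutive repeats, then drop blanks) is standard CTC greedy decoding; the token name "<dyn:0>", the phrase table, and the frame sequence are hypothetical values invented for this example and do not reflect the paper's actual token inventory.

```python
from itertools import groupby

BLANK = "<blank>"


def ctc_greedy_collapse(frame_tokens):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    merged = [token for token, _ in groupby(frame_tokens)]
    return [token for token in merged if token != BLANK]


def expand_dynamic_tokens(tokens, phrase_table):
    """Replace each dynamic token with the context phrase it stands for."""
    return [phrase_table.get(token, token) for token in tokens]


# Hypothetical per-frame argmax output; one dynamic token covers a whole phrase.
frames = ["play", "play", BLANK, "<dyn:0>", "<dyn:0>", BLANK, BLANK, "now"]
phrases = {"<dyn:0>": "Kagurazaka"}

collapsed = ctc_greedy_collapse(frames)
text = " ".join(expand_dynamic_tokens(collapsed, phrases))
print(text)  # prints "play Kagurazaka now"
```

Because every frame is labelled in one parallel pass, a single dynamic token can emit a whole rare phrase without the token-by-token loop of an autoregressive decoder.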

Taxonomy

Topics: Speech Recognition and Synthesis