Towards Personalization of CTC Speech Recognition Models with Contextual Adapters and Adaptive Boosting

Saket Dingliwal; Monica Sunkara; Sravan Bodapati; Srikanth Ronanki; Jeff Farris; Katrin Kirchhoff

arXiv:2210.09510·cs.CL·November 15, 2022

TL;DR

This paper introduces a novel method for personalizing CTC-based speech recognition models by incorporating contextual adapters and adaptive boosting, significantly improving rare word recognition in domain-specific datasets.

Contribution

It proposes a two-way approach combining encoder biasing with attention and dynamic boosting during decoding to enhance personalization of CTC speech models.
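To make the encoder-biasing idea concrete, here is a minimal sketch of attention over a list of bias-word embeddings. It is an illustration under stated assumptions, not the authors' implementation: the names `softmax` and `bias_frame`, the scaled dot-product scoring, the all-zero "no bias" entry, and the additive update are all hypothetical choices made for this example.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bias_frame(
    frame: Sequence[float],
    word_embeddings: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Attend from one encoder frame over bias-word embeddings.

    Assumption: entry 0 of word_embeddings is an all-zero "no bias" vector,
    so the frame can attend to it when no listed word is relevant.
    Returns the biased frame and the attention weights.
    """
    d = len(frame)
    # Scaled dot-product score between the frame (query) and each word (key).
    scores = [
        sum(f * w for f, w in zip(frame, emb)) / math.sqrt(d)
        for emb in word_embeddings
    ]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the word embeddings (values).
    context = [
        sum(wt * emb[i] for wt, emb in zip(weights, word_embeddings))
        for i in range(d)
    ]
    # Additive update of the encoder frame with the context vector.
    biased = [f + c for f, c in zip(frame, context)]
    return biased, weights


frame = [1.0, 0.0]
bias_list = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]  # "no bias" + two toy words
biased, weights = bias_frame(frame, bias_list)
```

In this toy setup the frame points in the same direction as the second entry, so that entry receives the largest attention weight and pulls the biased frame toward it.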

Findings

01 Achieved a 60% improvement in F1 score on domain-specific rare words

02 Demonstrated effectiveness on the open-source VoxPopuli dataset and in-house medical datasets

03 Enhanced recognition of out-of-vocabulary words

Abstract

End-to-end speech recognition models trained using joint Connectionist Temporal Classification (CTC)-Attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time due to its speed and simplicity. However, such models are hard to personalize because of their conditional independence assumption, which prevents output tokens from previous time steps from influencing future predictions. To tackle this, we propose a novel two-way approach that first biases the encoder with attention over a predefined list of rare long-tail and out-of-vocabulary (OOV) words and then uses dynamic boosting and a phone alignment network during decoding to further bias the subword predictions. We evaluate our approach on open-source VoxPopuli and in-house medical datasets to showcase a 60% improvement in F1 score on domain-specific rare words over a…
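The conditional independence assumption mentioned in the abstract can be illustrated with greedy CTC decoding, where each frame's token is chosen without looking at earlier outputs. The sketch below also adds a fixed score bonus to a chosen set of token ids as a deliberately simplified stand-in for boosting at decoding time; it is not the paper's dynamic boosting or phone alignment network. The function name `ctc_greedy_decode`, the `boosted` and `bonus` parameters, and the blank index of 0 are assumptions made for this example.

```python
from typing import AbstractSet, List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(
    log_probs: Sequence[Sequence[float]],
    boosted: AbstractSet[int] = frozenset(),
    bonus: float = 0.0,
) -> List[int]:
    """Greedy CTC decoding with an optional fixed bonus for selected tokens.

    Each frame is scored on its own (conditional independence): the choice at
    frame t never depends on the tokens chosen at earlier frames.
    """
    best_path = []
    for frame in log_probs:
        # Add the bonus to boosted token ids; leave other scores unchanged.
        scores = [s + bonus if tok in boosted else s for tok, s in enumerate(frame)]
        best_path.append(max(range(len(scores)), key=scores.__getitem__))

    # Standard CTC collapse: merge repeated tokens, then drop blanks.
    decoded = []
    prev = None
    for tok in best_path:
        if tok != prev and tok != BLANK:
            decoded.append(tok)
        prev = tok
    return decoded


# Toy log-probabilities: 4 frames over a vocabulary of 4 ids (0 is blank).
frames = [
    [-2.0, -0.1, -3.0, -3.0],
    [-2.0, -0.2, -3.0, -3.0],
    [-0.1, -3.0, -3.0, -3.0],
    [-3.0, -3.0, -0.7, -0.9],
]
plain = ctc_greedy_decode(frames)
with_boost = ctc_greedy_decode(frames, boosted={3}, bonus=0.5)
```

In this toy input, the last frame narrowly prefers token 2 over token 3, so `plain` is `[1, 2]`; adding a 0.5 bonus to token 3 flips that frame and `with_boost` becomes `[1, 3]`. A fixed bonus like this applies at every frame regardless of context, which is one reason a more selective, context-dependent scheme is attractive.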

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings