Multilingual Contextual Adapters To Improve Custom Word Recognition In Low-resource Languages
Devang Kulshreshtha, Saket Dingliwal, Brady Houston, Sravan Bodapati

TL;DR
This paper introduces a multilingual training strategy with supervision loss for Contextual Adapters in CTC-based ASR, significantly enhancing custom word recognition in low-resource languages and improving overall model performance.
Contribution
It proposes a supervised training method and multilingual approach for Contextual Adapters, addressing low-resource language challenges in custom word recognition.
Findings
48% F1 improvement in unseen custom entity retrieval
5-11% WER reduction in base CTC model
Effective strategy for low-resource language ASR
Abstract
Connectionist Temporal Classification (CTC) models are popular for their balance between speed and performance for Automatic Speech Recognition (ASR). However, these CTC models still struggle in other areas, such as personalization towards custom words. A recent approach explores Contextual Adapters, wherein an attention-based biasing model for CTC is used to improve the recognition of custom entities. While this approach works well with enough data, we showcase that it isn't an effective strategy for low-resource languages. In this work, we propose a supervision loss for smoother training of the Contextual Adapters. Further, we explore a multilingual strategy to improve performance with limited training data. Our method achieves 48% F1 improvement in retrieving unseen custom entities for a low-resource language. Interestingly, as a by-product of training the Contextual Adapters, we see…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
MethodsBalanced Selection · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
