An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition

Yi-Cheng Wang; Li-Ting Pai; Bi-Cheng Yan; Hsin-Wei Wang; Chi-Han Lin; Berlin Chen

arXiv:2409.06468 · cs.CL · September 11, 2024

TL;DR

This paper proposes a context-balanced adaptation method for long-tailed speech recognition, improving recognition of rare and zero-shot words by addressing data imbalance issues in contextual modeling.

Contribution

It introduces a simple context-balanced learning objective and explores the impact of context list composition, significantly enhancing rare word recognition in E2E ASR models.

Findings

01

Using all vocabulary words as context improves performance.

02

The balanced objective reduces the character error rate (CER) by up to 1.21%.

03

The error rate on zero-shot words decreases by 9.44%.

Abstract

End-to-end (E2E) automatic speech recognition (ASR) models have become standard practice for various commercial applications. However, in real-world scenarios, the long-tailed nature of word distribution often leads E2E ASR models to perform well on common words but fall short in recognizing uncommon ones. Recently, the notion of a contextual adapter (CA) was proposed to infuse external knowledge represented by a context word list into E2E ASR models. Although CA can improve recognition performance on rare words, two crucial data imbalance problems remain. First, when low-frequency words are used as context words during training, these words rarely occur in the utterance, so CA becomes prone to overfitting to the <no-context> token because higher-frequency words are not present in the context list. Second, the long-tailed distribution within the context list itself still…
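The first imbalance problem described in the abstract can be illustrated with a toy count. The sketch below is not the paper's implementation: it assumes a simplified word-level view in which each spoken word's attention target is either its own entry in the context list or the <no-context> token, and the corpus, the function names, and the "occurs once" definition of a rare word are all hypothetical. It compares how often <no-context> becomes the target when the context list holds only rare words versus every vocabulary word.

```python
from collections import Counter

NO_CONTEXT = "<no-context>"


def attention_targets(utterance, context_list):
    """Map each word to itself if it is in the context list, else to <no-context>.

    Assumption: a word-level simplification of the adapter's attention target.
    """
    context = set(context_list)
    return [word if word in context else NO_CONTEXT for word in utterance]


def no_context_fraction(corpus, context_list):
    """Fraction of all targets in the corpus that point at <no-context>."""
    targets = [t for utt in corpus for t in attention_targets(utt, context_list)]
    return Counter(targets)[NO_CONTEXT] / len(targets)


# Hypothetical toy corpus: "call" and "the" are frequent, a few words occur once.
corpus = [
    ["call", "the", "doctor"],
    ["call", "the", "office"],
    ["call", "zofia"],
    ["the", "office", "called"],
]

counts = Counter(word for utt in corpus for word in utt)
rare_words = [w for w, c in counts.items() if c == 1]  # assumed rarity threshold
all_words = list(counts)

rare_fraction = no_context_fraction(corpus, rare_words)
all_fraction = no_context_fraction(corpus, all_words)

print(f"rare-only context list: {rare_fraction:.2f} of targets are {NO_CONTEXT}")
print(f"full-vocabulary context list: {all_fraction:.2f} of targets are {NO_CONTEXT}")
```

In this toy setting the rare-only list sends most targets to <no-context>, which is the skew the abstract warns about, while the full-vocabulary list removes it entirely. A real system would operate on model tokens and sampled context lists, so the exact proportions would differ.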

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Adapter