End-to-end contextual ASR based on posterior distribution adaptation for hybrid CTC/attention system
Zhengyi Zhang, Pan Zhou

TL;DR
This paper introduces a novel contextual bias attention module for end-to-end speech recognition models, significantly improving recognition of infrequent proper nouns while maintaining overall performance.
Contribution
It proposes a CBA module that adapts posterior distributions of CTC and attention decoders based on preloaded bias phrases, enhancing contextual phrase recognition.
Findings
15% to 28% improvement in bias phrase recall
Minimal 1.7% performance degradation on general tests
Effective recognition of infrequent proper nouns
Abstract
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model. Although this simplifies the ASR system, it introduces a drawback for contextual ASR: the E2E model performs worse on utterances containing infrequent proper nouns. In this work, we propose adding a contextual bias attention (CBA) module to the attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases. Specifically, CBA uses the context vector of the source attention in the decoder to attend to a specific bias embedding. Jointly learned with the basic AED parameters, CBA can tell the model when and where to bias its output probability distribution. At the inference stage, a list of bias phrases is preloaded and we adapt the posterior distributions of both the CTC and attention decoders according to the attended bias phrase of…
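The mechanism described in the abstract (attend to a bias embedding, then adapt the output distribution) might be pictured with the toy sketch below. It is not the authors' implementation: the dot-product scoring, the "no-bias" entry at index 0, the `scale` factor, the log-space boost, and every name in the code are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_bias(context, bias_embeddings):
    """Dot-product attention of one decoder context vector over bias embeddings.

    Assumption: plain dot-product scoring; the paper may use a different form.
    """
    scores = [sum(c * b for c, b in zip(context, emb)) for emb in bias_embeddings]
    return softmax(scores)

def adapt_posterior(posterior, boost_token, weight, scale=2.0):
    """Boost one token in log space by scale * weight, then renormalize.

    Assumption: a simple additive log-space boost stands in for the paper's
    posterior adaptation rule, which the truncated abstract does not specify.
    """
    logits = [math.log(p) for p in posterior]
    logits[boost_token] += scale * weight
    return softmax(logits)

# Hypothetical toy setup: index 0 is a "no-bias" embedding, 1 and 2 are bias phrases.
bias_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
context = [3.0, 0.0]  # decoder context vector at the current step

weights = attend_bias(context, bias_embeddings)
attended = max(range(len(weights)), key=weights.__getitem__)

posterior = [0.7, 0.2, 0.1]          # decoder output over a 3-token vocabulary
next_token_of_phrase = {1: 2, 2: 1}  # hypothetical: next token id of each bias phrase

if attended != 0:
    adapted = adapt_posterior(posterior, next_token_of_phrase[attended], weights[attended])
else:
    adapted = posterior  # "no-bias" attended: leave the distribution untouched
```

In this toy case the context vector aligns with the first bias phrase, so the probability of that phrase's next token rises while the distribution still sums to one. The same adjustment could in principle be applied to both the CTC and attention decoder posteriors, as the abstract suggests.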
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
