CIF-based Collaborative Decoding for End-to-end Contextual Speech Recognition
Minglun Han, Linhao Dong, Shiyu Zhou, Bo Xu

TL;DR
This paper introduces a CIF-based collaborative decoding method that effectively incorporates and controls contextual information in end-to-end speech recognition, significantly reducing errors on named entity-rich datasets.
Contribution
It proposes a novel context processing network that enables controllable contextual biasing in CIF-based models, improving recognition accuracy without degrading original performance.
Findings
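Achieved relative CER reductions of 8.83% and 21.13% on the named entity-rich evaluation sets of HKUST and AISHELL-2, respectively.
Reduced named entity character error rate (NE-CER) by over 40% and 50% on the same two sets, respectively.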
Maintained baseline performance on original evaluation sets.
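For readers unfamiliar with how such percentages are usually computed, the sketch below shows a relative error-rate reduction, assuming the figures above follow the common convention of measuring the improvement against the baseline's error rate. The function name and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Return the relative reduction in percent, measured against the baseline.

    Both arguments are error rates in the same unit (e.g. percent).
    """
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


# Hypothetical numbers for illustration only (not results from the paper):
# a baseline CER of 20.0 that drops to 18.0 is a 10% relative reduction,
# even though the absolute drop is only 2 points.
example = relative_reduction(20.0, 18.0)
```

Under this convention, a large relative reduction in named entity errors can coexist with a much smaller relative reduction in overall CER, since named entities make up only part of the characters in each evaluation set.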
Abstract
End-to-end (E2E) models have achieved promising results on multiple speech recognition benchmarks, and shown the potential to become the mainstream. However, the unified structure and the E2E training hamper injecting contextual information into them for contextual biasing. Though contextual LAS (CLAS) gives an excellent all-neural solution, the degree of biasing to given context information is not explicitly controllable. In this paper, we focus on incorporating context information into the continuous integrate-and-fire (CIF) based model that supports contextual biasing in a more controllable fashion. Specifically, an extra context processing network is introduced to extract contextual embeddings, integrate acoustically relevant context information and decode the contextual output distribution, thus forming a collaborative decoding with the decoder of the CIF-based model. Evaluated on…
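The abstract builds on the continuous integrate-and-fire (CIF) mechanism without describing it. The following is a minimal sketch of the basic CIF accumulation step as it is commonly described: per-frame weights are accumulated, and each time the running sum reaches a threshold, a weighted sum of encoder states is emitted as one token-level embedding. The function name `cif`, the NumPy representation, and the assumption that every weight lies in [0, 1] are illustrative choices, not the authors' implementation. The context processing network and the collaborative decoding step are not sketched here because the truncated abstract does not specify them.

```python
import numpy as np


def cif(encoder_states: np.ndarray, alphas: np.ndarray, threshold: float = 1.0) -> np.ndarray:
    """Integrate frame-level states into token-level embeddings.

    encoder_states: array of shape (T, D), one vector per acoustic frame.
    alphas: array of shape (T,), per-frame weights, each assumed to be in [0, 1]
            so that at most one boundary is fired per frame.
    Returns an array of shape (N, D), one row per fired embedding.
    """
    dim = encoder_states.shape[1]
    fired = []
    acc_weight = 0.0           # weight accumulated since the last firing
    acc_state = np.zeros(dim)  # weighted sum of states since the last firing

    for h, a in zip(encoder_states, alphas):
        if acc_weight + a < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += a
            acc_state = acc_state + a * h
        else:
            # Boundary reached: use just enough of this frame's weight to
            # complete the current embedding, then carry the rest forward.
            used = threshold - acc_weight
            fired.append(acc_state + used * h)
            remainder = a - used
            acc_weight = remainder
            acc_state = remainder * h

    if not fired:
        return np.zeros((0, dim))
    return np.stack(fired)


# Toy input: three frames whose weights sum to 2.0, so two embeddings fire.
states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weights = np.array([0.5, 0.75, 0.75])
tokens = cif(states, weights)
```

In this toy input the second frame's weight is split across the two emitted embeddings, which is how CIF places a token boundary inside a frame. Any weight left over after the last firing is discarded in this sketch.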
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
