CIF-based Collaborative Decoding for End-to-end Contextual Speech Recognition

Minglun Han; Linhao Dong; Shiyu Zhou; Bo Xu

arXiv:2012.09466 · cs.CL · February 19, 2021

TL;DR

This paper introduces a CIF-based collaborative decoding method that effectively incorporates and controls contextual information in end-to-end speech recognition, significantly reducing errors on named entity-rich datasets.

Contribution

The paper proposes a context processing network that enables controllable contextual biasing in CIF-based models, improving recognition accuracy on named entity-rich data without degrading performance on the original evaluation sets.

Findings

01

Achieved character error rate (CER) reductions of 8.83% and 21.13% on the HKUST and AISHELL-2 datasets, respectively.

02

Reduced named entity character errors by over 40% and over 50% on the same two datasets.

03

Maintained baseline performance on original evaluation sets.

Abstract

End-to-end (E2E) models have achieved promising results on multiple speech recognition benchmarks, and shown the potential to become the mainstream. However, the unified structure and the E2E training hamper injecting contextual information into them for contextual biasing. Though contextual LAS (CLAS) gives an excellent all-neural solution, the degree of biasing to given context information is not explicitly controllable. In this paper, we focus on incorporating context information into the continuous integrate-and-fire (CIF) based model that supports contextual biasing in a more controllable fashion. Specifically, an extra context processing network is introduced to extract contextual embeddings, integrate acoustically relevant context information and decode the contextual output distribution, thus forming a collaborative decoding with the decoder of the CIF-based model. Evaluated on…
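The two ideas named in the abstract can be illustrated with a minimal sketch. The first function shows the general integrate-and-fire idea behind CIF: per-frame weights are accumulated, and an integrated embedding is emitted each time the accumulated weight reaches a threshold. The second shows one simple way a biasing degree could be made explicit, by adding a weighted contextual log-probability to the decoder's log-probability. The function names, the `bias_weight` parameter, the shared output vocabulary, and the interpolation rule itself are assumptions made for illustration; the abstract does not specify the paper's actual fusion rule, which may differ.

```python
import math

def cif(encoder_states, weights, threshold=1.0):
    """Integrate weighted encoder states and fire one embedding each time
    the accumulated weight reaches `threshold`.
    Assumes every weight lies in [0, 1], so a frame fires at most once."""
    dim = len(encoder_states[0])
    fired = []
    acc_weight = 0.0
    acc_state = [0.0] * dim
    for h, a in zip(encoder_states, weights):
        if acc_weight + a < threshold:
            # No boundary yet: keep integrating this frame.
            acc_weight += a
            acc_state = [s + a * x for s, x in zip(acc_state, h)]
        else:
            # Boundary reached: split the frame's weight between the
            # embedding being completed and the next one.
            used = threshold - acc_weight
            fired.append([s + used * x for s, x in zip(acc_state, h)])
            rest = a - used
            acc_weight = rest
            acc_state = [rest * x for x in h]
    return fired

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def collaborative_scores(decoder_logits, context_logits, bias_weight):
    """Hypothetical fusion rule: decoder log-probabilities plus a weighted
    contextual log-probability. bias_weight = 0 recovers the decoder alone."""
    dec = log_softmax(decoder_logits)
    ctx = log_softmax(context_logits)
    return [d + bias_weight * c for d, c in zip(dec, ctx)]

# Four one-dimensional frames whose weights sum to 2.0 yield two embeddings.
print(cif([[1.0], [2.0], [3.0], [4.0]], [0.5, 0.75, 0.25, 0.5]))  # [[1.5], [3.25]]

# The context branch favours token 2; the decoder alone favours token 0.
scores = collaborative_scores([2.0, 1.0, 0.0], [0.0, 0.0, 3.0], bias_weight=1.0)
print(max(range(len(scores)), key=scores.__getitem__))  # 2
```

In this sketch, setting `bias_weight` to zero leaves the decoder's scores unchanged, which mirrors the kind of explicit control over biasing that the abstract contrasts with CLAS.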

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing