Improving End-to-End Contextual Speech Recognition with Fine-Grained Contextual Knowledge Selection

Minglun Han; Linhao Dong; Zhenlin Liang; Meng Cai; Shiyu Zhou; Zejun Ma; Bo Xu

arXiv:2201.12806 · cs.CL · March 3, 2022



TL;DR

This paper introduces FineCoS, a fine-grained contextual knowledge selection method for end-to-end speech recognition that reduces confusion between similar context-specific phrases and improves recognition accuracy on LibriSpeech and an in-house dataset.

Contribution

It proposes a novel fine-grained knowledge selection approach that narrows phrase candidates and refines token attention, enhancing contextual biasing in speech recognition.

Findings

01. Achieved up to 6.1% WER reduction on LibriSpeech
02. Achieved up to 16.4% CER reduction on an in-house dataset
03. Demonstrated the effectiveness of FineCoS with collaborative decoding

Abstract

Nowadays, most methods in end-to-end contextual speech recognition bias the recognition process towards contextual knowledge. Since all-neural contextual biasing methods rely on phrase-level contextual modeling and attention-based relevance modeling, they may encounter confusion between similar context-specific phrases, which hurts predictions at the token level. In this work, we focus on mitigating confusion problems with fine-grained contextual knowledge selection (FineCoS). In FineCoS, we introduce fine-grained knowledge to reduce the uncertainty of token predictions. Specifically, we first apply phrase selection to narrow the range of phrase candidates, and then conduct token attention on the tokens in the selected phrase candidates. Moreover, we re-normalize the attention weights of the most relevant phrases in inference to obtain more focused phrase-level contextual representations,…
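The abstract describes a two-stage selection: phrase-level attention picks a few candidate phrases, their weights are re-normalized, and token-level attention then runs only over the tokens of those candidates. The snippet below is a minimal sketch of that flow in plain Python under stated assumptions. It is not the authors' implementation. The function `fine_grained_select`, the mean-pooled phrase embeddings, the scaled dot-product scoring, and the `top_k` parameter are all hypothetical illustrations. The official PyTorch code lives in the minglunhan/cif-coldec repository.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: Sequence[float], vectors: Sequence[Sequence[float]]) -> Vector:
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def fine_grained_select(
    query: Vector,
    phrases: List[List[Vector]],  # each phrase is a list of token embeddings
    top_k: int,
) -> Tuple[List[int], List[float], Vector, Vector]:
    """Hypothetical sketch: phrase attention -> top-k selection -> token attention."""
    scale = math.sqrt(len(query))

    # Assumption: a phrase embedding is the mean of its token embeddings.
    phrase_embs = [weighted_sum([1.0 / len(p)] * len(p), p) for p in phrases]
    phrase_weights = softmax([dot(query, e) / scale for e in phrase_embs])

    # Phrase selection: keep the top_k phrases by phrase-level attention weight.
    order = sorted(range(len(phrases)), key=lambda i: phrase_weights[i], reverse=True)
    selected = order[:top_k]

    # Re-normalize the kept weights so they sum to 1, giving a more focused
    # phrase-level context (the paper applies this step at inference time).
    kept = [phrase_weights[i] for i in selected]
    renorm = [w / sum(kept) for w in kept]
    phrase_context = weighted_sum(renorm, [phrase_embs[i] for i in selected])

    # Token attention restricted to the tokens of the selected phrases only.
    tokens = [tok for i in selected for tok in phrases[i]]
    token_weights = softmax([dot(query, t) / scale for t in tokens])
    token_context = weighted_sum(token_weights, tokens)

    return selected, renorm, phrase_context, token_context


# Toy example: three contextual phrases with 2-d token embeddings (made-up values).
phrases = [
    [[1.0, 0.0], [0.9, 0.1]],  # phrase 0
    [[0.8, 0.2]],              # phrase 1
    [[0.0, 1.0], [0.1, 0.9]],  # phrase 2
]
query = [1.0, 0.0]
selected, renorm, phrase_ctx, token_ctx = fine_grained_select(query, phrases, top_k=2)
print(selected)  # [0, 1]
```

Restricting token attention to the selected phrases shrinks the set of tokens that compete for attention mass, which is the intuition behind reducing confusion between similar phrases. Unlike the paper, this sketch re-normalizes unconditionally rather than only during inference, and it omits the remaining components described in the full abstract.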


Code & Models

Repositories

minglunhan/cif-coldec
PyTorch · Official


Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing