Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems

Mingyu Cui; Jiawen Kang; Jiajun Deng; Xi Yin; Yutao Xie; Xie Chen; Xunying Liu

arXiv:2306.13307·eess.AS·June 27, 2023

Open Access

TL;DR

This paper introduces a novel method for deriving compact, low-dimensional contextual representations in Conformer-Transducer speech recognition systems, leading to improved accuracy by effectively utilizing cross-utterance context.

Contribution

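The paper proposes attention pooling layers, applied over cached history vectors of the preceding utterances, that learn compact cross-utterance contextual features for streaming Conformer-Transducers. The resulting systems outperform a baseline that uses utterance-internal context only.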

Findings

01

Achieved statistically significant WER reductions of up to 0.7% absolute.

02

Demonstrated the improvements on the 1000-hour Gigaspeech corpus.

03

Outperformed the baseline that uses utterance-internal context only.
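To make "absolute" concrete: an absolute WER reduction is the difference between two error rates in percentage points, while a relative reduction divides that difference by the baseline rate. The sketch below computes WER with a standard word-level edit distance and reports both figures. The sentences and the resulting numbers are toy values invented for illustration; they are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Toy transcripts (hypothetical, not taken from the paper).
reference = "the cat sat on the mat"
baseline_wer = word_error_rate(reference, "the cat sit on mat")    # 2 errors / 6 words
contextual_wer = word_error_rate(reference, "the cat sat on mat")  # 1 error / 6 words

# Absolute reduction is in percentage points; relative is a fraction of the baseline.
absolute_reduction = (baseline_wer - contextual_wer) * 100
relative_reduction = (baseline_wer - contextual_wer) / baseline_wer * 100
print(f"absolute: {absolute_reduction:.2f} points, relative: {relative_reduction:.2f}%")
```

A reported reduction of 0.7% absolute therefore means the WER itself dropped by 0.7 percentage points, regardless of how large the baseline WER was.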

Abstract

Current ASR systems are mainly trained and evaluated at the utterance level. Long-range cross-utterance context can also be incorporated. A key task is to derive a suitable compact representation of the most relevant history contexts. Previous research relied either on LSTM-RNN encoded histories, which attenuate the information from longer-range contexts, or on frame-level concatenation of Transformer context embeddings. In contrast, this paper learns compact low-dimensional cross-utterance contextual features in the Conformer-Transducer encoder using specially designed attention pooling layers applied over efficiently cached history vectors of the preceding utterances. Experiments on the 1000-hr Gigaspeech corpus demonstrate that the proposed contextualized streaming Conformer-Transducers outperform the baseline using utterance-internal context only with statistically significant…
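The general idea of attention pooling over cached history vectors can be sketched as follows. This is a minimal, generic illustration and not the authors' layer design, which the abstract does not specify: it assumes a small set of query vectors (randomly initialised here as a stand-in for learned parameters) that attend over a bounded cache of vectors from preceding utterances, so that a variable-length history is compressed into a fixed number of context vectors. All names and sizes (`attention_pool`, `DIM`, `NUM_QUERIES`, `MAX_CACHED_FRAMES`) are hypothetical.

```python
import math
import random
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(history, queries):
    """Pool a variable-length list of history vectors into len(queries) vectors.

    Each query scores every history vector by scaled dot product; the pooled
    output is the attention-weighted average of the history vectors.
    """
    dim = len(history[0])
    pooled = []
    for q in queries:
        scores = [sum(qi * hi for qi, hi in zip(q, h)) / math.sqrt(dim)
                  for h in history]
        weights = softmax(scores)
        pooled.append([sum(w * h[d] for w, h in zip(weights, history))
                       for d in range(dim)])
    return pooled


# Hypothetical sizes, chosen only for illustration.
DIM, NUM_QUERIES, MAX_CACHED_FRAMES = 4, 2, 6

rng = random.Random(0)
# Stand-in for learned pooling queries (assumption: randomly initialised here).
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

# Bounded cache of history vectors; the oldest entries are evicted first.
cache = deque(maxlen=MAX_CACHED_FRAMES)
for utterance_len in (3, 5):  # two preceding utterances of different lengths
    for _ in range(utterance_len):
        cache.append([rng.gauss(0, 1) for _ in range(DIM)])

context = attention_pool(list(cache), queries)
print(len(context), len(context[0]))  # prints "2 4"
```

Whatever the length of the cached history, the pooled context has a fixed size (`NUM_QUERIES` vectors of dimension `DIM`), which is what makes such a representation compact compared with concatenating every history frame.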

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: Attention Pooling