Contextual-Utterance Training for Automatic Speech Recognition

Alejandro Gomez-Alanis; Lukas Drude; Andreas Schwarz; Rupak Vignesh Swaminathan; Simon Wiesler

arXiv:2210.16238·eess.AS·October 31, 2022

Open Access

TL;DR

This paper introduces a contextual-utterance training method for streaming ASR systems that uses previous and future contextual utterances to reduce both word error rate and last-token emission latency, outperforming classical RNN-T training.

Contribution

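It proposes a dual-mode training approach in which a teacher that sees both past and future contextual utterances is distilled "in-place" into a student that sees only the current and past contextual utterances, so the streaming model makes better use of the acoustic context available to it.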

Findings

01. Reduced WER by more than 6% relative

02. Lowered average last token emission latency by more than 40ms

03. Outperformed classical RNN-T training
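To make the first finding concrete, the short sketch below separates a relative WER reduction from an absolute one. The function names and the baseline figure of 10.0% WER are hypothetical illustrations, not numbers reported in the paper.

```python
def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline WER that was removed (0.06 means 6% relative)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


def absolute_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Difference in WER percentage points."""
    return baseline_wer - improved_wer


# Hypothetical numbers, chosen only to illustrate the arithmetic.
baseline = 10.0   # WER in percent for an assumed baseline system
improved = 9.4    # WER in percent after an assumed improvement

rel = relative_reduction(baseline, improved)
abs_pts = absolute_reduction(baseline, improved)
print(f"relative: {rel:.1%}, absolute: {abs_pts:.1f} points")
```

With these assumed values, a 6% relative reduction corresponds to only 0.6 absolute percentage points, which is why the two conventions should not be confused.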

Abstract

Recent studies of streaming automatic speech recognition (ASR) systems based on the recurrent neural network transducer (RNN-T) have fed the encoder with past contextual information in order to improve word error rate (WER) performance. In this paper, we first propose a contextual-utterance training technique which makes use of the previous and future contextual utterances in order to adapt implicitly to the speaker, topic and acoustic environment. We also propose a dual-mode contextual-utterance training technique for streaming ASR systems. This approach makes better use of the available acoustic context in streaming models by distilling "in-place" the knowledge of a teacher, which is able to see both past and future contextual utterances, to the student, which can only see the current and past contextual utterances. The…
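The following is a minimal sketch of the two ideas in the abstract, under stated assumptions. `build_context_input` shows one plausible way to assemble the encoder input from neighbouring utterances (the student omits the future utterance). `dual_mode_loss` shows one common form of an in-place distillation objective: a teacher loss, a student loss, and a KL term pulling the student's output distribution toward the teacher's. The function names, the additive loss form, and `distill_weight` are all assumptions made for illustration; the abstract does not specify the exact objective.

```python
import math


def build_context_input(prev_utt, cur_utt, next_utt=None):
    """Concatenate feature frames of neighbouring utterances along time.

    Student (streaming) mode passes next_utt=None: past + current only.
    Teacher mode also appends the future utterance.
    Returns the frames and the (start, end) span of the current utterance,
    so a loss can be restricted to the current utterance's frames.
    """
    frames = list(prev_utt) + list(cur_utt)
    span = (len(prev_utt), len(prev_utt) + len(cur_utt))
    if next_utt is not None:
        frames += list(next_utt)
    return frames, span


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def dual_mode_loss(teacher_loss, student_loss,
                   teacher_logits, student_logits, distill_weight=1.0):
    """Assumed objective: teacher loss + student loss + weighted KL term.

    teacher_logits / student_logits: per-frame lists of logits for the
    current utterance. In a real framework the teacher logits would be
    treated as constants (no gradient) inside the KL term.
    """
    kd = sum(kl_divergence(softmax(t), softmax(s))
             for t, s in zip(teacher_logits, student_logits))
    kd /= len(teacher_logits)
    return teacher_loss + student_loss + distill_weight * kd


# Toy usage with made-up frame values and logits.
teacher_in, span = build_context_input([1, 2], [3, 4, 5], [6])
student_in, _ = build_context_input([1, 2], [3, 4, 5])
loss = dual_mode_loss(1.0, 2.0,
                      teacher_logits=[[2.0, 0.0], [0.0, 1.0]],
                      student_logits=[[1.0, 0.0], [0.0, 0.5]])
```

"In-place" here is read as teacher and student sharing one set of weights and differing only in the context they see, so both forward passes come from the same model in a single training step. That reading is an interpretation of the abstract, not a confirmed implementation detail.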

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing