Contextual-Utterance Training for Automatic Speech Recognition
Alejandro Gomez-Alanis, Lukas Drude, Andreas Schwarz, Rupak Vignesh Swaminathan, Simon Wiesler

TL;DR
This paper introduces a novel contextual-utterance training method for streaming ASR systems that leverages past and future context to improve word error rate and latency, outperforming traditional RNN-T training.
Contribution
It proposes a dual-mode training approach that distills contextual knowledge from a teacher model, enhancing streaming ASR performance with better context utilization.
Findings
Reduced WER by over 6%
Lowered last token emission latency by more than 40ms
Outperformed classical RNN-T training methods
Abstract
Recent studies of streaming automatic speech recognition (ASR) systems based on the recurrent neural network transducer (RNN-T) have fed the encoder with past contextual information in order to improve word error rate (WER). In this paper, we first propose a contextual-utterance training technique which makes use of the previous and future contextual utterances to adapt implicitly to the speaker, topic and acoustic environment. We also propose a dual-mode contextual-utterance training technique for streaming ASR systems. This approach makes better use of the available acoustic context in streaming models by distilling "in-place" the knowledge of a teacher, which can see both past and future contextual utterances, to the student, which can only see the current and past contextual utterances. The…
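The teacher/student split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "in-place" means both modes share one set of encoder weights, it uses a toy windowed-average encoder in place of an RNN-T encoder, and it uses a per-frame KL divergence as the distillation term, which the abstract does not specify. All function names (`build_input`, `toy_encoder`, `distill_kl`) are hypothetical.

```python
import numpy as np

def build_input(prev, cur, nxt, mode):
    """Concatenate contextual utterances along the time axis.

    mode="teacher": previous + current + future utterance.
    mode="student": previous + current utterance only.
    Returns the features and the slice covering the current utterance.
    """
    segments = [prev, cur] + ([nxt] if mode == "teacher" else [])
    feats = np.concatenate(segments, axis=0)
    start = prev.shape[0]
    return feats, slice(start, start + cur.shape[0])

def toy_encoder(feats, W, lookahead):
    """Stand-in for a shared encoder (assumption, not an RNN-T encoder).

    Each output frame averages up to 2 past frames, the current frame and
    `lookahead` future frames, then applies the shared projection W.
    """
    T = feats.shape[0]
    out = np.empty((T, W.shape[1]))
    for t in range(T):
        lo, hi = max(0, t - 2), min(T, t + 1 + lookahead)
        out[t] = feats[lo:hi].mean(axis=0) @ W
    return out

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def distill_kl(teacher_logits, student_logits):
    """Mean per-frame KL(teacher || student); the teacher is the fixed target."""
    lt = log_softmax(teacher_logits)
    ls = log_softmax(student_logits)
    return float((np.exp(lt) * (lt - ls)).sum(axis=-1).mean())

# Usage with random features: (frames, feature_dim) per utterance.
rng = np.random.default_rng(0)
prev, cur, nxt = (rng.normal(size=(n, 4)) for n in (5, 6, 3))
W = rng.normal(size=(4, 8))  # weights shared by both modes

t_in, t_cur = build_input(prev, cur, nxt, "teacher")
s_in, s_cur = build_input(prev, cur, nxt, "student")
teacher_out = toy_encoder(t_in, W, lookahead=2)[t_cur]  # sees future context
student_out = toy_encoder(s_in, W, lookahead=0)[s_cur]  # past and current only
distill_loss = distill_kl(teacher_out, student_out)
```

In a real training loop, a term like `distill_loss` would presumably be added to the transducer losses of both modes, with gradients blocked through the teacher output; how the terms are weighted is not stated in the abstract.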
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
