Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End
Swayambhu Nath Ray, Minhua Wu, Anirudh Raju, Pegah Ghahremani, Raghavendra Bilgi, Milind Rao, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Jasha Droppo

TL;DR
This paper improves speech recognition by feeding explicit intent representations, derived from an audio-to-intent (A2I) model, into an RNN-T based ASR system, yielding relative WER reductions of 5.56% in non-streaming mode and 3.33% in streaming mode.
Contribution
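The paper presents what the authors describe as a novel study of using intent representations (embeddings or posteriors) from an audio-to-intent model as auxiliary inputs to an RNN-T ASR system, evaluated in both non-streaming and streaming modes.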
Findings
The non-streaming intent-based system achieves a 5.56% relative WER reduction (WERR).
The streaming intent-based system achieves a 3.33% WERR.
Significant improvements on media-related intents, e.g., 9.12% WERR on PlayMusicIntent.
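The figures above are relative reductions. As a reminder of how such a number is conventionally computed, here is a minimal sketch: WERR is the drop in WER divided by the baseline WER. The function name and the WER values are illustrative assumptions, not taken from the paper, which reports only the relative figures here.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction (WERR) in percent.

    WERR = 100 * (baseline_wer - new_wer) / baseline_wer
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values chosen for illustration only (not from the paper).
werr = relative_wer_reduction(baseline_wer=10.0, new_wer=9.5)
print(f"{werr:.2f}%")
```

A relative reduction is independent of the absolute WER scale, which is why results on proprietary corpora are often reported this way.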
Abstract
Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to improve a recurrent neural network-transducer (RNN-T) based automatic speech recognition (ASR) system. An audio-to-intent (A2I) model encodes the intent of the utterance in the form of embeddings or posteriors, and these are used as auxiliary inputs for RNN-T training and inference. Experimenting with a 50k-hour far-field English speech corpus, this study shows that when running the system in non-streaming mode, where intent representation is extracted from the entire utterance and then used to bias streaming RNN-T search from the start, it provides a 5.56% relative word error rate reduction (WERR). On the other hand, a streaming system using…
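One simple way to picture "auxiliary inputs" is feature concatenation. The sketch below assumes the intent vector is appended to each acoustic feature frame before the frames reach the RNN-T encoder. The excerpt above does not specify where or how the intent representation is injected, so the concatenation point, the function names, and the toy dimensions are all assumptions for illustration.

```python
from typing import List, Sequence

Frame = List[float]


def append_utterance_intent(frames: Sequence[Frame], intent: Frame) -> List[Frame]:
    """Non-streaming style (assumed): one intent vector, computed from the
    whole utterance, is appended to every acoustic frame."""
    return [list(frame) + list(intent) for frame in frames]


def append_per_frame_intent(frames: Sequence[Frame],
                            intents: Sequence[Frame]) -> List[Frame]:
    """Streaming style (assumed): each frame is paired with the intent
    vector available at that time step."""
    if len(frames) != len(intents):
        raise ValueError("need exactly one intent vector per frame")
    return [list(f) + list(i) for f, i in zip(frames, intents)]


# Toy data: 3 frames of 2-dim acoustic features, 2-dim intent vectors.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
utterance_intent = [0.9, 0.1]
streaming_intents = [[0.5, 0.5], [0.7, 0.3], [0.9, 0.1]]

full_context = append_utterance_intent(frames, utterance_intent)
streaming = append_per_frame_intent(frames, streaming_intents)
print(len(full_context), len(full_context[0]))
print(streaming[0])
```

In the first variant every frame sees the same utterance-level intent from the start, which requires the whole utterance up front. The second variant lets the intent estimate evolve as audio arrives.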
