Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End

Swayambhu Nath Ray; Minhua Wu; Anirudh Raju; Pegah Ghahremani; Raghavendra Bilgi; Milind Rao; Harish Arsikere; Ariya Rastrow; Andreas Stolcke; Jasha Droppo

arXiv:2105.07071·eess.AS·February 22, 2022

TL;DR

This paper studies whether explicit intent representations, produced by an audio-to-intent (A2I) model and fed as auxiliary inputs to an RNN-T based ASR system, improve recognition. It reports relative WER reductions of 5.56% in non-streaming mode and 3.33% in streaming mode.

Contribution

The authors present what they describe as a novel study of intent representations (embeddings or posteriors) used as auxiliary inputs for RNN-T training and inference, with accuracy gains in both non-streaming and streaming modes.

Findings

1. The non-streaming intent-based system achieves a 5.56% relative WER reduction (WERR).

2. The streaming intent-based system achieves a 3.33% relative WER reduction.

3. Significant improvements on media-related intents, e.g., 9.12% WERR on PlayMusicIntent.
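
The figures above are relative reductions, not absolute drops in WER. As a minimal sketch of that arithmetic, assuming the usual definition WERR = (baseline WER - new WER) / baseline WER, the helper below (a hypothetical name, not from the paper) computes the percentage. The input values are made up for illustration and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative word error rate reduction (WERR) in percent.

    Both arguments must be on the same scale (e.g. both in percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs for illustration only; not taken from the paper.
example_werr = relative_wer_reduction(baseline_wer=10.0, new_wer=9.5)
print(f"WERR: {example_werr:.2f}%")
```

Under this definition, a 5.56% WERR means the new system's WER is about 94.44% of the baseline WER, whatever the baseline happens to be.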

Abstract

Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to improve a recurrent neural network-transducer (RNN-T) based automatic speech recognition (ASR) system. An audio-to-intent (A2I) model encodes the intent of the utterance in the form of embeddings or posteriors, and these are used as auxiliary inputs for RNN-T training and inference. Experimenting with a 50k-hour far-field English speech corpus, this study shows that when running the system in non-streaming mode, where intent representation is extracted from the entire utterance and then used to bias streaming RNN-T search from the start, it provides a 5.56% relative word error rate reduction (WERR). On the other hand, a streaming system using…
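
The abstract says intent representations serve as auxiliary inputs to the RNN-T but, as quoted here, does not say how they are combined with the acoustic features. The sketch below assumes simple per-frame concatenation, which is one common choice and not necessarily the authors' method. All function names, feature sizes, and values are hypothetical. The first helper tiles a single utterance-level intent vector across every frame (matching the non-streaming description); the second pairs each frame with its own intent vector, which is an assumed shape for a streaming variant.

```python
from typing import List, Sequence

Frame = List[float]


def append_utterance_intent(
    frames: Sequence[Sequence[float]], intent: Sequence[float]
) -> List[Frame]:
    """Concatenate one utterance-level intent vector onto every acoustic frame."""
    return [list(frame) + list(intent) for frame in frames]


def append_per_frame_intent(
    frames: Sequence[Sequence[float]], intents: Sequence[Sequence[float]]
) -> List[Frame]:
    """Concatenate a separate intent vector onto each acoustic frame."""
    if len(frames) != len(intents):
        raise ValueError("need exactly one intent vector per frame")
    return [list(frame) + list(vec) for frame, vec in zip(frames, intents)]


# Toy data: 2 frames of 2-dim acoustic features, 2-class intent posteriors.
acoustic = [[0.1, 0.2], [0.3, 0.4]]
utterance_intent = [0.9, 0.1]           # computed from the whole utterance
running_intents = [[0.5, 0.5], [0.8, 0.2]]  # assumed per-frame estimates

non_streaming_input = append_utterance_intent(acoustic, utterance_intent)
streaming_input = append_per_frame_intent(acoustic, running_intents)
```

In both cases the encoder would see frames whose dimension is the acoustic feature size plus the intent vector size; the only difference is whether the intent part is constant over the utterance or changes from frame to frame.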
