Audio-to-Intent Using Acoustic-Textual Subword Representations from End-to-End ASR

Pranay Dighe; Prateeth Nayak; Oggi Rudovic; Erik Marchi; Xiaochuan Niu; Ahmed Tewfik

arXiv:2210.12134·cs.CL·October 24, 2022

Open Access

TL;DR

This paper introduces a novel method for predicting user intent in voice assistants by leveraging acoustic and textual subword representations from end-to-end ASR, improving robustness and accuracy in intent detection.

Contribution

The paper presents a new approach combining acoustic and textual subword features with positional encoding for more effective intent classification from speech.

Findings

01. Achieves 93.3% accuracy in filtering unintended user audio

02. Provides more robust intent representations than previous methods

03. Effectively combines acoustic and textual subword information

Abstract

Accurate prediction of the user's intent to interact with a voice assistant (VA) on a device (e.g., a phone) is critical for achieving naturalistic, engaging, and privacy-centric interactions with the VA. To this end, we present a novel approach to predict the user's intent (whether or not the user is speaking to the device) directly from acoustic and textual information encoded at subword tokens, which are obtained via an end-to-end ASR model. Directly modeling the subword tokens, rather than phonemes and/or full words, has at least two advantages: (i) it provides a unique vocabulary representation in which each token has a semantic meaning, in contrast to phoneme-level representations; and (ii) each subword token has a reusable "sub"-word acoustic pattern (that can be used to construct multiple full words), resulting in a much smaller vocabulary space than that of full words. To…
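The general idea summarized above (per-token acoustic and textual subword features, combined with positional information, feeding an intent classifier) can be illustrated with a minimal sketch. This is not the authors' architecture, which this page does not describe. The concatenation-based fusion, the sinusoidal form of the positional encoding, the mean pooling, the logistic classifier, and every name below (`sinusoidal_positions`, `intent_probability`, `text_table`, the toy dimensions) are assumptions made for illustration only.

```python
import math
import random


def sinusoidal_positions(num_tokens, dim):
    """Sinusoidal positional encoding (an assumed choice); dim must be even."""
    table = []
    for pos in range(num_tokens):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000.0 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


def intent_probability(acoustic, token_ids, text_table, weights, bias=0.0):
    """Toy intent score for one utterance (hypothetical model, not the paper's).

    acoustic   : one acoustic feature vector per subword token
    token_ids  : subword token ids, as an ASR model might emit them
    text_table : textual embedding vector for each subword id
    weights    : classifier weights over the fused feature dimension
    """
    # Assumed fusion: concatenate acoustic and textual vectors per token.
    fused = [list(a) + list(text_table[t]) for a, t in zip(acoustic, token_ids)]
    # Add positional information so token order is visible to the classifier.
    positions = sinusoidal_positions(len(fused), len(fused[0]))
    fused = [[x + p for x, p in zip(row, prow)]
             for row, prow in zip(fused, positions)]
    # Assumed pooling: average over tokens to get one utterance-level vector.
    pooled = [sum(column) / len(fused) for column in zip(*fused)]
    # Logistic classifier: probability that the audio is device-directed.
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy usage with random (untrained) values; all sizes are arbitrary.
rng = random.Random(0)
acoustic = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
text_table = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
token_ids = [2, 5, 1]
weights = [rng.gauss(0, 1) for _ in range(8)]
prob = intent_probability(acoustic, token_ids, text_table, weights)
```

With untrained random weights the resulting probability is meaningless. The sketch only shows how the two subword-level feature streams and the position signal could be brought together before classification.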

Taxonomy

Topics: Music and Audio Processing · AI in Service Interactions · Speech and dialogue systems