Device Directedness with Contextual Cues for Spoken Dialog Systems

Dhanush Bekal; Sundararajan Srinivasan; Sravan Bodapati; Srikanth Ronanki; Katrin Kirchhoff

arXiv:2211.13280 · cs.CL · November 28, 2022

TL;DR

This paper introduces a speech-based barge-in verification model that leverages self-supervised speech representations and lexical infusion, achieving faster and more accurate classification in spoken dialog systems.

Contribution

It proposes a novel method to incorporate lexical information into speech representations for improved barge-in verification in dialog systems.
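This page does not describe how the lexical information is actually infused, so the sketch below shows only one generic way such an idea could be approached: a cross-modal alignment term that pulls a pooled speech representation toward a text embedding of the utterance's transcript during training, so that only audio is needed at inference. This swapped-in technique may differ from the authors' method. The features, the text embedding, and the names `mean_pool`, `align_loss`, and `align_step` are all hypothetical, and only a linear projection `W` is trained here as a stand-in for the speech encoder.

```python
import numpy as np

def mean_pool(frames: np.ndarray) -> np.ndarray:
    """Average frame-level speech features of shape (T, D) into one vector of shape (D,)."""
    return frames.mean(axis=0)

def align_loss(W: np.ndarray, s: np.ndarray, t: np.ndarray) -> float:
    """Mean squared error between the projected speech vector W @ s and a text vector t."""
    diff = W @ s - t
    return float(np.mean(diff ** 2))

def align_step(W: np.ndarray, s: np.ndarray, t: np.ndarray, lr: float = 1.0) -> np.ndarray:
    """One gradient-descent step on align_loss with respect to the projection W."""
    diff = W @ s - t
    grad = 2.0 * np.outer(diff, s) / diff.size  # d/dW of mean((W s - t)^2)
    return W - lr * grad

# Hypothetical toy data: random stand-ins for self-supervised frame features
# and for a sentence embedding of the utterance's ASR transcript.
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 16))       # T=50 frames, D=16 speech dims
text_vec = rng.normal(size=8)            # 8-dim text embedding (assumed)
W = rng.normal(scale=0.1, size=(8, 16))  # trainable projection (toy stand-in for the encoder)

s = mean_pool(frames)
before = align_loss(W, s, text_vec)
for _ in range(100):
    W = align_step(W, s, text_vec)
after = align_loss(W, s, text_vec)
```

In a full system this alignment term would typically be added to the classification loss with a weighting factor and optimized over many utterances; the toy loop only shows that the speech side can be pulled toward lexical targets.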

Findings

1. Inference 38% faster (relative) than the baseline LSTM
2. 4.5% relative F1 score improvement over a baseline LSTM that uses both audio and ASR 1-best hypotheses
3. Additional 5.7% F1 score gain with lexical infusion
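The findings are stated as relative changes, which are easy to confuse with absolute F1 points. The sketch below shows the arithmetic with hypothetical numbers chosen only for illustration (they are not results from the paper), and it assumes "38% faster" is read as a 38% relative reduction in latency, which this page does not confirm.

```python
def relative_change(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline` (positive = increase)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0

# Hypothetical numbers, chosen only to illustrate the arithmetic.
baseline_f1, model_f1 = 0.800, 0.836
rel_gain = relative_change(baseline_f1, model_f1)        # about +4.5 (% relative)
abs_gain_points = (model_f1 - baseline_f1) * 100.0       # about 3.6 absolute F1 points

baseline_ms, model_ms = 100.0, 62.0
latency_change = relative_change(baseline_ms, model_ms)  # about -38 (% relative)
```

With these numbers, a 4.5% relative gain corresponds to only 3.6 absolute F1 points, so the two readings should not be mixed when comparing systems.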

Abstract

In this work, we define barge-in verification as a supervised learning task where audio-only information is used to classify user spoken dialogue into true and false barge-ins. Following the success of pre-trained models, we use low-level speech representations from a self-supervised representation learning model for our downstream classification task. Further, we propose a novel technique to infuse lexical information directly into speech representations to improve the domain-specific language information implicitly learned during pre-training. Experiments conducted on spoken dialog data show that our proposed model, trained to validate barge-in entirely from speech representations, is faster by 38% relative and achieves a 4.5% relative F1 score improvement over a baseline LSTM model that uses both audio and Automatic Speech Recognition (ASR) 1-best hypotheses. On top of this, our best…
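The abstract frames barge-in verification as binary classification over speech representations. The sketch below illustrates that framing under heavy simplifying assumptions: random vectors stand in for frame-level outputs of a self-supervised speech model, utterances are mean-pooled, and a logistic-regression head (a simplification, not the authors' architecture) predicts true versus false barge-in, scored with F1. The data generator `make_synthetic` and its labels are hypothetical.

```python
import numpy as np

def mean_pool(frames: np.ndarray) -> np.ndarray:
    # frames: (T, D) frame-level features for one utterance -> (D,)
    return frames.mean(axis=0)

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50.0, 50.0)))  # clip avoids overflow

def train_logreg(X: np.ndarray, y: np.ndarray, lr: float = 0.5, epochs: int = 300):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        err = sigmoid(X @ w + b) - y  # gradient of mean binary cross-entropy w.r.t. logits
        w -= lr * (X.T @ err) / n
        b -= lr * float(err.mean())
    return w, b

def predict(X: np.ndarray, w: np.ndarray, b: float, threshold: float = 0.5) -> np.ndarray:
    return (sigmoid(X @ w + b) >= threshold).astype(int)

def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def make_synthetic(n_utts: int = 200, d: int = 32, seed: int = 0):
    # Hypothetical data: 1 = true barge-in, 0 = false barge-in, variable-length utterances.
    rng = np.random.default_rng(seed)
    X, y = [], []
    for i in range(n_utts):
        label = i % 2
        T = rng.integers(20, 60)
        shift = 0.5 if label == 1 else -0.5
        frames = rng.normal(loc=shift, scale=1.0, size=(T, d))
        X.append(mean_pool(frames))
        y.append(label)
    return np.stack(X), np.array(y)

X_train, y_train = make_synthetic(seed=0)
X_test, y_test = make_synthetic(seed=1)
w, b = train_logreg(X_train, y_train)
test_f1 = f1_score(y_test, predict(X_test, w, b))
```

Mean pooling keeps the head independent of utterance length; a recurrent or attention-based head over the frame sequence would be the natural next step.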

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Topic Modeling

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
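The methods listed above (sigmoid, tanh, LSTM) correspond to the abstract's baseline LSTM model. The sketch below shows the standard LSTM cell recurrence over a sequence of per-frame features, followed by a sigmoid output giving a barge-in probability. The baseline's real inputs (audio plus ASR 1-best hypotheses), layer sizes, and depth are not given on this page, so the dimensions, random weights, and function names here are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_final_state(x_seq: np.ndarray, W: np.ndarray, U: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Run a single-layer LSTM over x_seq of shape (T, D) and return the last hidden state.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias,
    with gate blocks stacked in the order [input, forget, candidate, output].
    """
    H = U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    for x in x_seq:
        z = W @ x + U @ h + b
        i = sigmoid(z[0:H])          # input gate
        f = sigmoid(z[H:2 * H])      # forget gate
        g = np.tanh(z[2 * H:3 * H])  # candidate cell update
        o = sigmoid(z[3 * H:4 * H])  # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

def barge_in_probability(h: np.ndarray, v: np.ndarray, bias: float) -> float:
    """Sigmoid output layer mapping the final hidden state to P(true barge-in)."""
    return float(sigmoid(v @ h + bias))

rng = np.random.default_rng(0)
T, D, H = 40, 16, 8
x_seq = rng.normal(size=(T, D))  # hypothetical per-frame audio features
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
v = rng.normal(scale=0.1, size=H)

h = lstm_final_state(x_seq, W, U, b)
p = barge_in_probability(h, v, 0.0)
```

The sequential loop over frames is one reason recurrent baselines can be slower at inference than models that pool precomputed representations, though the page does not state what drives the reported speedup.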