Device Directedness with Contextual Cues for Spoken Dialog Systems
Dhanush Bekal, Sundararajan Srinivasan, Sravan Bodapati, Srikanth Ronanki, Katrin Kirchhoff

TL;DR
This paper introduces a speech-based barge-in verification model that leverages self-supervised speech representations and lexical infusion, achieving faster and more accurate classification in spoken dialog systems.
Contribution
It proposes a novel method to incorporate lexical information into speech representations for improved barge-in verification in dialog systems.
Findings
38% relative faster inference than a baseline LSTM that uses both audio and ASR 1-best hypotheses
4.5% relative F1 score improvement over that same audio-plus-ASR baseline
Additional 5.7% F1 score gain with lexical infusion
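The abstract reports these gains as relative figures, that is, as a fraction of the baseline value rather than as absolute F1 points. A minimal sketch of that arithmetic follows; the function name and the F1 values in the usage note are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the change from baseline to new as a fraction of baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical F1 scores chosen only to illustrate the formula.
    baseline_f1 = 0.800
    new_f1 = 0.836
    print(f"{relative_improvement(baseline_f1, new_f1):.1%}")
```

With these illustrative values, a move from 0.800 to 0.836 is a 4.5% relative improvement but only 3.6 absolute F1 points.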
Abstract
In this work, we define barge-in verification as a supervised learning task where audio-only information is used to classify user spoken dialogue into true and false barge-ins. Following the success of pre-trained models, we use low-level speech representations from a self-supervised representation learning model for our downstream classification task. Further, we propose a novel technique to infuse lexical information directly into speech representations to improve the domain-specific language information implicitly learned during pre-training. Experiments conducted on spoken dialog data show that our proposed model trained to validate barge-in entirely from speech representations is faster by 38% relative and achieves 4.5% relative F1 score improvement over a baseline LSTM model that uses both audio and Automatic Speech Recognition (ASR) 1-best hypotheses. On top of this, our best…
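The classification setup described in the abstract can be illustrated with a small sketch: frame-level speech representations are pooled into one utterance-level vector, which a binary classifier maps to a true or false barge-in decision. This is not the authors' architecture. The mean pooling, the single linear layer with a sigmoid output, the 0.5 threshold, and every function and parameter name below are assumptions made for illustration, and the frame vectors stand in for the output of a self-supervised speech encoder.

```python
import math
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level vectors into one utterance-level vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def barge_in_probability(
    frames: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Score an utterance with a linear layer over the pooled vector."""
    pooled = mean_pool(frames)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return sigmoid(logit)


def is_true_barge_in(
    frames: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
    threshold: float = 0.5,  # assumed decision threshold
) -> bool:
    """Label the utterance a true barge-in when the score meets the threshold."""
    return barge_in_probability(frames, weights, bias) >= threshold


if __name__ == "__main__":
    # Two hypothetical 2-dimensional frames; real encoders emit far larger vectors.
    frames = [[1.0, 0.0], [3.0, 2.0]]
    print(is_true_barge_in(frames, weights=[1.0, -2.0], bias=0.5))
```

In practice the weights would be learned from labeled true and false barge-ins, and the pooling step could be replaced by a recurrent or attention layer over the frames.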
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Topic Modeling
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
