When can I Speak? Predicting initiation points for spoken dialogue agents
Siyan Li, Ashwin Paranjape, Christopher D. Manning

TL;DR
This paper presents a method to predict when a spoken dialogue agent should initiate speaking, using prosodic and linguistic features to enable more natural and timely responses, outperforming traditional silence-based triggers.
Contribution
The paper combines prosodic features from wav2vec 1.0 (computed on user audio) with word features from GPT-2 (computed on incremental transcriptions) to predict the lead time to an initiation point, so that an agent can time its response in advance.
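One way to picture how an audio stream and an incremental transcript could be combined is sketched below. This is an illustrative assumption, not the authors' architecture: the function name `fuse_features`, the carry-forward alignment, and the zero vector used before the first word arrives are all hypothetical. Each audio frame is paired with the feature vector of the most recent word available at that frame.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def fuse_features(
    prosodic: Sequence[Sequence[float]],
    word_feats: Sequence[Tuple[int, Sequence[float]]],
    word_dim: int,
) -> List[List[float]]:
    """Concatenate per-frame prosodic features with the latest word feature.

    prosodic[t] is the (hypothetical) prosodic vector for audio frame t.
    word_feats is a list of (frame_index, vector) pairs sorted by frame_index,
    where frame_index is the frame at which that word becomes available in
    the incremental transcript. Frames before the first word get zeros.
    """
    arrival_frames = [frame for frame, _ in word_feats]
    fused = []
    for t, prosodic_vec in enumerate(prosodic):
        # Number of words whose arrival frame is <= t.
        words_seen = bisect_right(arrival_frames, t)
        if words_seen > 0:
            word_vec = list(word_feats[words_seen - 1][1])
        else:
            word_vec = [0.0] * word_dim
        fused.append(list(prosodic_vec) + word_vec)
    return fused


# Toy example: four frames, two words arriving at frames 1 and 3.
fused = fuse_features(
    prosodic=[[0.1], [0.2], [0.3], [0.4]],
    word_feats=[(1, [1.0, 1.0]), (3, [2.0, 2.0])],
    word_dim=2,
)
```

In a real system the fused per-frame vectors would be fed to a model that outputs a lead-time estimate; that model is left out here because the summary does not describe it.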
Findings
The method outperforms features from prior work on both proposed lead-time metrics.
It vastly outperforms the common baseline of waiting for 700ms of silence before responding.
Models are trained and evaluated on the Switchboard Corpus.
Abstract
Current spoken dialogue systems initiate their turns after a long period of silence (700-1000ms), which leads to little real-time feedback, sluggish responses, and an overall stilted conversational flow. Humans typically respond within 200ms and successfully predicting initiation points in advance would allow spoken dialogue agents to do the same. In this work, we predict the lead-time to initiation using prosodic features from a pre-trained speech representation model (wav2vec 1.0) operating on user audio and word features from a pre-trained language model (GPT-2) operating on incremental transcriptions. To evaluate errors, we propose two metrics w.r.t. predicted and true lead times. We train and evaluate the models on the Switchboard Corpus and find that our method outperforms features from prior work on both metrics and vastly outperforms the common approach of waiting for 700ms of…
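The silence-based baseline that the abstract compares against can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `silence_trigger_time_ms`, the 10ms frame size, and the per-frame boolean voice-activity input are hypothetical and do not come from the paper. The sketch shows why such a trigger can never respond sooner than its silence threshold.

```python
from typing import List, Optional


def silence_trigger_time_ms(
    voiced: List[bool],
    frame_ms: int = 10,
    silence_threshold_ms: int = 700,
) -> Optional[int]:
    """Return the time (ms) at which a silence-based agent would start speaking.

    voiced[i] is True when frame i contains user speech (assumed to come from
    some voice activity detector). The agent fires at the end of the first run
    of consecutive unvoiced frames that lasts silence_threshold_ms, counted
    only after the user has spoken at least once. Returns None if no such run
    occurs.
    """
    frames_needed = -(-silence_threshold_ms // frame_ms)  # ceiling division
    heard_speech = False
    silent_run = 0
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return (i + 1) * frame_ms
    return None


# The user speaks for 1000ms, then stays silent for 1000ms.
user_end_ms = 1000
trigger_ms = silence_trigger_time_ms([True] * 100 + [False] * 100)
response_gap_ms = trigger_ms - user_end_ms
```

Here the agent starts speaking 700ms after the user stops, well above the roughly 200ms human response gap cited in the abstract. Predicting the lead time to the initiation point ahead of time is what would let an agent close that gap.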
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Natural Language Processing Techniques
