TL;DR
This paper explores speech features for continuous turn-taking prediction in dialog systems using LSTMs, aiming to improve fluidity and overlap handling beyond traditional end-of-turn models.
Contribution
It identifies effective speech-related features for turn prediction and demonstrates that LSTM-based models outperform previous baselines in this task.
Findings
Traditional acoustic features perform well for turn prediction.
Word features outperform part-of-speech features.
LSTM models outperform previous baselines.
Abstract
For spoken dialog systems to conduct fluid conversational interactions with users, the systems must be sensitive to turn-taking cues produced by a user. Models should be designed so that effective decisions can be made as to when it is appropriate, or not, for the system to speak. Traditional end-of-turn models, where decisions are made at utterance end-points, are limited in their ability to model fast turn-switches and overlap. A more flexible approach is to model turn-taking in a continuous manner using RNNs, where the system predicts speech probability scores for discrete frames within a future window. The continuous predictions represent generalized turn-taking behaviors observed in the training data and can be applied to make decisions that are not just limited to end-of-turn detection. In this paper, we investigate optimal speech-related feature sets for making predictions at…
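The continuous formulation in the abstract can be illustrated with a minimal sketch of how per-frame training targets might be built: for each frame, the target is the speech activity over the next few frames. The function name `future_window_targets`, the binary voice-activity input, and the window length of 3 frames are assumptions made for this illustration and are not taken from the paper.

```python
from typing import List


def future_window_targets(voice_activity: List[int], window: int) -> List[List[int]]:
    """For each frame t, return the speech labels of frames t+1 .. t+window.

    Frames too close to the end to have a full future window are dropped.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    n_usable = max(len(voice_activity) - window, 0)
    return [voice_activity[t + 1 : t + 1 + window] for t in range(n_usable)]


# Hypothetical voice-activity track, one value per frame (1 = speech, 0 = silence).
vad = [1, 1, 1, 0, 0, 0, 1, 1]
targets = future_window_targets(vad, window=3)

for t, target in enumerate(targets):
    print(t, target)
```

A recurrent model trained on such targets would output one probability per future frame at every time step, so a dialog system could consult those scores at any point rather than only at detected utterance end-points.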
