How Much Does Prosody Help Turn-taking? Investigations using Voice Activity Projection Models
Erik Ekstedt, Gabriel Skantze

TL;DR
This paper investigates how prosody influences turn-taking in conversation by analyzing Voice Activity Projection models, which learn to predict upcoming speech activity without annotated turn-taking events or explicitly modeled prosodic features.
Contribution
It demonstrates that Voice Activity Projection models implicitly utilize prosodic cues for turn-taking, revealing the role of prosody in conversational dynamics.
Findings
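On aggregate quantitative metrics over long-form conversations, the models make use of prosodic information.
On single utterances specifically designed to depend on prosody, prosodic cues affect the models' turn-shift predictions.
Voice Activity Projection models are trained in a self-supervised manner, without explicit annotation of turn-taking events or explicit modeling of prosodic features.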
Abstract
Turn-taking is a fundamental aspect of human communication and can be described as the ability to take turns, project upcoming turn shifts, and supply backchannels at appropriate locations throughout a conversation. In this work, we investigate the role of prosody in turn-taking using the recently proposed Voice Activity Projection model, which incrementally models the upcoming speech activity of the interlocutors in a self-supervised manner, without relying on explicit annotation of turn-taking events, or the explicit modeling of prosodic features. Through manipulation of the speech signal, we investigate how these models implicitly utilize prosodic information. We show that these systems learn to utilize various prosodic aspects of speech both on aggregate quantitative metrics of long-form conversations and on single utterances specifically designed to depend on prosody.
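To make the self-supervised idea in the abstract concrete, the following is a minimal sketch of how training targets could be derived from voice activity alone, with no turn-taking annotation. It is not the authors' implementation: the frame-based representation, the `horizon` parameter, the function name `projection_labels`, and the choice of "fraction of active frames in the upcoming window" as the target are all assumptions made for illustration.

```python
import numpy as np


def projection_labels(va, horizon):
    """Derive per-frame projection targets from voice activity (hypothetical helper).

    va: array of shape (n_frames, n_speakers), 1 where a speaker is active.
    horizon: number of future frames to summarize (an assumed parameter).
    Returns an array of shape (n_frames - horizon, n_speakers) holding, for
    each frame t, the fraction of frames t+1 .. t+horizon in which each
    speaker is active.
    """
    va = np.asarray(va, dtype=float)
    n_frames, n_speakers = va.shape
    # Prefix sums with a leading zero row, so any window sum is a difference.
    csum = np.vstack([np.zeros((1, n_speakers)), np.cumsum(va, axis=0)])
    # Window for frame t starts at t+1 and covers `horizon` frames.
    starts = np.arange(1, n_frames - horizon + 1)
    window_sums = csum[starts + horizon] - csum[starts]
    return window_sums / horizon


# Toy dialogue: speaker A talks for four frames, then speaker B takes over.
va = np.array([
    [1, 0], [1, 0], [1, 0], [1, 0],
    [0, 1], [0, 1], [0, 1], [0, 1],
])
labels = projection_labels(va, horizon=2)
```

In this toy example the target at frame 2 is split evenly between the two speakers, because its two-frame window straddles the turn shift. A model trained to predict such targets from audio has to pick up on whatever cues signal the shift, which is the sense in which no explicit turn-taking or prosodic labels are needed.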
Taxonomy
Topics: Speech and dialogue systems · Phonetics and Phonology Research · Natural Language Processing Techniques
