The Role of Prosodic and Lexical Cues in Turn-Taking with Self-Supervised Speech Representations
Sam O'Connor Russell, Delphine Charuau, Naomi Harte

TL;DR
This study investigates how self-supervised speech representations utilize prosodic and lexical cues for turn-taking, revealing that either cue alone can support the task and that they are encoded with limited interdependence.
Contribution
The paper introduces a vocoder-based method to independently manipulate prosody and lexical cues in speech, enabling detailed analysis of their roles in turn-taking models based on self-supervised speech representations.
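As a rough illustration of what "keeping prosody while removing lexical content" can mean, the sketch below replaces a waveform with white noise that follows the original's frame-wise intensity envelope. This is not the paper's vocoder-based method: it preserves only intensity (not F0 or duration structure beyond the envelope), and the function names, frame sizes, and the 16 kHz sample rate implied by the defaults are all assumptions made for the example.

```python
import numpy as np

def frame_rms(signal, frame_len, hop):
    """Frame-wise RMS energy of a 1-D signal (hypothetical helper)."""
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms

def envelope_matched_noise(signal, frame_len=400, hop=160, seed=0):
    """Return white noise shaped by the intensity envelope of `signal`.

    Assumption: intensity alone stands in for prosody here; pitch is
    discarded, so this is far cruder than a vocoder-based manipulation.
    """
    rng = np.random.default_rng(seed)
    rms = frame_rms(signal, frame_len, hop)
    # Place each frame's RMS at the frame centre, then interpolate
    # to get one envelope value per sample.
    centers = np.arange(len(rms)) * hop + frame_len / 2
    envelope = np.interp(np.arange(len(signal)), centers, rms)
    noise = rng.standard_normal(len(signal))
    return envelope * noise
```

The output has the same length as the input, is loud where the input was loud and silent where it was silent, but carries no recoverable words. Feeding such a signal to a turn-taking model is one simple way to ask how much the model can do without lexical content, under the caveats above.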
Findings
Both prosodic and lexical cues support turn-taking independently.
Models can rely on prosody alone, offering privacy benefits.
Prosodic and lexical cues are encoded with limited interdependence in S3Rs.
Abstract
Fluid turn-taking remains a key challenge in human-robot interaction. Self-supervised speech representations (S3Rs) have driven many advances, but it remains unclear whether S3R-based turn-taking models rely on prosodic cues, lexical cues or both. We introduce a vocoder-based approach to control prosody and lexical cues in speech more cleanly than prior work. This allows us to probe the voice-activity projection model, an S3R-based turn-taking model. We find that prediction on prosody-matched, unintelligible noise is similar to accuracy on clean speech. This reveals both prosodic and lexical cues support turn-taking, but either can be used in isolation. Hence, future models may only require prosody, providing privacy and potential performance benefits. When either prosodic or lexical information is disrupted, the model exploits the other without further training, indicating they are…
Taxonomy
Topics: Social Robot Interaction and HRI · Speech and dialogue systems · Phonetics and Phonology Research
