Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction
Sam O'Connor Russell, Naomi Harte

TL;DR
This paper presents MM-VAP, a multimodal predictive turn-taking model that combines speech with visual cues (facial expression, head pose, and gaze). In videoconferencing interactions it reaches 84% hold/shift prediction accuracy, against 79% for a state-of-the-art audio-only model.
Contribution
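Introduces MM-VAP, a multimodal predictive turn-taking model that integrates visual cues (facial expression, head pose, and gaze) with speech, and evaluates it by duration of silence between turns rather than over all holds and shifts in aggregate.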
Findings
MM-VAP outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy)
In the ablation study, facial expression features contribute the most to model performance
With visual features included, MM-VAP outperforms the audio-only model across all durations of silence between turns
Abstract
Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM that combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work, which aggregates all holds and shifts, we group by duration of silence between turns. This reveals that through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another,…
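The evaluation idea in the abstract, grouping holds and shifts by the duration of silence between turns instead of pooling them, can be illustrated with a minimal sketch. This is not the authors' code: the function name `accuracy_by_silence`, the bin edges (0.5 s and 1.0 s), and the toy events are all assumptions made up for illustration, and the paper's actual bins and data may differ.

```python
from collections import defaultdict


def accuracy_by_silence(events, bin_edges):
    """Hold/shift accuracy per silence-duration bin.

    events: iterable of (silence_seconds, true_label, predicted_label).
    bin_edges: ascending upper edges in seconds (hypothetical values).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for silence_s, truth, pred in events:
        # Default to the open-ended last bin, then look for a tighter one.
        label = f">={bin_edges[-1]}s"
        lower = 0.0
        for upper in bin_edges:
            if silence_s < upper:
                label = f"{lower}-{upper}s"
                break
            lower = upper
        total[label] += 1
        correct[label] += int(truth == pred)
    return {label: correct[label] / total[label] for label in total}


# Toy events, invented for illustration only (not data from the paper).
events = [
    (0.1, "hold", "hold"),
    (0.2, "shift", "hold"),
    (0.6, "shift", "shift"),
    (0.8, "hold", "hold"),
    (1.5, "shift", "shift"),
]
bin_edges = [0.5, 1.0]  # assumed bin boundaries in seconds

result = accuracy_by_silence(events, bin_edges)
for label, acc in result.items():
    print(f"{label}: {acc:.2f}")
```

Reporting one accuracy per bin makes it visible whether a model's advantage holds for short gaps as well as long ones, which a single pooled accuracy figure would hide.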
Taxonomy
Topics: Hearing Impairment and Communication · Language, Metaphor, and Cognition · Digital Communication and Language
