Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction

Sam O'Connor Russell; Naomi Harte

arXiv:2505.21043·cs.CL·October 27, 2025

TL;DR

This paper presents MM-VAP, a multimodal predictive turn-taking model that combines speech with visual cues (facial expression, head pose and gaze). In videoconferencing interactions it reaches 84% hold/shift prediction accuracy, against 79% for the state-of-the-art audio-only model.

Contribution

Introduction of MM-VAP, the first comprehensive multimodal predictive turn-taking model integrating visual cues with speech for improved accuracy.

Findings

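01. MM-VAP outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy).

02. Visual cues improve turn-taking prediction; in the ablation study, facial expression features contribute the most to model performance.

03. When holds and shifts are grouped by the duration of silence between turns, MM-VAP outperforms the audio-only model across all durations.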

Abstract

Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work which aggregates all holds and shifts, we group by duration of silence between turns. This reveals that through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another,…
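
The grouping described in the abstract, where hold/shift accuracy is reported per silence duration instead of being aggregated over all events, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `accuracy_by_silence`, the bin edges and the six example events are all hypothetical and do not come from the paper.

```python
from bisect import bisect_right


def accuracy_by_silence(events, bin_edges):
    """Hold/shift accuracy per silence-duration bin.

    events:    iterable of (silence_seconds, true_label, predicted_label)
    bin_edges: sorted upper edges in seconds; an event with silence s falls
               into the first bin whose edge is strictly greater than s,
               or into a final open-ended bin if no edge is greater.
    Returns one accuracy per bin (None for an empty bin).
    """
    n_bins = len(bin_edges) + 1
    totals = [0] * n_bins
    hits = [0] * n_bins
    for silence_s, truth, pred in events:
        b = bisect_right(bin_edges, silence_s)  # index of the bin for this event
        totals[b] += 1
        hits[b] += int(truth == pred)
    return [h / t if t else None for h, t in zip(hits, totals)]


# Made-up events purely for illustration (not data from the paper).
events = [
    (0.10, "shift", "shift"),
    (0.20, "hold", "shift"),
    (0.40, "hold", "hold"),
    (0.80, "shift", "shift"),
    (1.50, "hold", "hold"),
    (1.70, "shift", "hold"),
]

# Hypothetical bin edges in seconds: <0.25, 0.25-0.5, 0.5-1.0, >=1.0
per_bin = accuracy_by_silence(events, bin_edges=[0.25, 0.5, 1.0])
overall = sum(t == p for _, t, p in events) / len(events)
print(per_bin, overall)
```

Reporting `per_bin` alongside `overall` shows why the grouping matters: an aggregate figure can hide bins where a model does noticeably worse, which a per-duration breakdown makes visible.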

Code & Models

Repositories

russelsa/mm-vap (PyTorch, official)

Taxonomy

Topics: Hearing Impairment and Communication · Language, Metaphor, and Cognition · Digital Communication and Language