Modeling Turn-Taking with Semantically Informed Gestures
Varsha Suresh, M. Hamza Mughal, Christian Theobalt, Vera Demberg

TL;DR
This paper introduces a new multimodal dataset with semantic gesture annotations and demonstrates that including gestures improves turn-taking prediction in conversation models.
Contribution
The study presents DnD Gesture++, an extension of the multi-party DnD Gesture corpus with 2,663 semantic gesture annotations, and a Mixture-of-Experts model that integrates gestures with text and audio for turn-taking prediction.
Findings
Gestures provide complementary cues for turn-taking.
Incorporating gestures yields consistent performance gains over baselines in turn-taking prediction.
Semantic gesture annotations enhance model performance.
Abstract
In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.
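The abstract names a Mixture-of-Experts framework over text, audio, and gestures but does not spell out the architecture. The sketch below is only an illustration of the general idea, not the authors' model: it assumes one linear expert per modality and a softmax gate over the concatenated features, and it outputs a probability of a turn shift. The function names, feature dimensions, and random inputs are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(dims, rng):
    # Hypothetical parameters: one (weight, bias) pair per modality expert,
    # plus a gate that maps concatenated features to one score per expert.
    return {
        "experts": [(rng.normal(scale=0.1, size=d), 0.0) for d in dims],
        "gate_w": rng.normal(scale=0.1, size=(sum(dims), len(dims))),
        "gate_b": np.zeros(len(dims)),
    }

def moe_turn_taking(text, audio, gesture, params):
    """Return (p_turn_shift, gate_weights) for a batch of feature vectors."""
    feats = [text, audio, gesture]
    # Each expert sees only its own modality and emits one logit per example.
    expert_logits = np.stack(
        [f @ w + b for f, (w, b) in zip(feats, params["experts"])], axis=-1
    )  # shape: (batch, 3)
    # The gate sees all modalities and outputs a distribution over experts.
    gate_in = np.concatenate(feats, axis=-1)
    gate = softmax(gate_in @ params["gate_w"] + params["gate_b"])  # (batch, 3)
    # Mix the expert logits with the gate weights, then squash to a probability.
    logit = (gate * expert_logits).sum(axis=-1)
    return 1.0 / (1.0 + np.exp(-logit)), gate

rng = np.random.default_rng(0)
dims = (8, 6, 4)  # assumed text, audio, gesture feature sizes
params = init_params(dims, rng)
batch = [rng.normal(size=(5, d)) for d in dims]
p, gate = moe_turn_taking(*batch, params)
```

Inspecting `gate` per example shows how much weight each modality's expert receives, which is one way to probe when gesture cues contribute to a prediction.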
Taxonomy
Topics: Hearing Impairment and Communication · Action Observation and Synchronization · Speech and dialogue systems
