PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction
Zhi-Yi Lin, Thomas Markhorst, Jouh Yeong Chew, Xucong Zhang

TL;DR
PolySLGen is an online framework that generates contextually appropriate, multimodal reactions in group interactions, incorporating speech, body motion, and social cues for more natural AI-human social engagement.
Contribution
PolySLGen introduces a pose fusion module and a social cue encoder that jointly aggregate motion and social signals from the group, so that multimodal reaction generation can model polyadic rather than only dyadic interactions.
Findings
PolySLGen outperforms state-of-the-art baselines in motion quality and speech alignment.
It produces more human-like, coherent reactions in group social scenarios.
These results are supported by both quantitative and qualitative evaluations.
Abstract
Human-like multimodal reaction generation is essential for natural group interactions between humans and embodied AI. However, existing approaches are limited to single-modality or speaking-only responses in dyadic interactions, making them unsuitable for realistic social scenarios. Many also overlook nonverbal cues and complex dynamics of polyadic interactions, both critical for engagement and conversational coherence. In this work, we present PolySLGen, an online framework for Polyadic multimodal Speaking and Listening reaction Generation. Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant, including speech, body motion, and speaking state score. To model group interactions effectively, we propose a pose fusion module and a social cue encoder that jointly aggregate motion and social signals…
