ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input
Hendric Voss, Stefan Kopp

TL;DR
ImaGGen introduces a zero-shot system that generates semantically meaningful co-speech gestures by integrating language and visual inputs, enhancing virtual agent expressiveness and human-agent communication.
Contribution
The paper presents a novel zero-shot gesture generation approach that combines language and image analysis to produce contextually appropriate semantic gestures without manual annotation.
Findings
Generated gestures improved object property identification in user studies.
The system synthesizes iconic and deictic gestures aligned with speech.
The results highlight the importance of visual context in gesture generation for communication.
Abstract
Human communication combines speech with expressive nonverbal cues such as hand gestures that serve manifold communicative functions. Yet, current generative gesture generation approaches are restricted to simple, repetitive beat gestures that accompany the rhythm of speaking but do not contribute to communicating semantic meaning. This paper tackles a core challenge in co-speech gesture synthesis: generating iconic or deictic gestures that are semantically coherent with a verbal utterance. Such gestures cannot be derived from language input alone, which inherently lacks the visual meaning that is often carried autonomously by gestures. We therefore introduce a zero-shot system that generates gestures from a given language input and additionally is informed by imagistic input, without manual annotation or human intervention. Our method integrates an image analysis pipeline that extracts…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Hand Gesture Recognition Systems
