ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input

Hendric Voss; Stefan Kopp

arXiv:2510.17617·cs.HC·October 21, 2025

Open Access

TL;DR

ImaGGen introduces a zero-shot system that generates semantically meaningful co-speech gestures by integrating language and visual inputs, enhancing virtual agent expressiveness and human-agent communication.

Contribution

The paper presents a novel zero-shot gesture generation approach that combines language and image analysis to produce contextually appropriate semantic gestures without manual annotation.

Findings

01

Generated gestures improved object property identification in user studies.

02

The system effectively synthesizes iconic and deictic gestures aligned with speech.

03

Highlights the importance of visual context in gesture generation for communication.

Abstract

Human communication combines speech with expressive nonverbal cues such as hand gestures that serve manifold communicative functions. Yet current generative approaches to gesture synthesis are restricted to simple, repetitive beat gestures that accompany the rhythm of speaking but do not contribute to communicating semantic meaning. This paper tackles a core challenge in co-speech gesture synthesis: generating iconic or deictic gestures that are semantically coherent with a verbal utterance. Such gestures cannot be derived from language input alone, which inherently lacks the visual meaning that is often carried autonomously by gestures. We therefore introduce a zero-shot system that generates gestures from a given language input and is additionally informed by imagistic input, without manual annotation or human intervention. Our method integrates an image analysis pipeline that extracts…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Hand Gesture Recognition Systems