HolisticSemGes: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching
Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, and Zerrin Yumak

TL;DR
This paper introduces a contrastive flow-matching model for holistic co-speech gesture generation that improves semantic grounding and cross-modal coherence, outperforming existing methods on multiple datasets.
Contribution
The paper proposes a novel contrastive flow-matching approach that incorporates negative samples and joint embedding of text, audio, and motion for better semantic and cross-modal gesture generation.
Findings
Outperforms state-of-the-art methods on BEAT2 and SHOW datasets.
Effectively models iconic and metaphoric gestures with semantic grounding.
Ensures cross-modal coherence through joint embedding and contrastive learning.
Abstract
While the field of co-speech gesture generation has seen significant advances, producing holistic, semantically grounded gestures remains a challenge. Existing approaches rely on external semantic retrieval methods, which limit their generalisation capability due to dependency on predefined linguistic rules. Flow-matching-based methods produce promising results; however, the network is optimised using only semantically congruent samples without exposure to negative examples, so it learns rhythmic gestures rather than sparse motion, such as iconic and metaphoric gestures. Furthermore, by modelling body parts in isolation, the majority of methods fail to maintain cross-modal consistency. We introduce a Contrastive Flow Matching-based co-speech gesture generation model that uses mismatched audio-text conditions as negatives, training the velocity field to follow the correct motion…
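To make the idea of training a velocity field with negatives concrete, here is a minimal NumPy sketch of a generic contrastive flow-matching objective. It is not the paper's implementation: the function names, the weight `lam`, the straight-line interpolation path, and the batch-roll trick for building mismatched pairs are all illustrative assumptions. The positive term pulls the predicted velocity toward the flow of the matching motion; the negative term pushes it away from the flow of a motion that belongs to a different audio-text condition.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to motion x1 at time t in [0, 1].

    x0, x1: arrays of shape (B, D); t: array of shape (B,).
    """
    return (1.0 - t)[:, None] * x0 + t[:, None] * x1


def contrastive_flow_matching_loss(v_pred, x0, x1, x0_neg, x1_neg, lam=0.05):
    """Generic contrastive flow-matching loss (assumed form, not from the paper).

    v_pred: velocity predicted at x_t under the matched condition, shape (B, D).
    (x0, x1): noise/motion pair that matches the condition.
    (x0_neg, x1_neg): noise/motion pair taken from a mismatched condition.
    lam: hypothetical weight of the repulsive (negative) term.
    """
    target_pos = x1 - x0          # velocity of the matched straight-line flow
    target_neg = x1_neg - x0_neg  # velocity of the mismatched flow
    pos = np.mean(np.sum((v_pred - target_pos) ** 2, axis=-1))
    neg = np.mean(np.sum((v_pred - target_neg) ** 2, axis=-1))
    return pos - lam * neg


def roll_negatives(x):
    """Build mismatched samples by shifting the batch by one position (assumption)."""
    return np.roll(x, shift=1, axis=0)
```

In a training loop, `v_pred` would come from the network evaluated at `interpolate(x0, x1, t)` with the matched audio-text condition. Setting `lam=0` recovers the plain flow-matching loss, which is one way to check what the negative term contributes.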
