Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation

Anna Deichler; Shivam Mehta; Simon Alexanderson; Jonas Beskow

arXiv:2309.05455·eess.AS·September 12, 2023

TL;DR

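This paper presents a diffusion-based system for generating semantically meaningful co-speech gestures. The system conditions gesture synthesis on a jointly learned embedding of speech and motion, and it achieved the highest human-likeness and speech appropriateness ratings among the entries submitted to the GENEA Challenge 2023.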

Contribution

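The paper introduces a contrastive speech and motion pretraining (CSMP) module that learns a joint embedding of speech and gesture. The module's output conditions the diffusion-based synthesis model, with the aim of improving the semantic coherence of the generated gestures.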

Findings

01 Achieved the highest human-likeness rating among the submitted entries

02 Achieved the highest speech appropriateness rating among the submitted entries

03 Demonstrated the effectiveness of the approach in the GENEA Challenge 2023

Abstract

This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of learning a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically aware co-speech gesture generation. Our entry achieved the highest human-likeness and the highest speech appropriateness ratings among the submitted entries. This indicates that our system is a promising approach for giving agents human-like co-speech gestures that carry semantic meaning.
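
The abstract does not spell out the CSMP training objective. As a rough illustration of what a contrastive objective over paired speech and motion embeddings can look like, here is a minimal NumPy sketch of a symmetric InfoNCE-style loss. It is an assumption made for illustration, not the authors' implementation: the function names, the temperature value, and the embedding shapes are all hypothetical, and the real module would sit on top of learned speech and motion encoders.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(speech_emb, motion_emb, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss over a batch of paired embeddings.

    speech_emb, motion_emb: arrays of shape (batch, dim); row i of each array
    is assumed to come from the same speech/gesture segment.
    temperature: illustrative value, not taken from the paper.
    """
    s = l2_normalize(speech_emb)
    m = l2_normalize(motion_emb)
    logits = s @ m.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(s))
    # Speech-to-motion: each speech row should pick out its own motion column.
    loss_s2m = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Motion-to-speech: each motion column should pick out its own speech row.
    loss_m2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2m + loss_m2s)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.normal(size=(8, 16))
    motion = speech + 0.01 * rng.normal(size=(8, 16))  # nearly aligned pairs
    print("aligned loss:   ", symmetric_contrastive_loss(speech, motion))
    print("mismatched loss:", symmetric_contrastive_loss(speech, np.roll(motion, 1, axis=0)))
```

Under this sketch, correctly paired embeddings produce a lower loss than mismatched ones, which is the property that pulls the two modalities toward a shared space. The resulting joint embedding would then be passed to the diffusion model as its conditioning signal, as the abstract describes.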
