DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

Junming Chen; Yunfei Liu; Jianan Wang; Ailing Zeng; Yu Li; Qifeng Chen

arXiv:2401.04747 · cs.SD · April 9, 2024 · 1 citation

TL;DR

DiffSHEG introduces a diffusion-based model for real-time, synchronized 3D expression and gesture generation driven by speech, outperforming prior methods in quality and efficiency.

Contribution

The paper presents a diffusion-based transformer model for joint speech-driven expression and gesture generation of arbitrary length, together with an outpainting-based sampling strategy.
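The general idea of outpainting-style sampling for long sequences can be sketched as follows. This is a minimal toy under stated assumptions, not the paper's implementation: `toy_denoise_step`, `sample_clip`, `outpaint_sequence`, the clip length, and the overlap size are all hypothetical names and values chosen for illustration, and the "denoiser" is a simple numeric stand-in rather than a trained network. Each new clip is sampled with its first few frames pinned to the tail of the motion generated so far, so only the remaining frames are newly generated.

```python
import random

def toy_denoise_step(frames, rng):
    # Hypothetical stand-in for one reverse-diffusion step (NOT the paper's model):
    # shrink each value toward zero and add a little noise.
    return [[0.9 * v + 0.01 * rng.gauss(0.0, 1.0) for v in frame] for frame in frames]

def sample_clip(clip_len, dim, num_steps, rng, known_prefix=None):
    # Start from Gaussian noise and run the toy reverse process.
    frames = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(clip_len)]
    for _ in range(num_steps):
        frames = toy_denoise_step(frames, rng)
        if known_prefix is not None:
            # Pin the overlapping frames to the already generated motion.
            # Assumption: a real sampler may instead re-noise these frames
            # to the current noise level; this toy keeps them clean.
            frames[: len(known_prefix)] = [list(f) for f in known_prefix]
    return frames

def outpaint_sequence(total_len, clip_len, overlap, dim, num_steps=20, seed=0):
    if not 0 < overlap < clip_len:
        raise ValueError("overlap must satisfy 0 < overlap < clip_len")
    rng = random.Random(seed)
    # The first clip is sampled without any known frames.
    motion = sample_clip(clip_len, dim, num_steps, rng)
    while len(motion) < total_len:
        prefix = motion[-overlap:]  # tail of what has been generated so far
        clip = sample_clip(clip_len, dim, num_steps, rng, known_prefix=prefix)
        motion.extend(clip[overlap:])  # append only the newly generated frames
    return motion[:total_len]

motion = outpaint_sequence(total_len=100, clip_len=34, overlap=4, dim=8)
print(len(motion), len(motion[0]))
```

Because every clip after the first reuses frames that already exist, the sequence can be extended clip by clip to any requested length without resampling earlier motion.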

Findings

01 Achieves state-of-the-art quantitative and qualitative results

02 Produces high-quality, synchronized 3D expressions and gestures

03 Validated by a user study confirming superiority over prior approaches

Abstract

We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation of arbitrary length. While previous works focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation transformer enables uni-directional information flow from expression to gesture, facilitating improved matching of joint expression-gesture distributions. Furthermore, we introduce an outpainting-based sampling strategy for arbitrarily long sequence generation in diffusion models, offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality synchronized expression and gesture generation driven by speech. Evaluated on two public datasets, our approach achieves…
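The uni-directional information flow mentioned in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `expression_branch`, `gesture_branch`, and `joint_denoise_step` are hypothetical names, the arithmetic stands in for learned layers, and all feature vectors are assumed to share one length. The only point is the dependency structure: the expression branch never sees gesture inputs, while the gesture branch receives the expression prediction.

```python
def expression_branch(audio_feat, noisy_expr):
    # Sees audio features and its own noisy input only; no gesture input.
    return [0.5 * a + 0.5 * e for a, e in zip(audio_feat, noisy_expr)]

def gesture_branch(audio_feat, noisy_gest, expr_pred):
    # Sees audio features, its own noisy input, and the expression prediction.
    return [(a + g + e) / 3.0 for a, g, e in zip(audio_feat, noisy_gest, expr_pred)]

def joint_denoise_step(audio_feat, noisy_expr, noisy_gest):
    # Information flows one way: expression -> gesture, never the reverse.
    expr_pred = expression_branch(audio_feat, noisy_expr)
    gest_pred = gesture_branch(audio_feat, noisy_gest, expr_pred)
    return expr_pred, gest_pred

audio = [1.0, 2.0]
expr_a, gest_a = joint_denoise_step(audio, [0.0, 0.0], [0.0, 0.0])
expr_b, gest_b = joint_denoise_step(audio, [0.0, 0.0], [3.0, 3.0])
expr_c, gest_c = joint_denoise_step(audio, [2.0, 2.0], [0.0, 0.0])
print(expr_a == expr_b)  # changing the gesture input leaves expression unchanged
print(gest_a != gest_c)  # changing the expression input changes the gesture output
```

In this toy, perturbing the gesture input cannot alter the expression output, whereas perturbing the expression input does alter the gesture output, which is the one-way coupling the abstract describes.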

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Human Motion and Animation

Methods: Diffusion