Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters
Steven Hogue, Chenxu Zhang, Yapeng Tian, Xiaohu Guo

TL;DR
This paper introduces a unified diffusion-based model with adapters that jointly generates co-speech gestures and talking head movements, reducing complexity and parameter count while maintaining high-quality output.
Contribution
It presents a novel single-network architecture that models face and body movements together using shared weights and adapters, improving efficiency and coherence.
Findings
Maintains state-of-the-art performance in both co-speech gesture and talking head generation.
Significantly reduces the parameter count compared with using separate models for face and body.
Maintains high-quality, synchronized face and body motion generation.
Abstract
Recent advances in co-speech gesture and talking head generation have been impressive, yet most methods focus on only one of the two tasks. Those that attempt to generate both often rely on separate models or network modules, increasing training complexity and ignoring the inherent relationship between face and body movements. To address these challenges, we propose a novel model architecture that jointly generates face and body motions within a single network. This approach leverages shared weights between modalities, facilitated by adapters that enable adaptation to a common latent space. Our experiments demonstrate that the proposed framework not only maintains state-of-the-art co-speech gesture and talking head generation performance but also significantly reduces the number of parameters required.
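The parameter-saving argument in the abstract can be illustrated with a rough count. The sketch below is not the paper's architecture or its reported numbers: the layer sizes, the bottleneck-style adapter, and all function names are hypothetical assumptions chosen only to show why one shared trunk plus small per-modality adapters can need fewer parameters than two independent networks.

```python
# Illustrative parameter count: two separate networks vs. one shared trunk
# with small per-modality adapters. All sizes and names are hypothetical;
# they are not taken from the paper.

def linear_params(d_in, d_out):
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out

def trunk_params(d_model, n_layers):
    """A stack of square linear layers, standing in for a denoising network."""
    return n_layers * linear_params(d_model, d_model)

def adapter_params(d_model, d_bottleneck):
    """Bottleneck adapter (assumed design): down-projection, then up-projection."""
    return linear_params(d_model, d_bottleneck) + linear_params(d_bottleneck, d_model)

def separate_models(d_model, n_layers, n_modalities=2):
    """One full trunk per modality (e.g. face and body), nothing shared."""
    return n_modalities * trunk_params(d_model, n_layers)

def shared_with_adapters(d_model, n_layers, d_bottleneck, n_modalities=2):
    """One shared trunk, plus one adapter per modality at every layer."""
    adapters_per_layer = n_modalities * adapter_params(d_model, d_bottleneck)
    return trunk_params(d_model, n_layers) + n_layers * adapters_per_layer

# Hypothetical sizes, chosen only for illustration.
D_MODEL, N_LAYERS, D_BOTTLENECK = 512, 8, 64

separate = separate_models(D_MODEL, N_LAYERS)
shared = shared_with_adapters(D_MODEL, N_LAYERS, D_BOTTLENECK)

print(f"separate models:       {separate:,} parameters")
print(f"shared trunk+adapters: {shared:,} parameters")
print(f"ratio shared/separate: {shared / separate:.2f}")
```

The saving in this toy count depends on the adapters being much narrower than the trunk; with a bottleneck close to the model width, the advantage would shrink or disappear.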
Taxonomy
Topics: Face recognition and analysis
Methods: Focus
