EMO2: End-Effector Guided Audio-Driven Avatar Video Generation
Linrui Tian, Siqi Hu, Qi Wang, Bang Zhang, Liefeng Bo

TL;DR
This paper introduces EMO2, a two-stage audio-driven avatar video generation method that produces expressive facial expressions and hand gestures, outperforming existing approaches in quality and synchronization.
Contribution
The paper presents a two-stage framework for co-speech gesture generation: hand poses are first synthesized from audio, and a diffusion model then synthesizes the video frames conditioned on those hand poses.
Findings
Outperforms state-of-the-art methods such as CyberHost and Vlogger.
Achieves higher visual quality and synchronization accuracy than the compared methods.
Effectively generates expressive facial and hand gestures from audio.
Abstract
In this paper, we propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures. Unlike existing methods that focus on generating full-body or half-body poses, we investigate the challenges of co-speech gesture generation and identify the weak correspondence between audio features and full-body gestures as a key limitation. To address this, we redefine the task as a two-stage process. In the first stage, we generate hand poses directly from audio input, leveraging the strong correlation between audio signals and hand movements. In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements. Our experimental results demonstrate that the proposed method outperforms state-of-the-art…
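The two-stage split described in the abstract can be illustrated with a toy sketch. Everything below is hypothetical: the function names, the vector sizes, the sine-based pose generator, and the linear "denoiser" are placeholders chosen so the example runs with the standard library only, and conditioning the second stage on the audio feature as well as the hand pose is an assumption. None of it is taken from the EMO2 implementation; it only shows the data flow of audio to hand poses, then hand poses into an iterative, diffusion-style frame synthesis.

```python
import math
import random

# Hypothetical toy sizes; not taken from the paper.
N_HAND_DIMS = 4   # length of one hand-pose vector per frame
FRAME_SIZE = 8    # length of one flattened "video frame"


def audio_to_hand_poses(audio_features):
    """Stage 1 (placeholder): map each per-frame audio feature to a hand pose.

    A real system would use a learned motion model; this stand-in is a fixed,
    deterministic function of the audio feature.
    """
    return [
        [math.sin(feat * (k + 1)) for k in range(N_HAND_DIMS)]
        for feat in audio_features
    ]


def denoise_step(frame, pose, audio_feat, t, num_steps):
    """Placeholder denoiser: move the noisy frame toward a conditioning target.

    The target depends on the hand pose and (by assumption) the audio feature.
    The step size grows with t and reaches 1.0 on the final step.
    """
    target = [pose[i % N_HAND_DIMS] + audio_feat for i in range(FRAME_SIZE)]
    alpha = 1.0 / (num_steps - t)
    return [x + alpha * (tgt - x) for x, tgt in zip(frame, target)]


def synthesize_frames(audio_features, hand_poses, num_steps=10, seed=0):
    """Stage 2 (placeholder): start each frame from Gaussian noise and refine it
    iteratively, conditioned on that frame's hand pose."""
    rng = random.Random(seed)
    frames = []
    for feat, pose in zip(audio_features, hand_poses):
        frame = [rng.gauss(0.0, 1.0) for _ in range(FRAME_SIZE)]
        for t in range(num_steps):
            frame = denoise_step(frame, pose, feat, t, num_steps)
        frames.append(frame)
    return frames


def generate_avatar_video(audio_features):
    """Run both stages: audio -> hand poses -> pose-conditioned frames."""
    hand_poses = audio_to_hand_poses(audio_features)
    return synthesize_frames(audio_features, hand_poses)
```

Calling `generate_avatar_video([0.1, 0.5, 0.9])` returns one toy frame per audio feature. The point of the split is that the second stage never has to infer hand motion from audio directly; it receives the poses as an explicit condition.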
Taxonomy
Topics: Human Motion and Animation · Music Technology and Sound Studies
Methods: Diffusion · Focus
