EMO2: End-Effector Guided Audio-Driven Avatar Video Generation

Linrui Tian; Siqi Hu; Qi Wang; Bang Zhang; Liefeng Bo

arXiv:2501.10687·cs.CV·January 22, 2025

Open Access

TL;DR

This paper introduces EMO2, a two-stage audio-driven avatar video generation method that produces expressive facial expressions and hand gestures, outperforming existing approaches in quality and synchronization.

Contribution

The paper presents a two-stage framework for co-speech gesture generation: the first stage synthesizes hand poses from audio, and the second uses a diffusion model to synthesize video frames conditioned on those hand poses.

Findings

01

Outperforms state-of-the-art methods such as CyberHost and Vlogger.

02

Achieves higher visual quality and synchronization accuracy than those baselines.

03

Generates expressive facial expressions and hand gestures from audio.

Abstract

In this paper, we propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures. Unlike existing methods that focus on generating full-body or half-body poses, we investigate the challenges of co-speech gesture generation and identify the weak correspondence between audio features and full-body gestures as a key limitation. To address this, we redefine the task as a two-stage process. In the first stage, we generate hand poses directly from audio input, leveraging the strong correlation between audio signals and hand movements. In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements. Our experimental results demonstrate that the proposed method outperforms state-of-the-art…
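
The two-stage data flow described in the abstract can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the names `HandPose`, `Frame`, `stage1_audio_to_hands`, and `stage2_synthesize_video` are hypothetical, the audio features are assumed to be one scalar per frame, and both stages are toy stand-ins for the learned models (in particular, no diffusion model is run here).

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class HandPose:
    """Hypothetical end-effector state for one frame (toy 2D hand positions)."""
    left: Tuple[float, float]
    right: Tuple[float, float]


@dataclass
class Frame:
    """Stand-in for a synthesized video frame: records its conditioning signals."""
    index: int
    audio_feature: float
    hand_pose: HandPose


def stage1_audio_to_hands(audio_feats: Sequence[float]) -> List[HandPose]:
    """Stage 1 stand-in: map per-frame audio features to hand poses.

    A real system would use a learned motion model; this toy rule simply
    raises both hands in proportion to the audio feature value.
    """
    return [HandPose(left=(-0.3, a), right=(0.3, a)) for a in audio_feats]


def stage2_synthesize_video(
    audio_feats: Sequence[float], hand_poses: Sequence[HandPose]
) -> List[Frame]:
    """Stage 2 stand-in: build frames conditioned on audio and hand poses.

    The abstract assigns this role to a diffusion model; here we only
    attach the conditioning inputs each frame would receive.
    """
    if len(audio_feats) != len(hand_poses):
        raise ValueError("audio features and hand poses must align per frame")
    return [
        Frame(index=i, audio_feature=a, hand_pose=p)
        for i, (a, p) in enumerate(zip(audio_feats, hand_poses))
    ]


def generate(audio_feats: Sequence[float]) -> List[Frame]:
    """Run stage 1, then feed its hand poses into stage 2."""
    hand_poses = stage1_audio_to_hands(audio_feats)
    return stage2_synthesize_video(audio_feats, hand_poses)


if __name__ == "__main__":
    frames = generate([0.1, 0.5, 0.9])
    print(len(frames), frames[1].hand_pose)
```

The point of the sketch is the interface between stages: hand poses are produced from audio alone, and the video stage receives both the audio and those poses as conditioning.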

Taxonomy

Topics: Human Motion and Animation · Music Technology and Sound Studies

Methods: Diffusion · Focus