TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio-Motion Embedding and Diffusion Interpolation

Haiyang Liu; Xingchao Yang; Tomoya Akiyama; Yuantian Huang; Qiaoge Li; Shigeru Kuriyama; Takafumi Taketomi

arXiv:2410.04221 · cs.CV · October 8, 2024 · 2 cites

Open Access

TL;DR

TANGO is a novel framework that generates realistic co-speech gesture videos by combining hierarchical audio-motion embedding, diffusion-based transition frame generation, and a graph retrieval system to ensure synchronization and visual quality.

Contribution

It introduces a hierarchical joint embedding space (AuMoCLIP) for better cross-modal alignment and a diffusion-based model (ACInterp) for high-quality transition frames in gesture video reenactment.

Findings

01 Outperforms existing methods in realism and synchronization

02 Achieves high-fidelity, audio-synchronized gesture videos

03 Effectively reduces visual artifacts in generated videos

Abstract

We present TANGO, a framework for generating co-speech body-gesture videos. Given a few-minute, single-speaker reference video and target speech audio, TANGO produces high-fidelity videos with synchronized body gestures. TANGO builds on Gesture Video Reenactment (GVR), which splits and retrieves video clips using a directed graph structure, representing video frames as nodes and valid transitions as edges. We address two key limitations of GVR: audio-motion misalignment and visual artifacts in GAN-generated transition frames. In particular, (i) we propose retrieving gestures using latent feature distance to improve cross-modal alignment. To ensure the latent features can effectively model the relationship between speech audio and gesture motion, we implement a hierarchical joint embedding space (AuMoCLIP); (ii) we introduce a diffusion-based model to generate high-quality…
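The graph-based retrieval described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the greedy walk, the cosine distance, the two-dimensional latents, and the names `cosine_distance` and `retrieve_path` are all assumptions made for illustration. It only shows the general idea of treating frames as nodes, valid transitions as directed edges, and choosing the next node by distance between an audio latent and each candidate's motion latent in a shared embedding space.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two 1-D vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieve_path(edges, motion_feats, audio_feats, start):
    """Greedy walk over a directed graph (hypothetical strategy).

    edges:        dict mapping node id -> list of nodes reachable by a valid transition
    motion_feats: dict mapping node id -> motion latent (assumed shared space)
    audio_feats:  sequence of audio latents, one per retrieval step
    start:        node id where the walk begins
    """
    path = [start]
    node = start
    for audio in audio_feats:
        candidates = edges.get(node, [])
        if not candidates:
            break  # dead end: no valid transition from this node
        # Pick the successor whose motion latent is closest to the audio latent.
        node = min(candidates, key=lambda n: cosine_distance(motion_feats[n], audio))
        path.append(node)
    return path


# Toy data: three frames with made-up 2-D latents and made-up transitions.
motion_feats = {
    0: np.array([1.0, 0.0]),
    1: np.array([0.0, 1.0]),
    2: np.array([1.0, 1.0]),
}
edges = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
audio_feats = [np.array([0.1, 1.0]), np.array([1.0, 0.9])]

path = retrieve_path(edges, motion_feats, audio_feats, start=0)
print(path)
```

A greedy choice is used here only to keep the sketch short; a real system could instead search for a path that minimizes the total distance over the whole utterance.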

Taxonomy

Topics: Speech and Audio Processing · Subtitles and Audiovisual Media · Video Analysis and Summarization

Methods: Diffusion