HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters
Yi Chen, Sen Liang, Zixiang Zhou, Ziyao Huang, Yifeng Ma, Junshu Tang, Qin Lin, Yuan Zhou, Qinglin Lu

TL;DR
HunyuanVideo-Avatar is a multimodal diffusion transformer model that generates high-fidelity, emotion-controllable, multi-character dialogue videos with improved consistency and realism, addressing key challenges in audio-driven human animation.
Contribution
It introduces three innovations: a character image injection module, an emotion transfer module, and a face-aware audio adapter, enabling dynamic, emotion-aligned, multi-character video generation.
Findings
Surpasses state-of-the-art methods on benchmark datasets
Generates realistic, emotion-aligned multi-character videos
Effective in dynamic and immersive scenarios
Abstract
Recent years have witnessed significant progress in audio-driven human animation. However, critical challenges remain in (i) generating highly dynamic videos while preserving character consistency, (ii) achieving precise emotion alignment between characters and audio, and (iii) enabling multi-character audio-driven animation. To address these challenges, we propose HunyuanVideo-Avatar, a multimodal diffusion transformer (MM-DiT)-based model capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos. Concretely, HunyuanVideo-Avatar introduces three key innovations: (i) A character image injection module replaces the conventional addition-based character conditioning scheme, eliminating the inherent condition mismatch between training and inference. This ensures dynamic motion and strong character consistency; (ii) An Audio…
Taxonomy
Topics: Human Motion and Animation
Methods: Diffusion · Adapter
