Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation
Mingwang Xu, Hui Li, Qingkun Su, Hanlin Shang, Liwei Zhang, Ce Liu, Jingdong Wang, Yao Yao, Siyu Zhu

TL;DR
This paper introduces a hierarchical audio-driven visual synthesis method using diffusion models for realistic, synchronized, and personalized portrait animation, improving quality and motion diversity.
Contribution
It presents an end-to-end diffusion-based framework with hierarchical control for enhanced synchronization and personalization in portrait animation.
Findings
Improved lip synchronization accuracy
Enhanced image and video quality
Greater motion diversity and personalization
Abstract
The field of portrait image animation, driven by speech audio input, has experienced significant advancements in the generation of realistic and dynamic portraits. This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations within the framework of diffusion-based methodologies. Moving away from traditional paradigms that rely on parametric models for intermediate facial representations, our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module to enhance the precision of alignment between audio inputs and visual outputs, encompassing lip, expression, and pose motion. Our proposed network architecture seamlessly integrates diffusion-based generative models, a UNet-based denoiser, temporal alignment techniques, and a reference…
Taxonomy
Topics: Music and Audio Processing · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
