Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer

Jiahao Cui; Hui Li; Yun Zhan; Hanlin Shang; Kaihui Cheng; Yuqi Ma; Shan Mu; Hang Zhou; Jingdong Wang; Siyu Zhu

arXiv:2412.00733·cs.CV·March 14, 2025

TL;DR

Hallo3 introduces a transformer-based video generative model that produces highly realistic and dynamic portrait animations, effectively handling diverse perspectives, backgrounds, and speech-driven motion, surpassing prior U-Net-based methods.

Contribution

The paper presents the first application of a pretrained transformer-based video model for portrait animation, with a novel identity reference network and speech conditioning mechanisms.

Findings

01. Outperforms prior methods in realism and diversity of portrait videos

02. Successfully handles non-frontal perspectives and dynamic backgrounds

03. Demonstrates strong generalization on benchmark and wild datasets

Abstract

Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we…

Code & Models

Repositories

fudan-generative-vision/hallo3 (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image Retrieval and Classification Techniques · Computer Graphics and Visualization Techniques