Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation
Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, Jingdong Wang

TL;DR
Hallo2 advances portrait image animation by enabling long-duration, high-resolution (4K) video synthesis driven by audio and textual controls, overcoming previous limitations in temporal coherence and resolution.
Contribution
The paper introduces Hallo2, which it presents as the first method to produce 4K, hour-long, audio-driven portrait animations with semantic textual control, extending the capabilities of prior models such as Hallo.
Findings
Achieves 4K resolution in long-duration videos
Maintains visual consistency over extended durations
Outperforms existing methods in quality and controllability
Abstract
Recent advances in latent diffusion-based generative models for portrait image animation, such as Hallo, have achieved impressive results in short-duration video synthesis. In this paper, we present updates to Hallo, introducing several design enhancements to extend its capabilities. First, we extend the method to produce long-duration videos. To address substantial challenges such as appearance drift and temporal artifacts, we investigate augmentation strategies within the image space of conditional motion frames. Specifically, we introduce a patch-drop technique augmented with Gaussian noise to enhance visual consistency and temporal coherence over long duration. Second, we achieve 4K resolution portrait video generation. To accomplish this, we implement vector quantization of latent codes and apply temporal alignment techniques to maintain coherence across the temporal dimension. By…
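As a rough illustration of the patch-drop idea mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a conditioning motion frame stored as a float array of shape (H, W, C), adds Gaussian noise in image space, and then zeroes out randomly chosen square patches. The function name `patch_drop_with_noise`, the parameters `patch_size`, `drop_prob`, and `noise_std`, and the order in which noise and dropping are applied are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def patch_drop_with_noise(frame, patch_size=16, drop_prob=0.25,
                          noise_std=0.1, rng=None):
    """Hypothetical augmentation of one conditioning frame.

    frame: float array of shape (H, W, C).
    Returns the augmented frame and the boolean keep-mask of shape (H, W).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = frame.shape

    # Perturb the whole frame with zero-mean Gaussian noise (assumed step).
    noisy = frame + rng.normal(0.0, noise_std, size=frame.shape)

    # Draw one keep/drop decision per patch on a coarse grid.
    grid_h = -(-h // patch_size)  # ceiling division
    grid_w = -(-w // patch_size)
    keep = rng.random((grid_h, grid_w)) >= drop_prob

    # Expand the coarse grid to pixel resolution and crop to the frame size.
    mask = np.repeat(np.repeat(keep, patch_size, axis=0), patch_size, axis=1)
    mask = mask[:h, :w]

    # Dropped patches are set to zero; kept patches retain the noisy pixels.
    return noisy * mask[..., None], mask


if __name__ == "__main__":
    frame = np.ones((64, 64, 3), dtype=np.float32)
    augmented, mask = patch_drop_with_noise(frame, rng=np.random.default_rng(0))
    print(augmented.shape, mask.mean())
```

The intuition behind this kind of augmentation is that degrading the appearance detail in preceding frames discourages the model from copying their appearance, so that it relies more on the reference image for identity while still reading motion cues from the conditioning frames.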
Taxonomy
Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis
