VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image
Sicheng Xu, Guojun Chen, Jiaolong Yang, Yizhong Zhang, Yu Deng, Steve Lin, Baining Guo

TL;DR
VASA-3D generates realistic, audio-driven 3D head avatars from a single portrait image and renders free-viewpoint video in real time for immersive applications.
Contribution
The paper conditions a 3D head model on the motion latent of VASA-1, then customizes that model to a single portrait image by optimizing it on video frames synthesized from the image.
Findings
Produces realistic 3D talking heads with subtle expression details.
Supports online generation of 512x512 videos at up to 75 FPS.
Outperforms prior methods in realism and efficiency.
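As a rough illustration of what the reported frame rate implies, the arithmetic below converts 75 FPS at 512x512 into a per-frame time budget and a pixel throughput. The helper name `frame_budget` is an assumption made for this sketch and does not come from the paper.

```python
def frame_budget(fps, width, height):
    """Return (milliseconds available per frame, pixels produced per second)."""
    ms_per_frame = 1000.0 / fps                    # time available to produce one frame
    pixels_per_second = int(width * height * fps)  # output pixels generated each second
    return ms_per_frame, pixels_per_second


# Figures quoted in the Findings: 512x512 video at up to 75 FPS.
ms, pps = frame_budget(75, 512, 512)
print(f"{ms:.2f} ms per frame, {pps:,} pixels per second")
# prints "13.33 ms per frame, 19,660,800 pixels per second"
```

At the quoted peak rate, each frame has to be produced in about 13 ms.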
Abstract
We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. To accurately model expression details, VASA-3D leverages the motion latent of VASA-1, a method that yields exceptional realism and vividness in 2D talking heads. A critical element of our work is translating this motion latent to 3D, which is accomplished by devising a 3D head model that is conditioned on the motion latent. Customization of this model to a single image is achieved through an optimization framework that employs numerous video frames of the reference head synthesized from the input image. The optimization takes various training losses robust to artifacts and limited pose coverage in the generated training data.…
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Face Recognition and Perception
