VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image

Sicheng Xu; Guojun Chen; Jiaolong Yang; Yizhong Zhang; Yu Deng; Steve Lin; Baining Guo

arXiv:2512.14677 · cs.CV · December 17, 2025

TL;DR

VASA-3D introduces a novel method for creating realistic, audio-driven 3D head avatars from a single image, enabling high-quality, real-time free-viewpoint video generation for immersive applications.

Contribution

The paper conditions a 3D head model on the motion latent of VASA-1, then customizes that model to a single portrait image through an optimization over video frames synthesized from the input image.

Findings

01 Produces realistic 3D talking heads with subtle expression details.
02 Supports online generation of 512x512 videos at up to 75 FPS.
03 Outperforms prior methods in realism and efficiency.
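
To put the reported throughput in concrete terms, the short sketch below works out the per-frame time budget and pixel rate implied by 512x512 output at 75 FPS. The helper names `frame_budget_ms` and `pixels_per_second` are illustrative and do not come from the paper; the numbers are plain arithmetic on the stated resolution and frame rate, not a measurement of the method.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


def pixels_per_second(width: int, height: int, fps: float) -> float:
    """Number of output pixels that must be produced each second."""
    return width * height * fps


# Figures taken from the Findings list: 512x512 video at up to 75 FPS.
WIDTH, HEIGHT, FPS = 512, 512, 75

budget = frame_budget_ms(FPS)
rate = pixels_per_second(WIDTH, HEIGHT, FPS)

print(f"Per-frame budget: {budget:.2f} ms")
print(f"Pixel rate: {rate:,.0f} pixels/s")
```

At 75 FPS, each frame has roughly 13.3 ms available, which is why the claim of online generation at that rate is notable.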

Abstract

We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. To accurately model expression details, VASA-3D leverages the motion latent of VASA-1, a method that yields exceptional realism and vividness in 2D talking heads. A critical element of our work is translating this motion latent to 3D, which is accomplished by devising a 3D head model that is conditioned on the motion latent. Customization of this model to a single image is achieved through an optimization framework that employs numerous video frames of the reference head synthesized from the input image. The optimization takes various training losses robust to artifacts and limited pose coverage in the generated training data.…

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Face Recognition and Perception