Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation
Kaiwen Jiang, Xueting Li, Seonwook Park, Ravi Ramamoorthi, Shalini De Mello, Koki Nagano

TL;DR
This paper introduces a fast, expressive, 3D-consistent portrait animation method that distills knowledge from a 2D diffusion-based method into a lightweight feed-forward encoder, enabling real-time animation from a single image with quality comparable to state-of-the-art methods.
Contribution
It presents a novel distillation approach that combines the speed of 3D-aware methods with the expressive detail of 2D diffusion models, avoiding reliance on parametric face models.
Findings
Runs at 107.31 FPS for animation and pose control
Achieves quality comparable to state-of-the-art methods
Uses an efficient local fusion strategy for 3D structural and animation information
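To put the reported frame rate in context, the short sketch below converts 107.31 FPS into a per-frame time budget. The 30 FPS comparison target is an assumption chosen for illustration and does not come from the paper, and the function name is hypothetical. The hardware behind the reported figure is not stated in this summary.

```python
def frame_time_ms(fps: float) -> float:
    """Convert a frame rate into the time available per frame, in milliseconds."""
    return 1000.0 / fps

reported_fps = 107.31      # figure quoted in the Findings list
assumed_target_fps = 30.0  # assumption: a common real-time video target

latency_ms = frame_time_ms(reported_fps)
headroom = reported_fps / assumed_target_fps

print(f"{latency_ms:.2f} ms per frame")       # prints "9.32 ms per frame"
print(f"{headroom:.2f}x the assumed target")  # prints "3.58x the assumed target"
```

Under that assumed target, each frame takes roughly a third of the available budget.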
Abstract
Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios, such as digital twins or telepresence. In contrast, 3D-aware facial animation feed-forward methods -- built upon explicit 3D representations, such as neural radiance fields or Gaussian splatting -- ensure 3D consistency and achieve faster inference speed, but come with inferior expression details. In this paper, we aim to combine their strengths by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder, which instantly converts an in-the-wild single image into a 3D-consistent, fast yet expressive animatable representation. Our animation representation is decoupled from the face's 3D representation and learns motion…
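The general idea of distillation mentioned above, training a fast student to reproduce the outputs of a slower teacher, can be illustrated with a toy sketch. This is not the paper's method: the teacher here is a hypothetical one-dimensional function standing in for a slow model, the student is a two-parameter linear model, and the loss is a plain squared error. All names are assumptions made for illustration.

```python
import random

def teacher(x: float) -> float:
    # Hypothetical stand-in for a slow, high-quality model (not the paper's teacher).
    return 3.0 * x - 1.0

def distill(steps: int = 2000, lr: float = 0.05, seed: int = 0):
    """Fit a linear student w*x + b to the teacher's outputs with SGD."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)  # sample an input
        target = teacher(x)         # teacher output serves as supervision
        err = (w * x + b) - target  # student prediction error
        # Gradient step on the loss 0.5 * err**2
        w -= lr * err * x
        b -= lr * err
    return w, b

w, b = distill()
print(f"student: w={w:.3f}, b={b:.3f}")
```

After training, the student's parameters approach the teacher's (3 and -1), so the cheap model can replace the expensive one at inference time.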
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Emotion and Mood Recognition
