FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation
Xinya Ji, Sebastian Weiss, Manuel Kansy, Jacek Naruniec, Xun Cao, Barbara Solenthaler, Derek Bradley

TL;DR
FastGHA is a feed-forward method that generates high-quality 3D Gaussian head avatars from only a few input images and animates them in real time. It is reported to outperform existing techniques in both efficiency and visual fidelity.
Contribution
The paper presents a novel feed-forward approach that generates 3D Gaussian head avatars from few images with real-time animation capabilities, using transformer-based fusion and dynamic deformation prediction.
Findings
Outperforms existing methods in rendering quality.
Supports real-time dynamic avatar animation.
Generates avatars feed-forward from only a few input images, without per-identity optimization at inference.
Abstract
Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high-fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or monocular videos with per-identity optimization during inference, limiting their scalability and ease of use on unseen subjects. To overcome these efficiency drawbacks, we propose FastGHA, a feed-forward method to generate high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images, and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representations with per-Gaussian features and introduce a lightweight MLP-based…
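The animation idea in the abstract (per-Gaussian features fed to a lightweight MLP) can be illustrated with a rough sketch. It assumes, beyond what the abstract states, that the MLP receives each Gaussian's feature vector concatenated with an expression code and returns a 3D position offset. The function names, dimensions, ReLU activation, bias-free layers, and random weights are all hypothetical, and the actual network may predict other quantities.

```python
import random

def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Build a tiny two-layer MLP with random weights (hypothetical, untrained)."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

    return {"w1": layer(in_dim, hidden_dim), "w2": layer(hidden_dim, out_dim)}

def mlp_forward(params, x):
    """Linear -> ReLU -> Linear, without biases."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in params["w1"]]
    return [sum(w * h for w, h in zip(row, hidden)) for row in params["w2"]]

def animate(positions, features, z_exp, params):
    """Shift each Gaussian center by an offset predicted from its feature and z_exp.

    Assumption: the same expression code is shared by all Gaussians, and only
    the position is deformed.
    """
    deformed = []
    for pos, feat in zip(positions, features):
        offset = mlp_forward(params, feat + z_exp)  # list concatenation
        deformed.append([p + d for p, d in zip(pos, offset)])
    return deformed

# Two Gaussians with 4-D features and a 3-D expression code (toy sizes).
params = make_mlp(in_dim=4 + 3, hidden_dim=8, out_dim=3)
positions = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
features = [[0.5, 0.5, 0.5, 0.5], [0.1, 0.1, 0.1, 0.1]]
new_positions = animate(positions, features, [0.2, -0.3, 0.4], params)
print(len(new_positions), len(new_positions[0]))
```

Because the per-frame work in this sketch is one small MLP evaluation per Gaussian, a design of this kind is cheap enough to be compatible with the real-time animation goal, although the sketch says nothing about the paper's measured speed.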
Peer Reviews
Decision·ICLR 2026 Poster
* The method demonstrates satisfactory few-shot reconstruction quality and visualization effects.
* The method's reconstruction design is reasonable, effectively integrating information from different views, and can be well driven by new expressions.
* This method may struggle with "unreasonable" user input, and the paper does not show such results. Moreover, "reasonable" input is difficult to define: for example, if the user provides no image from a particular viewpoint, the outcome is currently unknown.
* This method may struggle to handle an arbitrary number of input views. Because it relies on self-attention over V×H×W tokens, the computational cost of this part grows quadratically with the number of views.
* Due to the linear increase in the number of
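The scaling concern about multi-view attention can be made concrete with a small calculation. This is a sketch that assumes full self-attention over all V·H·W tokens; the 32×32 token grid and the function name are hypothetical and not taken from the paper. Under that assumption, the number of pairwise token interactions is the square of the token count.

```python
def attention_cost(num_views, height, width):
    """Return (token count, pairwise interactions) for full self-attention.

    Assumes every one of the V*H*W tokens attends to every other token.
    """
    tokens = num_views * height * width
    return tokens, tokens * tokens

for views in (2, 4, 8):
    tokens, pairs = attention_cost(views, 32, 32)
    print(views, tokens, pairs)
```

Doubling the number of views doubles the token count and quadruples the pairwise interactions, which is why the number of input views matters for this part of the model.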
- The task of accessible avatar creation from a few selfie images is important for many downstream applications. Therefore, the improved visual quality and faster animation speed immediately become more significant.
- Smart and simple usage of pretrained VAE weights for the encoder/decoder, which is also ablated to be beneficial. The same can be said for the VGGT supervision.
- Overall the architecture seems to be slightly simplified compared to Avat3r, which seems quite helpful for future impro
- The main weakness of the paper that I am still seeing is the limited novelty, since it mainly introduces some technical changes compared to Avat3r. However, the quality and animation speed are improved, and the paper is well evaluated. Therefore, we can be sure that the method solidly advances the field on such a highly relevant task, and I don't mind the limited novelty.
- Currently, the method is limited by FLAME expression codes. However, anything else would likely be out-of-scope for
- The paper presents a clear methodology and is well-structured.
- FastGHA employs a feed-forward approach to directly predict a per-pixel Gaussian representation. This design is explicitly chosen to enable instantaneous reconstruction of unseen subjects, avoiding lengthy per-identity optimization or template Gaussians rigged to 3DMMs required by many prior methods. Previous approaches struggled to handle fluffy elements like hair due to the rigging constraint.
- The method significantly address
- The paper identifies Avat3r as the "most related" state-of-the-art method. However, the crucial quantitative comparison table (Table 1) lacks the results for Avat3r on the NeRSemble dataset. This prevents a full comparison, especially given that FastGHA's best performance is achieved when training on both Ava-256 and NeRSemble ("Ours (both)").
- The animation pipeline depends on the accuracy of FLAME expression codes ($z_{exp}$) obtained using "off-the-shelf head tracking tools". Errors genera
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis
