TL;DR
FastAvatar introduces a unified, fast 3D avatar reconstruction framework leveraging a large transformer model that efficiently integrates multi-view data to produce high-quality 3D Gaussian models within seconds.
Contribution
The paper presents FastAvatar, a novel transformer-based framework that enables rapid, high-quality 3D avatar reconstruction from diverse data sources using a single unified model.
Findings
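FastAvatar is reported to outperform existing baselines on PSNR, SSIM, and LPIPS across the 1-, 4-, 8-, and 16-frame input settings.
It reconstructs a 3D Gaussian Splatting avatar within seconds in a feedforward manner, without per-identity optimization.
A single unified model accepts a single image, multi-view observations, or monocular video as input.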
Abstract
Despite significant progress in 3D avatar reconstruction, it still faces challenges such as high time complexity, sensitivity to data quality, and low data utilization. We propose FastAvatar, a feedforward 3D avatar framework capable of flexibly leveraging diverse daily recordings (e.g., a single image, multi-view observations, or monocular video) to reconstruct a high-quality 3D Gaussian Splatting (3DGS) model within seconds, using only a single unified model. The core of FastAvatar is a Large Gaussian Reconstruction Transformer (LGRT) featuring three key designs: first, a 3DGS transformer aggregating multi-frame cues while injecting an initial 3D prompt to predict the corresponding registered canonical 3DGS representations; second, multi-granular guidance encoding (camera pose, expression coefficient, head pose) mitigating animation-induced misalignment for variable-length inputs; third,…
Peer Reviews
Decision: ICLR 2026 Poster
- FastAvatar uses a single unified model capable of processing diverse daily recordings, including a single image, multi-view observations, or monocular video. It demonstrates greater model flexibility and higher data utilization efficiency than other feedforward methods such as LAM or Avat3r.
- Sliced Fusion Loss is a key component of the FastAvatar framework. This loss enables the model $G$ to leverage richer information from multiple inputs and to handle an arbitrary number of fr

- Although the method handles variable-length inputs, the practical input size $N$ is explicitly limited. This constraint (at most 16 frames) suggests that processing longer videos (optimization-based methods previously required 30 seconds of footage at 25 fps) still requires sampling or chunking, potentially limiting true incremental modeling for extended footage.
- The precision of these proxy models (FLAME/3DMM) is known to be sensitive to limitations like representational capacity and data quality, ofte

- Practical motivation: Addresses fast, unified avatar reconstruction from variable-length inputs without per-identity optimization.
- Architecture: Extends VGGT with multi-granular encodings (pose, expression, camera) suitable for dynamic faces.
- Loss design: The sliced fusion and landmark tracking losses are reasonable choices for promoting frame consistency and alignment.
- Feedforward inference potentially enables better avatar generation compared to optimization-based methods.

- Dependence on external camera pose tracking: Unlike VGGT, which infers relative geometry and camera pose implicitly through attention, FastAvatar requires explicit camera parameters and FLAME-derived head/expression tracking as inputs. This reliance on external preprocessing weakens the method’s claim of being a fully feed-forward, generalizable system. In practice, the need for accurate tracking limits applicability in real-world scenarios (e.g., in-the-wild videos) and reduces the robustness

- The proposed LGRT architecture is effective in aggregating cues from multiple views or images and aligning variable-length inputs, achieving consistent geometric and appearance coherence across frames.
- The method shows substantial quantitative improvements in PSNR/SSIM/LPIPS, outperforming existing baselines across all view settings (1, 4, 8, and 16 frames).

- The tracking loss (Eq. 9) includes the term $y$, but its definition is missing. The paper should explicitly state how the ground-truth and predicted landmarks $y_{j,i}$, $\hat{y}_{j,i}$ are obtained.
- Similarly, Eq. 10 introduces $L_{\text{mask}}$, but no definition or explanation of this loss term is provided. A clarification of its role and formulation is necessary.
- The framework claims to support arbitrary input lengths, but experiments are limited to at most 16 views. It is unclear whether the mod
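
One review point above notes that inputs are capped at 16 frames, so a 30-second clip at 25 fps would need sampling or chunking before reconstruction. The sketch below only illustrates that arithmetic and two generic ways to reduce a clip to 16-frame inputs. It is not taken from the paper: the function names `uniform_sample` and `chunk`, the constant `MAX_FRAMES`, and the bin-centre sampling rule are all assumptions made for illustration, and the paper's actual frame-selection strategy may differ.

```python
from typing import List

# Per-pass frame limit quoted in the review; treated as a fixed constant here (assumption).
MAX_FRAMES = 16


def uniform_sample(num_frames: int, max_frames: int = MAX_FRAMES) -> List[int]:
    """Pick up to max_frames evenly spaced frame indices from a clip."""
    if num_frames <= 0:
        return []
    if num_frames <= max_frames:
        return list(range(num_frames))
    # Split the clip into max_frames equal-width bins and take each bin's centre.
    step = num_frames / max_frames
    return [int(step * (i + 0.5)) for i in range(max_frames)]


def chunk(num_frames: int, max_frames: int = MAX_FRAMES) -> List[range]:
    """Split a clip into consecutive windows of at most max_frames frames."""
    return [
        range(start, min(start + max_frames, num_frames))
        for start in range(0, num_frames, max_frames)
    ]


total = 30 * 25                  # 30 s at 25 fps = 750 frames
indices = uniform_sample(total)  # one 16-frame input covering the whole clip
windows = chunk(total)           # 47 windows; the last one holds only 14 frames
print(len(indices), indices[:3], len(windows))
```

Uniform sampling keeps a single forward pass but discards most frames. Chunking keeps every frame but produces many separate inputs whose outputs would still have to be merged, which is the limitation on incremental modeling that the reviewer raises.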
