Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction
Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, Matthias Nießner

TL;DR
Pixel3DMM leverages vision transformers and foundation model features to improve single-image 3D face reconstruction, achieving higher geometric accuracy across diverse expressions and ethnicities.
Contribution
The paper introduces Pixel3DMM, a novel approach combining vision transformers and foundation model features for enhanced 3D face reconstruction from a single image.
Findings
Outperforms baselines by over 15% in geometric accuracy.
Introduces a new benchmark with diverse expressions and ethnicities.
Employs a novel FLAME fitting optimization for 3DMM parameters.
Abstract
We address the 3D reconstruction of human faces from a single RGB image. To this end, we propose Pixel3DMM, a set of highly-generalized vision transformers which predict per-pixel geometric cues in order to constrain the optimization of a 3D morphable face model (3DMM). We exploit the latent features of the DINO foundation model, and introduce a tailored surface normal and uv-coordinate prediction head. We train our model by registering three high-quality 3D face datasets against the FLAME mesh topology, which results in a total of over 1,000 identities and 976K images. For 3D face reconstruction, we propose a FLAME fitting optimization that solves for the 3DMM parameters from the uv-coordinate and normal estimates. To evaluate our method, we introduce a new benchmark for single-image face reconstruction, which features highly diverse facial expressions, viewing angles, and ethnicities.…
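To make the idea of fitting 3DMM parameters to per-pixel cues concrete, here is a minimal sketch. It is not the paper's FLAME fitting optimization, which also uses the normal estimates and a nonlinear face model. The sketch assumes a toy linear morphable model, an orthographic camera, and pixel-to-vertex matches that are already known (for example, obtained by looking up each pixel's predicted uv-coordinate on the mesh). The function name and all parameters are hypothetical.

```python
import numpy as np


def fit_linear_3dmm(mean, basis, vertex_ids, pixels_xy, reg=1e-3):
    """Regularized least-squares fit of linear 3DMM coefficients.

    All inputs are illustrative assumptions, not the paper's interface:
      mean:       (V, 3) mean face shape
      basis:      (V, 3, K) linear shape basis
      vertex_ids: (N,) mesh vertex matched to each observed pixel
      pixels_xy:  (N, 2) observed 2D positions under an orthographic camera
      reg:        Tikhonov weight that keeps coefficients small
    """
    k = basis.shape[2]
    # Orthographic projection keeps only the x and y rows of each matched vertex.
    a = basis[vertex_ids, :2, :].reshape(-1, k)           # (2N, K)
    b = (pixels_xy - mean[vertex_ids, :2]).reshape(-1)    # (2N,)
    # Solve the regularized normal equations (A^T A + reg * I) p = A^T b.
    return np.linalg.solve(a.T @ a + reg * np.eye(k), a.T @ b)
```

With a linear model and a fixed camera the problem has a closed-form solution. A fit that also optimizes pose, expression, and normal agreement would generally require an iterative solver instead.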
Taxonomy
Methods: Attention Is All You Need · Layer Normalization · Softmax · Residual Connection · Linear Layer · Multi-Head Attention · Dense Connections · Vision Transformer · self-DIstillation with NO labels · Sparse Evolutionary Training
