FaceGPT: Self-supervised Learning to Chat about 3D Human Faces
Haoran Wang, Mohit Mendiratta, Christian Theobalt, Adam Kortylewski

TL;DR
FaceGPT is a self-supervised framework that enables large vision-language models to reason about and generate 3D human faces from images and text, combining 3D face modeling with semantic understanding.
Contribution
It embeds 3D morphable face model parameters into a vision-language model's token space, enabling 3D face reconstruction and reasoning without requiring 3D annotations.
Findings
Achieves high-quality 3D face reconstructions from images and text.
Retains general visual instruction-following capabilities.
Operates fully self-supervised without 3D face annotations.
Abstract
We introduce FaceGPT, a self-supervised learning framework for Large Vision-Language Models (VLMs) to reason about 3D human faces from images and text. Typical 3D face reconstruction methods are specialized algorithms that lack semantic reasoning capabilities. FaceGPT overcomes this limitation by embedding the parameters of a 3D morphable face model (3DMM) into the token space of a VLM, enabling the generation of 3D faces from both textual and visual inputs. FaceGPT is trained in a self-supervised manner as a model-based autoencoder from in-the-wild images. In particular, the hidden state of the LLM is projected into 3DMM parameters and subsequently rendered as a 2D face image to guide the self-supervised learning process via image-based reconstruction. Without relying on expensive 3D annotations of human faces, FaceGPT obtains a detailed understanding of 3D human faces, while preserving…
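The model-based autoencoder loop described in the abstract (hidden state, projection to 3DMM parameters, rendering, image-based reconstruction loss) can be illustrated with a toy example. The sketch below is not the authors' implementation and makes strong simplifying assumptions: the "LLM hidden state" is a fixed random vector, the 3DMM is a small random linear model, and the differentiable image renderer is replaced by an orthographic projection of mesh vertices to 2D points. All names (`project_hidden`, `decode_3dmm`, `toy_render`, `fit`) and all sizes are hypothetical. What it does show is that the projection head can be trained from 2D observations alone, without ever seeing 3D ground truth.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are hypothetical toy values, not taken from the paper.
HIDDEN_DIM = 16   # size of the stand-in LLM hidden state
N_COEFFS = 8      # number of 3DMM coefficients
N_VERTS = 40      # number of mesh vertices

# Toy linear 3DMM (assumption): vertices = mean_shape + basis @ coeffs.
mean_shape = rng.normal(size=N_VERTS * 3)
basis = 0.1 * rng.normal(size=(N_VERTS * 3, N_COEFFS))
# Rows of the basis that influence only the x and y coordinates.
basis_xy = basis.reshape(N_VERTS, 3, N_COEFFS)[:, :2, :].reshape(-1, N_COEFFS)


def project_hidden(hidden, W, b):
    """Linear head mapping a hidden state to 3DMM coefficients."""
    return hidden @ W + b


def decode_3dmm(coeffs):
    """Turn 3DMM coefficients into an (N_VERTS, 3) vertex array."""
    return (mean_shape + basis @ coeffs).reshape(N_VERTS, 3)


def toy_render(vertices):
    """Stand-in for a differentiable renderer: orthographic projection (drop z)."""
    return vertices[:, :2]


def reconstruction_loss(pred_2d, target_2d):
    """Mean squared error between rendered and observed 2D points."""
    return float(np.mean((pred_2d - target_2d) ** 2))


def fit(hidden, target_2d, steps=300):
    """Train the projection head by gradient descent on the 2D loss only."""
    W = np.zeros((HIDDEN_DIM, N_COEFFS))
    b = np.zeros(N_COEFFS)
    # Step size derived from the curvature of this toy quadratic objective.
    hessian = (2.0 / basis_xy.shape[0]) * basis_xy.T @ basis_xy
    lr = 1.0 / (np.linalg.eigvalsh(hessian).max() * (hidden @ hidden + 1.0))
    losses = []
    for _ in range(steps):
        pred_2d = toy_render(decode_3dmm(project_hidden(hidden, W, b)))
        losses.append(reconstruction_loss(pred_2d, target_2d))
        residual = (pred_2d - target_2d).reshape(-1)
        # Analytic gradient of the loss with respect to the coefficients.
        grad_coeffs = (2.0 / residual.size) * basis_xy.T @ residual
        W -= lr * np.outer(hidden, grad_coeffs)
        b -= lr * grad_coeffs
    return W, b, losses


hidden = rng.normal(size=HIDDEN_DIM)        # stand-in for an LLM hidden state
true_coeffs = rng.normal(size=N_COEFFS)     # used only to synthesize the observation
target_2d = toy_render(decode_3dmm(true_coeffs))

W, b, losses = fit(hidden, target_2d)
print("initial loss:", losses[0], "final loss:", losses[-1])
```

The supervision signal here is only `target_2d`, the 2D observation; `true_coeffs` is never passed to `fit`. A real system would replace the random hidden state with the VLM's output, the random basis with an actual 3DMM, and the projection with a differentiable image renderer and photometric losses.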
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The combination of VLMs with face-related tasks has not been explored in the literature and, as instantiated in this paper, presents some novelty. Moreover, training the VLM with a face reconstruction objective in a self-supervised manner bears some degree of novelty.
Unfortunately, I cannot grasp the motivation behind the proposed work, as at the end of the day it boils down to how to fine-tune a VLM for 3D face reconstruction. But there are already several highly accurate state-of-the-art methods for this task. Similarly, there are already several methods for text-driven face generation. It is not clear whether the proposed method is any better than methods tailored to these tasks. Importantly, these are vision tasks, so it is unclear why a VLM is needed and what
1. The paper is well written and easy to follow. 2. The paper proposed a framework that can leverage large VLMs to generate 3D faces from natural description of emotions. 3. The framework doesn't require any coupled text and 3D face data. 4. The framework achieved good 3D face reconstruction results.
1. Insufficient Justification for Using VLMs: The paper does not provide adequate justification for employing Visual Language Models (VLMs) in the 3D face synthesis task. The outcomes presented could potentially be replicated by many existing methods if trained under a conditional framework incorporating a CLIP text encoder along with detailed textual descriptions. 2. Subpar Quality of Generated Faces: The quality of the generated faces significantly lags behind the current state-of-the-art fac
This paper presents a valuable topic: constructing a unified model for generating 3D faces from both images and texts. Specifically, speculative face generation holds significant value in fields such as criminal tracking. The experiments also demonstrate the effectiveness of the constructed model in speculative face generation, explicit text-based 3D face generation, and image-based 3D face reconstruction.
The core idea of self-supervised learning is to set up proxy tasks that allow the model to train and capture the intrinsic structure and features of the data in the process. Although the paper claims to use a self-supervised learning framework, there seems to be some deviation from the conventional definition of self-supervised learning. Based on the details of training and data construction in the paper, the method employed appears to be a straightforward supervised learning approach, similar t
Taxonomy
Topics: Face recognition and analysis · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
