FaceGPT: Self-supervised Learning to Chat about 3D Human Faces

Haoran Wang; Mohit Mendiratta; Christian Theobalt; Adam Kortylewski

arXiv:2406.07163·cs.CV·June 12, 2024

Open Access · 3 Reviews

TL;DR

FaceGPT is a self-supervised framework that enables large vision-language models to reason about and generate 3D human faces from images and text, combining 3D face modeling with semantic understanding.

Contribution

It embeds 3D morphable face model parameters into a vision-language model's token space, enabling 3D face reconstruction and reasoning without requiring 3D annotations.

Findings

01 Achieves high-quality 3D face reconstructions from images and text.

02 Retains general visual instruction-following capabilities.

03 Operates fully self-supervised without 3D face annotations.

Abstract

We introduce FaceGPT, a self-supervised learning framework for Large Vision-Language Models (VLMs) to reason about 3D human faces from images and text. Typical 3D face reconstruction methods are specialized algorithms that lack semantic reasoning capabilities. FaceGPT overcomes this limitation by embedding the parameters of a 3D morphable face model (3DMM) into the token space of a VLM, enabling the generation of 3D faces from both textual and visual inputs. FaceGPT is trained in a self-supervised manner as a model-based autoencoder from in-the-wild images. In particular, the hidden state of the LLM is projected into 3DMM parameters and subsequently rendered as a 2D face image to guide the self-supervised learning process via image-based reconstruction. Without relying on expensive 3D annotations of human faces, FaceGPT obtains a detailed understanding of 3D human faces, while preserving…
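The pipeline described in the abstract (hidden state, then 3DMM parameters, then a rendered image, then a reconstruction loss) can be illustrated with a small toy sketch. This is not the authors' implementation. It assumes a single linear projection head, a linear 3DMM of the form mean shape plus basis times coefficients, and it swaps real image rendering for a simple orthographic projection of vertices to 2D points, with a point-wise squared error standing in for the image-based loss. All function names, dimensions, and values are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def project_hidden_to_3dmm(hidden_state, weights, bias):
    """Hypothetical linear head: LLM hidden state -> 3DMM coefficients."""
    return [v + b for v, b in zip(matvec(weights, hidden_state), bias)]


def decode_3dmm(coeffs, mean_shape, basis):
    """Linear 3DMM (assumed form): flat xyz = mean_shape + basis @ coeffs."""
    offsets = matvec(basis, coeffs)
    flat = [m + o for m, o in zip(mean_shape, offsets)]
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def render_orthographic(vertices):
    """Stand-in for a renderer: drop z to get 2D points (no rasterization)."""
    return [(x, y) for x, y, _ in vertices]


def reconstruction_loss(rendered, target):
    """Mean squared distance between rendered and observed 2D points."""
    total = sum((rx - tx) ** 2 + (ry - ty) ** 2
                for (rx, ry), (tx, ty) in zip(rendered, target))
    return total / len(rendered)


if __name__ == "__main__":
    rng = random.Random(0)
    hidden_dim, n_coeffs, n_vertices = 8, 4, 5  # toy sizes, chosen arbitrarily

    hidden = [rng.gauss(0, 1) for _ in range(hidden_dim)]
    weights = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)]
               for _ in range(n_coeffs)]
    bias = [0.0] * n_coeffs
    mean_shape = [rng.gauss(0, 1) for _ in range(3 * n_vertices)]
    basis = [[rng.gauss(0, 0.1) for _ in range(n_coeffs)]
             for _ in range(3 * n_vertices)]

    # The "observed image" here is just the projected mean face.
    target = render_orthographic(
        decode_3dmm([0.0] * n_coeffs, mean_shape, basis))

    coeffs = project_hidden_to_3dmm(hidden, weights, bias)
    rendered = render_orthographic(decode_3dmm(coeffs, mean_shape, basis))
    print(f"reconstruction loss: {reconstruction_loss(rendered, target):.4f}")
```

In a real model-based autoencoder the loss would be backpropagated through a differentiable renderer into the projection head and the language model. The sketch only shows the forward pass, so that the data flow is visible without any 3D annotations entering the loss.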

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 4

Strengths

The combination of VLMs with face-related tasks has not been explored in literature and in its current instantiation in this paper presents some amount of novelty. Moreover, training the VLM with a face reconstruction training objective in a self-supervised manner bears some degree of novelty.

Weaknesses

Unfortunately, I cannot grasp the motivation behind the proposed work, as at the end of the day it boils down to how to fine-tune a VLM for 3D face reconstruction. But there are already several state-of-the-art methods of high accuracy for this task. Similarly, there are already several methods for text-driven face generation. It's not clear if the proposed method is any better than methods tailored to these tasks. Importantly, these are vision tasks so it is unclear why a VLM is needed and what

Reviewer 02 · Rating 5 · Confidence 5

Strengths

1. The paper is well written and easy to follow.
2. The paper proposed a framework that can leverage large VLMs to generate 3D faces from natural description of emotions.
3. The framework doesn't require any coupled text and 3D face data.
4. The framework achieved good 3D face reconstruction results.

Weaknesses

1. Insufficient Justification for Using VLMs: The paper does not provide adequate justification for employing Visual Language Models (VLMs) in the 3D face synthesis task. The outcomes presented could potentially be replicated by many existing methods if trained under a conditional framework incorporating a CLIP text encoder along with detailed textual descriptions.
2. Subpar Quality of Generated Faces: The quality of the generated faces significantly lags behind the current state-of-the-art fac

Reviewer 03 · Rating 3 · Confidence 4

Strengths

This paper presents a valuable topic: constructing a unified model for generating 3D faces from both images and texts. Specifically, speculative face generation holds significant value in fields such as criminal tracking. The experiments also demonstrate the effectiveness of the constructed model in speculative face generation, explicit text-based 3D face generation, and image-based 3D face reconstruction.

Weaknesses

The core idea of self-supervised learning is to set up proxy tasks that allow the model to train and capture the intrinsic structure and features of the data in the process. Although the paper claims to use a self-supervised learning framework, there seems to be some deviation from the conventional definition of self-supervised learning. Based on the details of training and data construction in the paper, the method employed appears to be a straightforward supervised learning approach, similar t

Taxonomy

Topics: Face recognition and analysis · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques