FaceLLM: A Multimodal Large Language Model for Face Understanding
Hatef Otroshi Shahreza, Sébastien Marcel

TL;DR
FaceLLM is a specialized multimodal large language model trained on a novel face-focused dataset, significantly improving performance on facial understanding tasks by leveraging synthetic supervision from language models.
Contribution
This work introduces FaceLLM and a new dataset, FairFaceGPT, enabling domain-specific facial understanding with weakly supervised question-answer pairs.
Findings
FaceLLM achieves state-of-the-art results on face-centric tasks.
The weakly supervised pipeline effectively generates high-quality training data.
Synthetic supervision enhances domain-specific multimodal model performance.
Abstract
Multimodal large language models (MLLMs) have shown remarkable performance in vision-language tasks. However, existing MLLMs are primarily trained on generic datasets, limiting their ability to reason on domain-specific visual cues such as those in facial images. In particular, tasks that require detailed understanding of facial structure, expression, emotion, and demographic features remain underexplored by MLLMs due to the lack of large-scale annotated face image-text datasets. In this work, we introduce FaceLLM, a multimodal large language model trained specifically for facial image understanding. To construct the training data, we propose a novel weakly supervised pipeline that uses ChatGPT with attribute-aware prompts to generate high-quality question-answer pairs based on images from the FairFace dataset. The resulting corpus, called FairFaceGPT, covers a diverse set of attributes…
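The attribute-aware prompting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `FaceRecord` type, the `build_attribute_aware_prompt` function, and the prompt wording are all hypothetical, and the only assumption taken from outside this page is that FairFace images carry age, gender, and race labels. The call to the language model itself is left out.

```python
from dataclasses import dataclass


@dataclass
class FaceRecord:
    """Hypothetical container for the FairFace-style labels of one image."""
    image_path: str
    age_group: str
    gender: str
    race: str


def build_attribute_aware_prompt(record: FaceRecord) -> str:
    """Compose a prompt that conditions the language model on known labels.

    The wording is illustrative only; the paper's actual prompts are not
    given on this page.
    """
    return (
        "You are given a face image with the following annotated attributes: "
        f"age group {record.age_group}, gender {record.gender}, "
        f"race {record.race}. "
        "Write question-answer pairs about the face that are consistent with "
        "these attributes and cover facial structure, expression, and emotion."
    )


# Example labels are made up for illustration.
record = FaceRecord("example.jpg", "20-29", "Female", "East Asian")
prompt = build_attribute_aware_prompt(record)
print(prompt)
```

Passing the known labels into the prompt is what makes the supervision weak rather than manual: the existing annotations constrain the generated answers, so no human writes the question-answer pairs directly.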
Taxonomy
Topics: Face recognition and analysis
