AnyFace: Free-style Text-to-Face Synthesis and Manipulation
Jianxin Sun, Qiyao Deng, Qi Li, Muyi Sun, Min Ren, Zhenan Sun

TL;DR
AnyFace is a novel free-style text-to-face synthesis method that enables high-quality, diverse face generation and manipulation from arbitrary descriptions, expanding applications in the metaverse, social media, and forensics.
Contribution
The paper introduces a two-stream framework with CLIP-based feature extraction, a Cross Modal Distillation module, and a Diverse Triplet Loss for the first free-style text-to-face synthesis approach.
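The Diverse Triplet Loss builds on the standard triplet loss listed under Methods. The exact DT loss formulation is not given on this page, so the sketch below shows only the generic margin-based triplet loss, max(0, d(a, p) - d(a, n) + margin), with Euclidean distance on plain Python lists. It is not the paper's DT loss; the function names and the default margin of 0.2 are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 0.2) -> float:
    """Generic margin-based triplet loss (an assumption, NOT the paper's DT loss).

    Penalizes the anchor unless it is at least `margin` closer to the
    positive sample than to the negative sample.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    a = [0.0, 0.0]
    p = [0.0, 1.0]  # distance 1.0 from the anchor
    n = [3.0, 4.0]  # distance 5.0 from the anchor
    print(triplet_loss(a, p, n))              # 0.0: negative is already far enough
    print(triplet_loss(a, n, p, margin=0.5))  # 4.5: positive and negative swapped
```

Because of the hinge, a triplet whose negative is already farther than its positive by at least the margin contributes zero loss, so training effort concentrates on the harder triplets.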
Findings
Outperforms state-of-the-art methods on the CelebA-HQ and CelebA-Text-HQ datasets.
Achieves high-resolution, diverse, and constraint-free face synthesis.
Demonstrates broad applicability in various real-world scenarios.
Abstract
Existing text-to-image synthesis methods are generally only applicable to words in the training dataset. However, human faces are too variable to be described with a limited vocabulary. This paper therefore proposes AnyFace, the first free-style text-to-face method, enabling much wider open-world applications such as the metaverse, social media, cosmetics, and forensics. AnyFace has a novel two-stream framework for face image synthesis and manipulation given arbitrary descriptions of the human face. Specifically, one stream performs text-to-face generation and the other conducts face image reconstruction. Facial text and image features are extracted using the CLIP (Contrastive Language-Image Pre-training) encoders, and a collaborative Cross Modal Distillation (CMD) module is designed to align the linguistic and visual features across these two streams. Furthermore, a Diverse Triplet Loss (DT loss)…
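The abstract says the CMD module aligns linguistic and visual features across the two streams, but the excerpt does not give its formulation. As a hedged illustration of what aligning paired features can mean, the sketch below computes a simple mean cosine-distance loss between text-stream and image-stream feature vectors. This is a generic alignment objective swapped in for illustration, not the CMD module itself; the function names `cosine_similarity` and `alignment_loss` and the convention that `text_feats[i]` pairs with `image_feats[i]` are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero, equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_u * norm_v)


def alignment_loss(text_feats: Sequence[Sequence[float]],
                   image_feats: Sequence[Sequence[float]]) -> float:
    """Hypothetical alignment objective: mean (1 - cosine) over paired features.

    Assumes text_feats[i] and image_feats[i] describe the same face.
    Returns 0 when every pair points in the same direction and up to 2
    when a pair points in opposite directions.
    """
    if len(text_feats) != len(image_feats) or not text_feats:
        raise ValueError("need the same, non-zero number of text and image features")
    total = sum(1.0 - cosine_similarity(t, img)
                for t, img in zip(text_feats, image_feats))
    return total / len(text_feats)


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 2.0]]
    image = [[3.0, 0.0], [0.0, -1.0]]
    print(alignment_loss(text, image))  # (0 + 2) / 2 = 1.0
```

Cosine distance ignores vector magnitude, which is one common reason to prefer it when comparing embeddings produced by different encoders; whether the paper makes the same choice is not stated here.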
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
Methods: ALIGN · Contrastive Language-Image Pre-training · Triplet Loss
