Improving face generation quality and prompt following with synthetic captions
Michail Tarasiou, Stylianos Moschoglou, Jiankang Deng, Stefanos Zafeiriou

TL;DR
This paper introduces a training-free pipeline to generate synthetic captions for face datasets, which, when used to fine-tune diffusion models, significantly improves the realism and prompt adherence in face generation.
Contribution
The authors propose a novel, training-free method to create synthetic captions for face images, enhancing diffusion models' ability to generate realistic human faces aligned with prompts.
Findings
Improved face generation quality and realism
Enhanced prompt adherence in generated images
Synthetic captions effectively fine-tune diffusion models
Abstract
Recent advancements in text-to-image generation using diffusion models have significantly improved the quality of generated images and expanded the ability to depict a wide range of objects. However, ensuring that these models adhere closely to the text prompts remains a considerable challenge. This issue is particularly pronounced when trying to generate photorealistic images of humans. Without significant prompt engineering efforts, models often produce unrealistic images and typically fail to incorporate the full extent of the prompt information. This limitation can be largely attributed to the nature of the captions accompanying the images used in training large-scale diffusion models, which typically prioritize contextual information over details related to the person's appearance. In this paper we address this issue by introducing a training-free pipeline designed to generate accurate…
Taxonomy
Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization
Methods: Diffusion
