How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions

Manuel Brack; Sudeep Katakol; Felix Friedrich; Patrick Schramowski; Hareesh Ravi; Kristian Kersting; Ajinkya Kale

arXiv:2506.16679 · cs.CV · June 23, 2025

TL;DR

This paper systematically investigates how different synthetic captioning strategies affect the performance of text-to-image models, revealing trade-offs between caption quality, diversity, and output bias.

Contribution

It provides the first comprehensive analysis of synthetic caption design choices and their impact on model performance and output characteristics.

Findings

01

High-quality dense captions improve text alignment.

02

Randomized caption lengths balance aesthetics and diversity.

03

Caption distribution shifts significantly affect output bias.
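
The second finding can be illustrated with a minimal sketch. The text here does not say how caption lengths were randomized, so everything below is an assumption rather than the authors' procedure: it takes one dense synthetic caption and keeps a random-length prefix of its sentences. The names `split_sentences` and `randomize_caption_length` and the `min_sentences` parameter are hypothetical.

```python
import random
import re


def split_sentences(caption: str) -> list[str]:
    """Split a caption on sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def randomize_caption_length(
    caption: str, rng: random.Random, min_sentences: int = 1
) -> str:
    """Keep a random-length prefix of the caption's sentences.

    Assumption: length is randomized by truncating a dense caption at a
    sentence boundary. Other schemes (e.g. prompting the captioner for a
    target length) are equally plausible.
    """
    sentences = split_sentences(caption)
    if not sentences:
        return caption
    lowest = min(min_sentences, len(sentences))
    keep = rng.randint(lowest, len(sentences))  # inclusive on both ends
    return " ".join(sentences[:keep])


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are reproducible
    dense = (
        "A dog runs on a beach. The sky is overcast. "
        "Waves break in the background."
    )
    for _ in range(3):
        print(randomize_caption_length(dense, rng))
```

Applied per training sample, a scheme like this would expose the model to short and long captions for the same kind of image, which is the general idea the finding refers to.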

Abstract

Training data is at the core of any successful text-to-image model. The quality and descriptiveness of image text are crucial to a model's performance. Given the noisiness and inconsistency of web-scraped datasets, recent work has shifted towards synthetic training captions. While this setup is generally believed to produce more capable models, the current literature does not provide any insights into its design choices. This study closes this gap by systematically investigating how different synthetic captioning strategies impact the downstream performance of text-to-image models. Our experiments demonstrate that dense, high-quality captions enhance text alignment but may introduce trade-offs in output aesthetics and diversity. Conversely, captions of randomized lengths yield balanced improvements across aesthetics and alignment without compromising sample diversity. We also demonstrate that…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Data Visualization and Analytics