Harnessing Caption Detailness for Data-Efficient Text-to-Image Generation
Xinran Wang, Muxi Diao, Yuanzhi Liu, Chunyu Wang, Kongming Liang, Zhanyu Ma, Jun Guo

TL;DR
This paper introduces a caption-detailness metric built from two components, image coverage rate (ICR) and average object detailness (AOD). Selecting training captions with this metric improves the data efficiency and generation quality of text-to-image models.
Contribution
It proposes a novel metric for caption detailness that outperforms length-based heuristics, enabling more effective data selection for T2I training.
Findings
Captions with high ICR and high AOD lead to better T2I performance.
Training on the 20% of the data selected by the new metric surpasses both full-dataset training and length-based selection.
Detailness-aware caption selection improves model alignment and reconstruction.
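
The selection step behind the 20% finding can be illustrated with a minimal sketch. It assumes each training sample already carries a precomputed detailness score. The function name `select_top_fraction`, the `detail` field, and the toy data are hypothetical and not taken from the paper, which may rank or combine its scores differently.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def select_top_fraction(
    items: Sequence[T],
    score: Callable[[T], float],
    fraction: float = 0.2,
) -> list[T]:
    """Keep the highest-scoring `fraction` of items (hypothetical helper)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not items:
        return []
    # Keep at least one item so that tiny datasets do not end up empty.
    k = max(1, round(len(items) * fraction))
    # Sort from most to least detailed and keep the first k items.
    return sorted(items, key=score, reverse=True)[:k]


# Toy data: the "detail" values are made-up scores, not real measurements.
samples = [{"caption": f"caption {i}", "detail": float(i)} for i in range(10)]

# Selection by the (assumed) detailness score.
by_detail = select_top_fraction(samples, score=lambda s: s["detail"])

# Length-based baseline: rank by the number of words in the caption.
by_length = select_top_fraction(samples, score=lambda s: len(s["caption"].split()))
```

Passing the scoring function as an argument keeps the detailness-based and length-based selections directly comparable, since only the ranking key changes.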
Abstract
Training text-to-image (T2I) models with detailed captions can significantly improve their generation quality. Existing methods often rely on simplistic metrics such as caption length to represent the detailness of captions in the T2I training set. In this paper, we propose a new metric that estimates caption detailness from two aspects: image coverage rate (ICR), which evaluates whether the caption covers all regions/objects in the image, and average object detailness (AOD), which quantifies how detailed each object's description is. Through experiments on the COCO dataset using ShareGPT4V captions, we demonstrate that T2I models trained on captions with high ICR and high AOD achieve superior performance on DPG and other benchmarks. Notably, our metric enables more effective data selection: training on only 20% of the full data surpasses both full-dataset training and length-based selection…
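
The abstract does not give the exact formulas for ICR and AOD, so the following is only a rough sketch of how two such quantities could be computed. It assumes each image comes with object annotations that record a region area, whether the caption mentions the object, and how many descriptive attributes the caption gives it. The `ObjectAnnotation` class, its fields, and both formulas are illustrative assumptions, not the authors' definitions.

```python
from dataclasses import dataclass


@dataclass
class ObjectAnnotation:
    """Hypothetical per-object record; all fields are assumptions."""
    name: str
    area: float          # region area, e.g. in pixels
    mentioned: bool      # does the caption refer to this object?
    num_attributes: int  # descriptive attributes the caption gives it


def image_coverage_rate(objects: list[ObjectAnnotation]) -> float:
    """Assumed ICR: share of annotated area whose object is mentioned."""
    total = sum(o.area for o in objects)
    if total == 0:
        return 0.0
    covered = sum(o.area for o in objects if o.mentioned)
    return covered / total


def average_object_detailness(objects: list[ObjectAnnotation]) -> float:
    """Assumed AOD: mean attribute count over the mentioned objects."""
    mentioned = [o for o in objects if o.mentioned]
    if not mentioned:
        return 0.0
    return sum(o.num_attributes for o in mentioned) / len(mentioned)


# Toy example: the caption describes the dog and the ball but not the tree.
objects = [
    ObjectAnnotation("dog", area=300.0, mentioned=True, num_attributes=3),
    ObjectAnnotation("ball", area=100.0, mentioned=True, num_attributes=1),
    ObjectAnnotation("tree", area=600.0, mentioned=False, num_attributes=0),
]
icr = image_coverage_rate(objects)
aod = average_object_detailness(objects)
```

In this toy case a caption can score well on AOD while leaving most of the image undescribed, which is why the two aspects are measured separately.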
Taxonomy
Topics: Video Analysis and Summarization · Subtitles and Audiovisual Media · Multimodal Machine Learning Applications
