Harnessing Caption Detailness for Data-Efficient Text-to-Image Generation

Xinran Wang; Muxi Diao; Yuanzhi Liu; Chunyu Wang; Kongming Liang; Zhanyu Ma; Jun Guo

arXiv:2505.15172 · cs.CV · May 22, 2025

TL;DR

This paper introduces a new metric for caption detailness based on image coverage and object detailness, improving data efficiency and generation quality in text-to-image models by selecting more informative captions.

Contribution

It proposes a novel metric for caption detailness that outperforms length-based heuristics, enabling more effective data selection for T2I training.

Findings

01. Captions with a high image coverage rate (ICR) and high average object detailness (AOD) lead to better T2I performance.

02. Training on only 20% of the data, selected with the new metric, surpasses training on the full dataset.

03. Detailness-aware caption selection improves model alignment and reconstruction.

Abstract

Training text-to-image (T2I) models with detailed captions can significantly improve their generation quality. Existing methods often rely on simplistic metrics like caption length to represent the detailness of the caption in the T2I training set. In this paper, we propose a new metric to estimate caption detailness based on two aspects: image coverage rate (ICR), which evaluates whether the caption covers all regions/objects in the image, and average object detailness (AOD), which quantifies the detailness of each object's description. Through experiments on the COCO dataset using ShareGPT4V captions, we demonstrate that T2I models trained on high-ICR and high-AOD captions achieve superior performance on DPG and other benchmarks. Notably, our metric enables more effective data selection: training on only 20% of full data surpasses both full-dataset training and length-based selection…
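The abstract does not give formulas for ICR or AOD, so the following is only a rough sketch of how a detailness-based selection step could look, not the authors' implementation. Everything in it is an assumption made for illustration: the `AnnotatedObject` record, the area-weighted ICR proxy, the attribute-count AOD proxy, and the choice to rank samples by the product of the two scores.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AnnotatedObject:
    """One object in an image, with hypothetical per-object annotations."""
    area_fraction: float  # share of the image area the object occupies (0..1)
    mentioned: bool       # True if the caption refers to this object
    num_attributes: int   # descriptive attributes the caption gives the object


def image_coverage_rate(objects: List[AnnotatedObject]) -> float:
    """Assumed ICR proxy: mentioned object area divided by total object area."""
    total = sum(o.area_fraction for o in objects)
    if total == 0:
        return 0.0
    return sum(o.area_fraction for o in objects if o.mentioned) / total


def average_object_detailness(objects: List[AnnotatedObject]) -> float:
    """Assumed AOD proxy: mean attribute count over the mentioned objects."""
    mentioned = [o for o in objects if o.mentioned]
    if not mentioned:
        return 0.0
    return sum(o.num_attributes for o in mentioned) / len(mentioned)


def select_top_fraction(
    samples: Dict[str, List[AnnotatedObject]], fraction: float = 0.2
) -> List[str]:
    """Return the ids of the highest-scoring share of samples.

    The score ICR * AOD is an assumption; the source does not say how the
    two quantities are combined for selection.
    """
    scores = {
        sample_id: image_coverage_rate(objs) * average_object_detailness(objs)
        for sample_id, objs in samples.items()
    }
    keep = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]
```

Calling `select_top_fraction(samples, fraction=0.2)` on a dictionary of per-image annotations would mirror the 20% budget mentioned in the abstract. A length-based baseline would instead rank by caption token count, which ignores whether the extra words cover more of the image or describe objects more thoroughly.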

Taxonomy

Topics: Video Analysis and Summarization · Subtitles and Audiovisual Media · Multimodal Machine Learning Applications
