Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model

Sheng Cheng; Maitreya Patel; Yezhou Yang

arXiv:2411.05079·cs.CV·November 11, 2024

TL;DR

This paper investigates how caption precision and recall affect text-to-image model training, finding that precision is more impactful, and demonstrates that synthetic captions generated by large vision-language models can effectively substitute human annotations.

Contribution

The study provides a detailed analysis of caption quality metrics and introduces the use of synthetic captions from large vision-language models for training, showing comparable results to human-annotated data.

Findings

01. Precision has a greater impact on text-image alignment than recall.

02. Synthetic captions from vision-language models perform similarly to human-annotated captions.

03. Using synthetic data can reduce reliance on costly human annotations.

Abstract

Despite advancements in text-to-image models, generating images that precisely align with textual descriptions remains challenging due to misalignment in training data. In this paper, we analyze the critical role of caption precision and recall in text-to-image model training. Our analysis of human-annotated captions shows that both precision and recall are important for text-image alignment, but precision has a more significant impact. Leveraging these insights, we utilize Large Vision Language Models to generate synthetic captions for training. Models trained with these synthetic captions show similar behavior to those trained on human-annotated captions, underscoring the potential for synthetic data in text-to-image training.
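
As a rough illustration of the two quantities the abstract analyzes, the sketch below treats a caption and an image as sets of described items. Under that assumption (a simplification made here, not necessarily the paper's exact definition), precision is the share of caption items that actually appear in the image, and recall is the share of image items that the caption mentions. The function name `caption_precision_recall` and the example items are hypothetical.

```python
def caption_precision_recall(caption_items, image_items):
    """Set-based precision and recall of a caption against an image.

    Assumption: both the caption and the image are reduced to sets of
    item labels (objects, attributes, relations). This is an illustrative
    simplification, not the paper's exact formulation.
    """
    caption_set = set(caption_items)
    image_set = set(image_items)

    # Items the caption mentions that are really in the image.
    correct = caption_set & image_set

    # Precision: how much of the caption is correct.
    precision = len(correct) / len(caption_set) if caption_set else 0.0
    # Recall: how much of the image the caption covers.
    recall = len(correct) / len(image_set) if image_set else 0.0
    return precision, recall


# Hypothetical example: the caption hallucinates "cat" and omits
# "grass" and "tree".
image_items = ["dog", "ball", "grass", "tree"]
caption_items = ["dog", "ball", "cat"]
precision, recall = caption_precision_recall(caption_items, image_items)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case the hallucinated item lowers precision, while the omitted items lower recall. A caption can therefore score high on one and low on the other, which is why the two are analyzed separately.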

Code & Models

Repositories

shengcheng/captions4t2i
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: ALIGN