Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model
Sheng Cheng, Maitreya Patel, Yezhou Yang

TL;DR
This paper investigates how caption precision and recall affect text-to-image model training, finding that precision is more impactful, and demonstrates that synthetic captions generated by large vision-language models can effectively substitute human annotations.
Contribution
The study provides a detailed analysis of caption quality metrics and introduces the use of synthetic captions from large vision-language models for training, showing comparable results to human-annotated data.
Findings
Precision has a greater impact on text-image alignment than recall.
Synthetic captions from vision-language models perform similarly to human-annotated captions.
Using synthetic data can reduce reliance on costly human annotations.
Abstract
Despite advancements in text-to-image models, generating images that precisely align with textual descriptions remains challenging due to misalignment in training data. In this paper, we analyze the critical role of caption precision and recall in text-to-image model training. Our analysis of human-annotated captions shows that both precision and recall are important for text-image alignment, but precision has a more significant impact. Leveraging these insights, we utilize Large Vision Language Models to generate synthetic captions for training. Models trained with these synthetic captions show similar behavior to those trained on human-annotated captions, underscoring the potential for synthetic data in text-to-image training.
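To make the two caption metrics concrete, here is a minimal sketch. This summary does not give the paper's exact definitions, so the set-based formulation below is an assumption: precision is taken as the share of elements mentioned in the caption that actually appear in the image, and recall as the share of image elements that the caption mentions. The function name `caption_precision_recall` and the example element lists are hypothetical.

```python
def caption_precision_recall(caption_items, image_items):
    """Set-based caption precision and recall (illustrative assumption,
    not necessarily the paper's definition).

    caption_items: elements (objects, attributes, relations) the caption mentions
    image_items:   elements actually present in the image
    """
    caption_set = set(caption_items)
    image_set = set(image_items)
    matched = caption_set & image_set

    # Precision: how much of what the caption says is true of the image.
    precision = len(matched) / len(caption_set) if caption_set else 0.0
    # Recall: how much of the image content the caption covers.
    recall = len(matched) / len(image_set) if image_set else 0.0
    return precision, recall


# Hypothetical example: the caption hallucinates "cat" and omits "grass" and "tree".
precision, recall = caption_precision_recall(
    ["dog", "ball", "cat"],
    ["dog", "ball", "grass", "tree"],
)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Under this reading, a hallucinated element lowers precision while an omitted element lowers recall, which is the distinction the abstract draws when it reports that precision matters more for text-image alignment.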
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: ALIGN
