ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation
Cihang Peng, Qiming Hou, Zhong Ren, Kun Zhou

TL;DR
ROVI is a large synthetic dataset for open-vocabulary instance-grounded text-to-image generation, built by labeling 1M curated web images with a VLM-LLM re-captioning strategy applied before detection. A GLIGEN model trained on it shows improved instance grounding.
Contribution
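The paper introduces a re-captioning approach for dataset creation: a VLM writes a comprehensive description of each image, an LLM extracts a flat list of candidate categories from it, and open-vocabulary detectors then detect those categories. The resulting dataset contains two orders of magnitude more categories than existing detection datasets, with higher image quality and resolution.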
Findings
ROVI surpasses existing detection datasets in quality and category diversity.
Training GLIGEN on ROVI improves instance grounding accuracy.
ROVI enables better prompt fidelity and aesthetic quality in generated images.
Abstract
We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images. Our key innovation is a strategy called re-captioning, focusing on the pre-detection stage, where a VLM (Vision-Language Model) generates comprehensive visual descriptions that are then processed by an LLM (Large Language Model) to extract a flat list of potential categories for OVDs (Open-Vocabulary Detectors) to detect. This approach yields a global prompt inherently linked to instance annotations while capturing secondary visual elements humans typically overlook. Evaluations show that ROVI exceeds existing detection datasets in image quality and resolution while containing two orders of magnitude more categories with an open-vocabulary nature. For demonstrative purposes, a text-to-image model GLIGEN trained on ROVI significantly outperforms…
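The pre-detection flow described in the abstract (VLM description, then LLM category extraction, then open-vocabulary detection) can be sketched roughly as follows. This is a minimal illustration of the data flow only, not the authors' implementation: the names `recaption_and_label`, `Instance`, `Annotation`, the `min_score` threshold, the box convention, and the three stub models are all hypothetical, and using the VLM description directly as the global prompt is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Instance:
    category: str
    box: Box
    score: float


@dataclass
class Annotation:
    global_prompt: str          # description shared by all instances
    instances: List[Instance]   # detections tied to that description


def recaption_and_label(
    image: object,
    describe: Callable[[object], str],                       # stands in for a VLM
    extract_categories: Callable[[str], List[str]],          # stands in for an LLM
    detect: Callable[[object, List[str]], List[Instance]],   # stands in for an OVD
    min_score: float = 0.3,  # hypothetical confidence cut-off
) -> Annotation:
    # 1. The VLM produces a comprehensive visual description.
    description = describe(image)
    # 2. The LLM turns it into a flat category list; normalise and
    #    de-duplicate while preserving order.
    raw = extract_categories(description)
    categories = list(dict.fromkeys(c.strip().lower() for c in raw if c.strip()))
    # 3. The detector is queried with those categories only.
    instances = [i for i in detect(image, categories) if i.score >= min_score]
    return Annotation(global_prompt=description, instances=instances)


# Stub models so the sketch runs without any real VLM, LLM, or detector.
def fake_vlm(image: object) -> str:
    return "A red kite flies over a sandy beach near a small dog."


def fake_llm(text: str) -> List[str]:
    return ["kite", "beach", "dog", "Kite"]


def fake_ovd(image: object, categories: List[str]) -> List[Instance]:
    known = {
        "kite": ((10.0, 5.0, 60.0, 40.0), 0.9),
        "dog": ((70.0, 80.0, 110.0, 120.0), 0.2),
    }
    return [Instance(c, *known[c]) for c in categories if c in known]


annotation = recaption_and_label(None, fake_vlm, fake_llm, fake_ovd)
print(annotation.global_prompt)
print([i.category for i in annotation.instances])
```

Passing the three models in as callables keeps the sketch independent of any particular VLM, LLM, or detector. Because the category list is derived from the same description that serves as the prompt, every kept instance refers to something the prompt mentions.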
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques
