ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation

Cihang Peng; Qiming Hou; Zhong Ren; Kun Zhou

arXiv:2508.01008 · cs.CV · August 5, 2025

Open Access · 1 dataset

TL;DR

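ROVI is a large synthetic dataset for open-vocabulary instance-grounded text-to-image generation, built by labeling 1M curated web images with a re-captioning strategy: a VLM describes each image, an LLM extracts candidate categories from that description, and open-vocabulary detectors ground them. Training GLIGEN on ROVI improves instance grounding accuracy.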

Contribution

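The paper introduces re-captioning, a pre-detection step in which a VLM writes a comprehensive visual description of each image and an LLM reduces it to a flat list of candidate categories for open-vocabulary detectors. The resulting dataset contains two orders of magnitude more categories than existing detection datasets and exceeds them in image quality and resolution.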

Findings

01. ROVI exceeds existing detection datasets in image quality and resolution, and contains two orders of magnitude more categories.

02. Training GLIGEN on ROVI improves instance grounding accuracy.

03. ROVI enables better prompt fidelity and aesthetic quality in generated images.

Abstract

We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images. Our key innovation is a strategy called re-captioning, focusing on the pre-detection stage, where a VLM (Vision-Language Model) generates comprehensive visual descriptions that are then processed by an LLM (Large Language Model) to extract a flat list of potential categories for OVDs (Open-Vocabulary Detectors) to detect. This approach yields a global prompt inherently linked to instance annotations while capturing secondary visual elements humans typically overlook. Evaluations show that ROVI exceeds existing detection datasets in image quality and resolution while containing two orders of magnitude more categories with an open-vocabulary nature. For demonstrative purposes, a text-to-image model GLIGEN trained on ROVI significantly outperforms…
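The three-stage flow described in the abstract (VLM description, LLM category extraction, open-vocabulary detection) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `recaption_and_detect`, the `Instance` and `Annotation` types, the box format, and the `min_score` threshold are all hypothetical, and the three model stages are passed in as plain callables so that no real model API is assumed. The sketch also assumes that the VLM description doubles as the global prompt.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max). The paper may use another.
Box = Tuple[float, float, float, float]


@dataclass
class Instance:
    category: str
    box: Box
    score: float


@dataclass
class Annotation:
    global_prompt: str          # VLM description, reused as the image-level prompt
    instances: List[Instance]   # detections grounded in that description


def recaption_and_detect(
    image: Any,
    describe: Callable[[Any], str],                      # hypothetical VLM stage
    extract_categories: Callable[[str], List[str]],      # hypothetical LLM stage
    detect: Callable[[Any, List[str]], List[Instance]],  # hypothetical OVD stage
    min_score: float = 0.3,                              # assumed threshold
) -> Annotation:
    # Stage 1: the VLM writes a comprehensive visual description.
    description = describe(image)

    # Stage 2: the LLM turns the description into a flat category list.
    # Normalise and de-duplicate while preserving order.
    seen = set()
    categories: List[str] = []
    for raw in extract_categories(description):
        name = raw.strip().lower()
        if name and name not in seen:
            seen.add(name)
            categories.append(name)

    # Stage 3: the detector looks only for the extracted categories.
    detections = detect(image, categories)
    kept = [inst for inst in detections if inst.score >= min_score]

    return Annotation(global_prompt=description, instances=kept)


# Usage with stand-in stages (no real models involved).
def fake_vlm(image: Any) -> str:
    return "A dog sits on a red sofa next to a lamp."


def fake_llm(text: str) -> List[str]:
    return ["dog", "sofa", "lamp", "Dog"]


def fake_ovd(image: Any, categories: List[str]) -> List[Instance]:
    # Pretend the lamp is detected with low confidence.
    return [
        Instance(c, (0.0, 0.0, 1.0, 1.0), 0.1 if c == "lamp" else 0.9)
        for c in categories
    ]


result = recaption_and_detect(None, fake_vlm, fake_llm, fake_ovd)
```

Because the category list is derived from the same description that serves as the global prompt, every kept instance label traces back to text in that prompt, which is one way to read the abstract's claim that the prompt is "inherently linked to instance annotations".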

Code & Models

Datasets

CHang/ROVI
dataset · 183 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques