RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection
Fangyi Chen, Han Zhang, Zhantao Yang, Hao Chen, Kai Hu, Marios Savvides

TL;DR
RTGen generates region-text pairs at scale from image-caption data, using image inpainting for text-to-region generation and region-level captioning for region-to-text generation, and uses the resulting pairs to improve open-vocabulary object detection.
Contribution
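The paper introduces RTGen, which generates region-text pairs in two directions: text-to-region generation via image inpainting directed by a scene-aware inpainting guider, and region-to-text generation via multi-prompt region-level captioning with the best caption selected by CLIP similarity. The generated pairs are used to train open-vocabulary detectors.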
Findings
RTGen effectively generates high-quality region-text pairs.
Training on RTGen-generated data improves open-vocabulary detection performance.
RTGen outperforms existing methods in experiments.
Abstract
Open-vocabulary object detection (OVD) requires solid modeling of the region-semantic relationship, which could be learned from massive region-text pairs. However, such data is limited in practice due to significant annotation costs. In this work, we propose RTGen to generate scalable open-vocabulary region-text pairs and demonstrate its capability to boost the performance of open-vocabulary object detection. RTGen includes both text-to-region and region-to-text generation processes on scalable image-caption data. The text-to-region generation is powered by image inpainting, directed by our proposed scene-aware inpainting guider for overall layout harmony. For region-to-text generation, we perform multiple region-level image captioning with various prompts and select the best matching text according to CLIP similarity. To facilitate detection training on region-text pairs, we also…
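The caption-selection step described in the abstract (generate several captions for a region, then keep the one that best matches the region under CLIP similarity) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings are assumed to have been produced already by CLIP's image and text encoders, the toy vectors below are made up, and the names `cosine_similarity` and `select_best_caption` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_caption(region_embedding, candidates):
    """Return (caption, score) for the candidate most similar to the region.

    candidates: list of (caption, text_embedding) pairs, one per prompt.
    Assumption: both embeddings live in the same joint space, as CLIP's do.
    """
    if not candidates:
        raise ValueError("at least one candidate caption is required")
    scored = [
        (cosine_similarity(region_embedding, text_embedding), caption)
        for caption, text_embedding in candidates
    ]
    best_score, best_caption = max(scored, key=lambda pair: pair[0])
    return best_caption, best_score


# Toy stand-ins for CLIP embeddings (hypothetical values, 3-D for readability).
region_embedding = [1.0, 0.0, 0.0]
candidates = [
    ("a brown dog on grass", [0.9, 0.1, 0.0]),
    ("a parked car", [0.0, 1.0, 0.0]),
    ("a photo", [0.5, 0.5, 0.5]),
]
best_caption, best_score = select_best_caption(region_embedding, candidates)
```

In a real pipeline the toy vectors would be replaced by embeddings of the cropped region and of each generated caption; the selection logic itself stays the same.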
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Inpainting · Contrastive Language-Image Pre-training
