RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection

Fangyi Chen; Han Zhang; Zhantao Yang; Hao Chen; Kai Hu; Marios Savvides

arXiv:2405.19854 · cs.CV · May 31, 2024

TL;DR

RTGen generates region-text pairs at scale from image-caption data, combining inpainting-based text-to-region generation with captioning-based region-to-text generation, and uses these pairs to improve open-vocabulary object detection.

Contribution

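The paper introduces RTGen, which generates region-text pairs in two directions: text-to-region via image inpainting directed by a scene-aware inpainting guider, and region-to-text via multi-prompt region captioning with the best caption selected by CLIP similarity. The generated pairs are used to improve open-vocabulary detection performance.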

Findings

01. RTGen effectively generates high-quality region-text pairs.

02. Using RTGen data boosts open-vocabulary detection accuracy.

03. RTGen outperforms existing methods in experiments.

Abstract

Open-vocabulary object detection (OVD) requires solid modeling of the region-semantic relationship, which could be learned from massive region-text pairs. However, such data is limited in practice due to significant annotation costs. In this work, we propose RTGen to generate scalable open-vocabulary region-text pairs and demonstrate its capability to boost the performance of open-vocabulary object detection. RTGen includes both text-to-region and region-to-text generation processes on scalable image-caption data. The text-to-region generation is powered by image inpainting, directed by our proposed scene-aware inpainting guider for overall layout harmony. For region-to-text generation, we perform multiple region-level image captioning with various prompts and select the best matching text according to CLIP similarity. To facilitate detection training on region-text pairs, we also…
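The region-to-text selection step described in the abstract (several candidate captions per region, keep the one that best matches by CLIP similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `select_best_caption` are hypothetical, and the embeddings are assumed to have been computed beforehand by a CLIP image encoder (for the region crop) and a CLIP text encoder (for each caption). Cosine similarity is assumed as the matching score.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_caption(
    region_embedding: Sequence[float],
    captions: List[str],
    caption_embeddings: List[Sequence[float]],
) -> Tuple[str, float]:
    """Return the candidate caption whose embedding is most similar to the region's.

    Assumption: embeddings come from a CLIP-style model that maps image
    regions and text into the same vector space; they are plain lists here
    so the sketch stays self-contained.
    """
    if not captions or len(captions) != len(caption_embeddings):
        raise ValueError("captions and caption_embeddings must be non-empty and aligned")
    scores = [cosine_similarity(region_embedding, emb) for emb in caption_embeddings]
    best_index = max(range(len(captions)), key=scores.__getitem__)
    return captions[best_index], scores[best_index]


# Toy example with made-up 3-dimensional vectors standing in for CLIP embeddings.
region = [1.0, 0.0, 0.0]
candidates = ["a dog", "a red car", "a tree"]
candidate_embeddings = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]]
best_caption, best_score = select_best_caption(region, candidates, candidate_embeddings)
print(best_caption)
```

In practice each candidate caption would come from a different captioning prompt for the same region, and the selected caption would be paired with the region's box to form one region-text training pair.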

Code & Models

Repositories

seermer/RTGen (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Inpainting · Contrastive Language-Image Pre-training