Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Yanxin Long; Jianhua Han; Runhui Huang; Xu Hang; Yi Zhu; Chunjing Xu; Xiaodan Liang

arXiv:2211.00849·cs.CV·August 1, 2023·1 cites

Open Access

TL;DR

This paper introduces a fine-grained visual-text prompt-driven self-training method (VTP-OVD) that improves open-vocabulary object detection by strengthening the alignment between visual features and text prompts, reaching 31.5% mAP on unseen COCO classes.

Contribution

It proposes a novel fine-grained alignment approach using learnable text prompts and a visual prompt module to adapt pre-trained VLMs for open-vocabulary detection.

Findings

01 Achieves 31.5% mAP on unseen COCO classes

02 Enhances fine-grained alignment between visual and textual features

03 Outperforms previous state-of-the-art methods in open-vocabulary detection

Abstract

Inspired by the success of vision-language models (VLMs) in zero-shot classification, recent works attempt to extend this line of work to object detection by leveraging the localization ability of pre-trained VLMs and generating pseudo labels for unseen classes in a self-training manner. However, since current VLMs are usually pre-trained by aligning sentence embeddings with global image embeddings, using them directly lacks the fine-grained alignment for object instances that is the core of detection. In this paper, we propose a simple but effective fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD) that introduces a fine-grained visual-text prompt adapting stage to enhance the current self-training paradigm with a more powerful fine-grained alignment. During the adapting stage, we enable the VLM to obtain fine-grained alignment by…
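
The pseudo-labeling step that the abstract refers to can be illustrated with a small sketch. This is not the authors' VTP-OVD implementation: it only shows the general idea of assigning a class name to a region when its embedding matches one text embedding confidently. The function name `pseudo_label_regions`, the `threshold` and `temperature` parameters, and the toy two-dimensional vectors are all assumptions made for illustration; in practice the embeddings would come from a pre-trained VLM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label_regions(region_embs, text_embs, class_names,
                         threshold=0.9, temperature=0.01):
    """Hypothetical helper: keep regions whose best class probability
    reaches `threshold` and return (region_index, class_name, score)."""
    labels = []
    for idx, region in enumerate(region_embs):
        # Temperature-scaled cosine similarity to every class text embedding.
        logits = [cosine(region, text) / temperature for text in text_embs]
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            labels.append((idx, class_names[best], probs[best]))
    return labels


# Toy stand-ins for VLM outputs (assumed values, not real embeddings).
class_names = ["cat", "umbrella"]
text_embs = [[1.0, 0.0], [0.0, 1.0]]
region_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

pseudo_labels = pseudo_label_regions(region_embs, text_embs, class_names)
for idx, name, score in pseudo_labels:
    print(idx, name, round(score, 3))
```

In this toy input the third region is equally similar to both class embeddings, so it falls below the threshold and receives no pseudo label. Ambiguous matches of this kind are what a stronger region-level alignment is meant to reduce.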

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling